Commit Graph

1292 Commits

Author SHA1 Message Date
Linus Torvalds
426b4ca2d6 fs.setgid.v6.0
-----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYvIYIgAKCRCRxhvAZXjc
 omE8AQDAZG2YjNJfMnUUaaWYO3+zTaHlQp7OQkQTXIHfcfViXQD4vPt3Wxh3rrF+
 J8fwNcWmXhSNei8HP6cA06QmSajnDQ==
 =GF9/
 -----END PGP SIGNATURE-----

Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull setgid updates from Christian Brauner:
 "This contains the work to move setgid stripping out of individual
  filesystems and into the VFS itself.

  Creating files that have both the S_IXGRP and S_ISGID bit raised in
  directories that themselves have the S_ISGID bit set requires
  additional privileges to avoid security issues.

  When a filesystem creates a new inode it needs to take care that the
  caller is either in the group of the newly created inode or they have
  CAP_FSETID in their current user namespace and are privileged over the
  parent directory of the new inode. If any of these two conditions is
  true then the S_ISGID bit can be raised for an S_IXGRP file and if not
  it needs to be stripped.

  However, there are several key issues with the current implementation:

   - S_ISGID stripping logic is entangled with umask stripping.

     For example, if the umask removes the S_IXGRP bit from the file
     about to be created then the S_ISGID bit will be kept.

     The inode_init_owner() helper is responsible for S_ISGID stripping
     and is called before posix_acl_create(). So we can end up with two
     different orderings:

     1. FS without POSIX ACL support

        First strip umask then strip S_ISGID in inode_init_owner().

        In other words, if a filesystem doesn't support or enable POSIX
        ACLs then umask stripping is done directly in the vfs before
        calling into the filesystem:

     2. FS with POSIX ACL support

        First strip S_ISGID in inode_init_owner() then strip umask in
        posix_acl_create().

        In other words, if the filesystem does support POSIX ACLs then
        unmask stripping may be done in the filesystem itself when
        calling posix_acl_create().

     Note that technically filesystems are free to impose their own
     ordering between posix_acl_create() and inode_init_owner() meaning
     that there's additional ordering issues that influence S_ISGID
     inheritance.

     (Note that the commit message of commit 1639a49ccd ("fs: move
     S_ISGID stripping into the vfs_*() helpers") gets the ordering
     between inode_init_owner() and posix_acl_create() the wrong way
     around. I realized this too late.)

   - Filesystems that don't rely on inode_init_owner() don't get S_ISGID
     stripping logic.

     While that may be intentional (e.g. network filesystems might just
     defer setgid stripping to a server) it is often just a security
     issue.

     Note that mandating the use of inode_init_owner() was proposed as
     an alternative solution but that wouldn't fix the ordering issues
     and there are examples such as afs where the use of
     inode_init_owner() isn't possible.

     In any case, we should also try the cleaner and generalized
     solution first before resorting to this approach.

   - We still have S_ISGID inheritance bugs years after the initial
     round of S_ISGID inheritance fixes:

       e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes")
       01ea173e10 ("xfs: fix up non-directory creation in SGID directories")
       fd84bfdddd ("ceph: fix up non-directory creation in SGID directories")

  All of this led us to conclude that the current state is too messy.
  While we won't be able to make it completely clean as
  posix_acl_create() is still a filesystem specific call we can improve
  the S_SIGD stripping situation quite a bit by hoisting it out of
  inode_init_owner() and into the respective vfs creation operations.

  The obvious advantage is that we don't need to rely on individual
  filesystems getting S_ISGID stripping right and instead can
  standardize the ordering between S_ISGID and umask stripping directly
  in the VFS.

  A few short implementation notes:

   - The stripping logic needs to happen in vfs_*() helpers for the sake
     of stacking filesystems such as overlayfs that rely on these
     helpers taking care of S_ISGID stripping.

   - Security hooks have never seen the mode as it is ultimately seen by
     the filesystem because of the ordering issue we mentioned. Nothing
     is changed for them. We simply continue to strip the umask before
     passing the mode down to the security hooks.

   - The following filesystems use inode_init_owner() and thus relied on
     S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs,
     hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs,
     overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs,
     bpf, tmpfs.

     We've audited all callchains as best as we could. More details can
     be found in the commit message to 1639a49ccd ("fs: move S_ISGID
     stripping into the vfs_*() helpers")"

* tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  ceph: rely on vfs for setgid stripping
  fs: move S_ISGID stripping into the vfs_*() helpers
  fs: Add missing umask strip in vfs_tmpfile
  fs: add mode_strip_sgid() helper
2022-08-09 09:52:28 -07:00
Yang Xu
1639a49ccd
fs: move S_ISGID stripping into the vfs_*() helpers
Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.

Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.

When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.

However, there are several key issues with the current implementation:

* S_ISGID stripping logic is entangled with umask stripping.

  If a filesystem doesn't support or enable POSIX ACLs then umask
  stripping is done directly in the vfs before calling into the
  filesystem.
  If the filesystem does support POSIX ACLs then unmask stripping may be
  done in the filesystem itself when calling posix_acl_create().

  Since umask stripping has an effect on S_ISGID inheritance, e.g., by
  stripping the S_IXGRP bit from the file to be created and all relevant
  filesystems have to call posix_acl_create() before inode_init_owner()
  where we currently take care of S_ISGID handling S_ISGID handling is
  order dependent. IOW, whether or not you get a setgid bit depends on
  POSIX ACLs and umask and in what order they are called.

  Note that technically filesystems are free to impose their own
  ordering between posix_acl_create() and inode_init_owner() meaning
  that there's additional ordering issues that influence S_SIGID
  inheritance.

* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
  stripping logic.

  While that may be intentional (e.g. network filesystems might just
  defer setgid stripping to a server) it is often just a security issue.

This is not just ugly it's unsustainably messy especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.

So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.

We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This has S_ISGID handling is
unaffected unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).

The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs hat have callchains like:

sys_mknod()
-> do_mknodat(mode)
   -> .mknod = ovl_mknod(mode)
      -> ovl_create(mode)
         -> vfs_mknod(mode)

get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is VFS security requirement.

Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.

The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.

All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.

Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.

In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:

* btrfs allows the creation of filesystem objects through various
  ioctls(). Snapshot creation literally takes a snapshot and so the mode
  is fully preserved and S_ISGID stripping doesn't apply.

  Creating a new subvolum relies on inode_init_owner() in
  btrfs_new_subvol_inode() but only creates directories and doesn't
  raise S_ISGID.

* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
  xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
  the actual extents ocfs2 uses a separate ioctl() that also creates the
  target file.

  Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
  inode_init_owner() to strip the S_ISGID bit. This is the only place
  where a filesystem needs to call mode_strip_sgid() directly but this
  is self-inflicted pain.

* spufs doesn't go through the vfs at all and doesn't use ioctl()s
  either. Instead it has a dedicated system call spufs_create() which
  allows the creation of filesystem objects. But spufs only creates
  directories and doesn't allo S_SIGID bits, i.e. it specifically only
  allows 0777 bits.

* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.

The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directories group
unconditionally. In these cases non of the filesystems call
inode_init_owner() and thus did never strip the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.

Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.

Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e10 ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdddd ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-21 11:34:16 +02:00
Yang Xu
ac6800e279
fs: Add missing umask strip in vfs_tmpfile
All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.

Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com
Fixes: 60545d0d46 ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: <stable@vger.kernel.org> # 4.19+
Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-07-19 15:13:20 +02:00
Al Viro
3bd8bc8971 step_into(): move fetching ->d_inode past handle_mounts()
... and lose messing with it in __follow_mount_rcu()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06 13:16:07 -04:00
Al Viro
4cb6402480 lookup_fast(): don't bother with inode
Note that validation of ->d_seq after ->d_inode fetch is gone, along
with fetching of ->d_inode itself.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06 13:16:03 -04:00
Al Viro
b16c001de0 follow_dotdot{,_rcu}(): don't bother with inode
step_into() will fetch it, TYVM.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06 13:15:59 -04:00
Al Viro
a4f5b52167 step_into(): lose inode argument
make handle_mounts() always fetch it.  This is just the first step -
the callers of step_into() will stop trying to calculate the sucker,
etc.

The passed value should be equal to dentry->d_inode in all cases;
in RCU mode - fetched after we'd sampled ->d_seq.  Might as well
fetch it here.  We do need to validate ->d_seq, which duplicates
the check currently done in lookup_fast(); that duplication will
go away shortly.

After that change handle_mounts() always ignores the initial value of
*inode and always sets it on success.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06 13:15:52 -04:00
Al Viro
03fa86e9f7 namei: stash the sampled ->d_seq into nameidata
New field: nd->next_seq.  Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered.  Used instead of an arseload
of local variables, arguments, etc.

step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.

There are two requirements for RCU pathwalk:
	1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
	2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.

The use of seq numbers is the way we achieve that.  Invariant we want
to maintain is:
	if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq

	For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged.  Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
	For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed.  Same reasoning as
for the previous case applies.
	For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk.  That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
	For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk.  The same reasoning as in the
previous case applies.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06 13:14:42 -04:00
Al Viro
6e18032715 namei: move clearing LOOKUP_RCU towards rcu_read_unlock()
try_to_unlazy()/try_to_unlazy_next() drop LOOKUP_RCU in the
very beginning and do rcu_read_unlock() only at the very end.
However, nothing done in between even looks at the flag in
question; might as well clear it at the same time we unlock.

Note that try_to_unlazy_next() used to call legitimize_mnt(),
which might drop/regain rcu_read_lock() in some cases.  This
is no longer true, so we really have rcu_read_lock() held
all along until the end.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-06 12:40:31 -04:00
Al Viro
7e4745a094 switch try_to_unlazy_next() to __legitimize_mnt()
The tricky case (__legitimize_mnt() failing after having grabbed
a reference) can be trivially dealt with by leaving nd->path.mnt
non-NULL, for terminate_walk() to drop it.

legitimize_mnt() becomes static after that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05 16:18:21 -04:00
Al Viro
51c6546c30 follow_dotdot{,_rcu}(): change calling conventions
Instead of returning NULL when we are in root, just make it return
the current position (and set *seqp and *inodep accordingly).
That collapses the calls of step_into() in handle_dots()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05 16:16:28 -04:00
Al Viro
82ef069805 namei: get rid of pointless unlikely(read_seqcount_retry(...))
read_seqcount_retry() et.al. are inlined and there's enough annotations
for compiler to figure out that those are unlikely to return non-zero.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05 16:16:16 -04:00
Al Viro
20aac6c609 __follow_mount_rcu(): verify that mount_lock remains unchanged
Validate mount_lock seqcount as soon as we cross into mount in RCU
mode.  Sure, ->mnt_root is pinned and will remain so until we
do rcu_read_unlock() anyway, and we will eventually fail to unlazy if
the mount_lock had been touched, but we might run into a hard error
(e.g. -ENOENT) before trying to unlazy.  And it's possible to end
up with RCU pathwalk racing with rename() and umount() in a way
that would fail with -ENOENT while non-RCU pathwalk would've
succeeded with any timings.

Once upon a time we hadn't needed that, but analysis had been subtle,
brittle and went out of window as soon as RENAME_EXCHANGE had been
added.

It's narrow, hard to hit and won't get you anything other than
stray -ENOENT that could be arranged in much easier way with the
same priveleges, but it's a bug all the same.

Cc: stable@kernel.org
X-sky-is-falling: unlikely
Fixes: da1ce0670c "vfs: add cross-rename"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-05 16:14:14 -04:00
Linus Torvalds
952923ddc0 Several cleanups in fs/namei.c.
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYpvrrgAKCRBZ7Krx/gZQ
 6+eZAP9r0c8E1UnUxRI32kQrzndJO2z9mKq6rI9D7GgJe+MtogEAlfynzklqbHzo
 fFvczeD0tzLu4aDExtGn6GNGf5gSpw4=
 =1blQ
 -----END PGP SIGNATURE-----

Merge tag 'pull-18-rc1-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs pathname updates from Al Viro:
 "Several cleanups in fs/namei.c"

* tag 'pull-18-rc1-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  namei: cleanup double word in comment
  get rid of dead code in legitimize_root()
  fs/namei.c:reserve_stack(): tidy up the call of try_to_unlazy()
2022-06-04 19:07:15 -07:00
Linus Torvalds
2c5ca23f74 overlayfs update for 5.19
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYo+nGwAKCRDh3BK/laaZ
 PBouAP0VBH/jygclzc42jlRkKjp+wJnF1FifpWOJEtTPiYqhtAD/UWjR/2Sy4TMT
 fRsw9N9/FXxcXShjg3U42fpCNSVEqgM=
 =oY4z
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:

 - Support idmapped layers in overlayfs (Christian Brauner)

 - Add a fix to exportfs that is relevant to open_by_handle_at(2) as
   well

 - Introduce new lookup helpers that allow passing mnt_userns into
   inode_permission()

* tag 'ovl-update-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: support idmapped layers
  ovl: handle idmappings in ovl_xattr_{g,s}et()
  ovl: handle idmappings in layer open helpers
  ovl: handle idmappings in ovl_permission()
  ovl: use ovl_copy_{real,upper}attr() wrappers
  ovl: store lower path in ovl_inode
  ovl: handle idmappings for layer lookup
  ovl: handle idmappings for layer fileattrs
  ovl: use ovl_path_getxattr() wrapper
  ovl: use ovl_lookup_upper() wrapper
  ovl: use ovl_do_notify_change() wrapper
  ovl: pass layer mnt to ovl_open_realfile()
  ovl: pass ofs to setattr operations
  ovl: handle idmappings in creation operations
  ovl: add ovl_upper_mnt_userns() wrapper
  ovl: pass ofs to creation operations
  ovl: use wrappers to all vfs_*xattr() calls
  exportfs: support idmapped mounts
  fs: add two trivial lookup helpers
2022-05-30 11:19:16 -07:00
Linus Torvalds
6f664045c8 Not a lot of material this cycle. Many singleton patches against various
subsystems.   Most notably some maintenance work in ocfs2 and initramfs.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo/6xQAKCRDdBJ7gKXxA
 jkD9AQCPczLBbRWpe1edL+5VHvel9ePoHQmvbHQnufdTh9rB5QEAu0Uilxz4q9cx
 xSZypNhj2n9f8FCYca/ZrZneBsTnAA8=
 =AJEO
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc updates from Andrew Morton:
 "The non-MM patch queue for this merge window.

  Not a lot of material this cycle. Many singleton patches against
  various subsystems. Most notably some maintenance work in ocfs2
  and initramfs"

* tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (65 commits)
  kcov: update pos before writing pc in trace function
  ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock
  ocfs2: dlmfs: don't clear USER_LOCK_ATTACHED when destroying lock
  fs/ntfs: remove redundant variable idx
  fat: remove time truncations in vfat_create/vfat_mkdir
  fat: report creation time in statx
  fat: ignore ctime updates, and keep ctime identical to mtime in memory
  fat: split fat_truncate_time() into separate functions
  MAINTAINERS: add Muchun as a memcg reviewer
  proc/sysctl: make protected_* world readable
  ia64: mca: drop redundant spinlock initialization
  tty: fix deadlock caused by calling printk() under tty_port->lock
  relay: remove redundant assignment to pointer buf
  fs/ntfs3: validate BOOT sectors_per_clusters
  lib/string_helpers: fix not adding strarray to device's resource list
  kernel/crash_core.c: remove redundant check of ck_cmdline
  ELF, uapi: fixup ELF_ST_TYPE definition
  ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
  ipc: update semtimedop() to use hrtimer
  ipc/sem: remove redundant assignments
  ...
2022-05-27 11:22:03 -07:00
Tom Rix
30476f7e6d namei: cleanup double word in comment
Remove the second 'to'.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19 23:26:29 -04:00
Al Viro
52dba645ca get rid of dead code in legitimize_root()
Combination of LOOKUP_IS_SCOPED and NULL nd->root.mnt is impossible
after successful path_init().  All places where ->root.mnt might
become NULL do that only if LOOKUP_IS_SCOPED is not there and
path_init() itself can return success without setting nd->root
only if ND_ROOT_PRESET had been set (in which case nd->root
had been set by caller and never changed) or if the name had
been a relative one *and* none of the bits in LOOKUP_IS_SCOPED
had been present.

Since all calls of legitimize_root() must be downstream of successful
path_init(), the check for !nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED)
is pure paranoia.

FWIW, it had been discussed (and agreed upon) with Aleksa back when
scoped lookups had been merged; looks like that had fallen through the
cracks back then.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19 23:26:29 -04:00
Al Viro
e5ca024e16 fs/namei.c:reserve_stack(): tidy up the call of try_to_unlazy()
!foo() != 0 is a strange way to spell !foo(); fallout from
"fs: make unlazy_walk() error handling consistent"...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-05-19 23:26:29 -04:00
Julius Hemanth Pitti
c7031c1440 proc/sysctl: make protected_* world readable
protected_* files have 600 permissions which prevents non-superuser from
reading them.

Container like "AWS greengrass" refuse to launch unless
protected_hardlinks and protected_symlinks are set.  When containers like
these run with "userns-remap" or "--user" mapping container's root to
non-superuser on host, they fail to run due to denied read access to these
files.

As these protections are hardly a secret, and do not possess any security
risk, making them world readable.

Though above greengrass usecase needs read access to only
protected_hardlinks and protected_symlinks files, setting all other
protected_* files to 644 to keep consistency.

Link: http://lkml.kernel.org/r/20200709235115.56954-1-jpitti@cisco.com
Fixes: 800179c9b8 ("fs: add link restrictions")
Signed-off-by: Julius Hemanth Pitti <jpitti@cisco.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:58:15 -07:00
Matthew Wilcox (Oracle)
27a7791344 namei: Call aops write_begin() and write_end() directly
pagecache_write_begin() and pagecache_write_end() are now trivial
wrappers, so call the aops directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:45 -04:00
Matthew Wilcox (Oracle)
2d878178ba namei: Convert page_symlink() to use memalloc_nofs_save()
Stop using AOP_FLAG_NOFS in favour of the scoped memory API.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:18 -04:00
Matthew Wilcox (Oracle)
56f5746c41 namei: Merge page_symlink() and __page_symlink()
There are no callers of __page_symlink() left, so we can remove that
entry point.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
2022-05-08 14:28:18 -04:00
Christian Brauner
00675017e0 fs: add two trivial lookup helpers
Similar to the addition of lookup_one() add a version of
lookup_one_unlocked() and lookup_one_positive_unlocked() that take
idmapped mounts into account. This is required to port overlay to
support idmapped base layers.

Cc: <linux-fsdevel@vger.kernel.org>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-04-28 16:31:10 +02:00
NeilBrown
b3d4650d82 VFS: filename_create(): fix incorrect intent.
When asked to create a path ending '/', but which is not to be a
directory (LOOKUP_DIRECTORY not set), filename_create() will never try
to create the file.  If it doesn't exist, -ENOENT is reported.

However, it still passes LOOKUP_CREATE|LOOKUP_EXCL to the filesystems
->lookup() function, even though there is no intent to create.  This is
misleading and can cause incorrect behaviour.

If you try

   ln -s foo /path/dir/

where 'dir' is a directory on an NFS filesystem which is not currently
known in the dcache, this will fail with ENOENT.

But as the name is not in the dcache, nfs_lookup gets called with
LOOKUP_CREATE|LOOKUP_EXCL and so it returns NULL without performing any
lookup, with the expectation that a subsequent call to create the target
will be made, and the lookup can be combined with the creation.  In the
case with a trailing '/' and no LOOKUP_DIRECTORY, that call is never
made.  Instead filename_create() sees that the dentry is not (yet)
positive and returns -ENOENT - even though the directory actually
exists.

So only set LOOKUP_CREATE|LOOKUP_EXCL if there really is an intent to
create, and use the absence of these flags to decide if -ENOENT should
be returned.

Note that filename_parentat() is only interested in LOOKUP_REVAL, so we
split that out and store it in 'reval_flag'.  __lookup_hash() then gets
reval_flag combined with whatever create flags were determined to be
needed.

Reviewed-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-14 15:53:43 -07:00
Linus Torvalds
4897e722b5 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmHz0QsACgkQnJ2qBz9k
 QNkN+AgA6XqWHKYyElfgJFt1UqaoNMz/Faz9H/+PKiBNSTf6/+67D+V7DFz6jJrv
 dDwHNzfDg9kR+pbAAPwhl2jfnQoUlsr019Hrqa5HpWlj5geVpbdunYUzS2WOkwqD
 /m+brcLgPdKb2uIysj6wOh9B7wa8V9ksl3EjQvvwaHaU0p1YLUqidVXucYvs8DUo
 bgXNaj9GmeysxnmU+aILotWuuXH2vOP4Q2Uk4qz3rN6xW9eEXtpQ4y7gWBp/GA8y
 Ia8FtFdQdvlSDOJYMdPOTBu5RB7gY9ElrapvVaWNYtCWI/jRv666nZsLaERYNhLN
 uUsG4MWjYbOqW5XqFDbSOwbDqvMh5Q==
 =mEFA
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fixes from Jan Kara:
 "Fixes for userspace breakage caused by fsnotify changes ~3 years ago
  and one fanotify cleanup"

* tag 'fsnotify_for_v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: fix fsnotify hooks in pseudo filesystems
  fsnotify: invalidate dcache before IN_DELETE event
  fanotify: remove variable set but not used
2022-01-28 17:51:31 +02:00
Amir Goldstein
a37d9a17f0 fsnotify: invalidate dcache before IN_DELETE event
Apparently, there are some applications that use IN_DELETE event as an
invalidation mechanism and expect that if they try to open a file with
the name reported with the delete event, that it should not contain the
content of the deleted file.

Commit 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of
d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
will have access to a positive dentry.

This allowed a race where opening the deleted file via cached dentry
is now possible after receiving the IN_DELETE event.

To fix the regression, create a new hook fsnotify_delete() that takes
the unlinked inode as an argument and use a helper d_delete_notify() to
pin the inode, so we can pass it to fsnotify_delete() after d_delete().

Backporting hint: this regression is from v5.3. Although patch will
apply with only trivial conflicts to v5.4 and v5.10, it won't build,
because fsnotify_delete() implementation is different in each of those
versions (see fsnotify_link()).

A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo
filesystem that do not need to call d_delete().

Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com
Reported-by: Ivan Delalande <colona@arista.com>
Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
Fixes: 49246466a9 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-01-24 14:16:46 +01:00
Luis Chamberlain
9c011be132 fs: move namei sysctls to its own file
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move namei's own sysctl knobs to its own file.

Other than the move we also avoid initializing two static variables to 0
as this is not needed:

  * sysctl_protected_symlinks
  * sysctl_protected_hardlinks

Link: https://lkml.kernel.org/r/20211129205548.605569-8-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-22 08:33:36 +02:00
David Howells
1bd9c4e4f0 vfs, cachefiles: Mark a backing file in use with an inode flag
Use an inode flag, S_KERNEL_FILE, to mark that a backing file is in use by
the kernel to prevent cachefiles or other kernel services from interfering
with that file.

Alter rmdir to reject attempts to remove a directory marked with this flag.
This is used by cachefiles to prevent cachefilesd from removing them.

Using S_SWAPFILE instead isn't really viable as that has other effects in
the I/O paths.

Changes
=======
ver #3:
 - Check for the object pointer being NULL in the tracepoints rather than
   the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819630256.215744.4815885535039369574.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906931596.143852.8642051223094013028.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967141000.1823006.12920680657559677789.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021541207.640689.564689725898537127.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:41:32 +00:00
Linus Torvalds
9ac211426f File locking changes for v5.16
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmF77NQTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFfP9D/wN8rCAPA2J6SpBjdJXSpSQS4PoAOqC
 Z002bOc7sq/zg2cWk+pX1aOR/+wUpk+PvaQdyvfO+o4TVCpsTOklRh/yfYuOkJdP
 PoINUR7vb43/CGqd04YI45+pxOFMJk9JoLkNha0uIY4ukXdt9mA6u/+qBEDboyDQ
 Jbn1JGitRc9WYaE7BV26ba0l+Deb5h2/4c1DiDlsgmLkDPhpowkOznovUCkBnH7H
 bfwlssjZ5P0K5ttZDw6VlkC2xE+Yr56BsEco2bXO42LwUHOx6r6ZNp04rh9zh1Zp
 hFPybgU+ur17EOOmBbCq8aHZqRRcjQQDH/rZ1heHSOfTrEWWth1xNjmeewSgRZHL
 0oMi3yIJPwvuDBQPEQg0VD87k5Z8xbRPql6eM6GeGxDZvzXWqqYKW2OYXtNxG91m
 bGvu2OOGkdF/4WGYBixdjUQb5KjcOqdIFkq3/oHfLQ+cS2uc6eOfnCdxa7cTnTdd
 BcFDgZmWQDLFs6/DIbwUI0KWMAiLFMZnZACE937JvlE74EGiHu47JMzwcU15J6zO
 VD0Oq0XsWQN+TgY2RnjuTFqm6DTpbrkgw88sNDr5g3jZbgJZiZ/r/3M55lcBVWvk
 PFT4fjKhD1mT6/SscAAmOxUKUeDbN7EktiRsZmH9C2sUCERufDb/cmY/RYZ00C4t
 01ovPjs7VukS/w==
 =bcaq
 -----END PGP SIGNATURE-----

Merge tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "Most of this is just follow-on cleanup work of documentation and
  comments from the mandatory locking removal in v5.15.

  The only real functional change is that LOCK_MAND flock() support is
  also being removed, as it has basically been non-functional since the
  v2.5 days"

* tag 'locks-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: remove leftover comments from mandatory locking removal
  locks: remove changelog comments
  docs: fs: locks.rst: update comment about mandatory file locking
  Documentation: remove reference to now removed mandatory-locking doc
  locks: remove LOCK_MAND flock lock support
2021-11-01 09:06:53 -07:00
Jeff Layton
482e00075d fs: remove leftover comments from mandatory locking removal
Stragglers from commit f7e33bdbd6 ("fs: remove mandatory file locking
support").

Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-10-26 12:20:50 -04:00
Al Viro
ea47ab1116 putname(): IS_ERR_OR_NULL() is wrong here
Mixing NULL and ERR_PTR() just in case is a Bad Idea(tm).  For
struct filename the former is wrong - failures are reported
as ERR_PTR(...), not as NULL.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07 16:14:05 -04:00
Stephen Brennan
b4a4f213a3 namei: Standardize callers of filename_create()
filename_create() has two variants, one which drops the caller's
reference to filename (filename_create) and one which does
not (__filename_create). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_create, rename __filename_create
to filename_create, and convert all callers.

Link: https://lore.kernel.org/linux-fsdevel/f6238254-35bd-7e97-5b27-21050c745874@oracle.com/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07 16:11:01 -04:00
Stephen Brennan
794ebcea86 namei: Standardize callers of filename_lookup()
filename_lookup() has two variants, one which drops the caller's
reference to filename (filename_lookup), and one which does
not (__filename_lookup). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_lookup, rename __filename_lookup
to filename_lookup, and convert all callers. The cost is a few slightly
longer functions, but the clarity is greater.

[AV: consuming a reference is not at all unusual, actually; look at
e.g. do_mkdirat(), for example.  It's more that we want non-consuming
variant for close relative of that function...]

Link: https://lore.kernel.org/linux-fsdevel/YS+dstZ3xfcLxhoB@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07 16:07:47 -04:00
Al Viro
c5f563f9e9 rename __filename_parentat() to filename_parentat()
... in separate commit, to avoid noise in previous one

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07 15:57:42 -04:00
Stephen Brennan
0766ec82e5 namei: Fix use after free in kern_path_locked
In 0ee50b4753 ("namei: change filename_parentat() calling
conventions"), filename_parentat() was made to always call putname() on
the  filename before returning, and kern_path_locked() was migrated to
this calling convention.  However, kern_path_locked() uses the "last"
parameter to lookup and potentially create a new dentry. The last
parameter contains the last component of the path and points within the
filename, which was recently freed at the end of filename_parentat().
Thus, when kern_path_locked() calls __lookup_hash(), it is using the
filename after it has already been freed.

In other words, these calling conventions had been wrong for the
only remaining caller of filename_parentat().  Everything else
is using __filename_parentat(), which does not drop the reference;
so should kern_path_locked().

Switch kern_path_locked() to use of __filename_parentat() and move
getting/dropping struct filename into wrapper.  Remove filename_parentat(),
now that we have no remaining callers.

Fixes: 0ee50b4753 ("namei: change filename_parentat() calling conventions")
Link: https://lore.kernel.org/linux-fsdevel/YS9D4AlEsaCxLFV0@infradead.org/
Link: https://lore.kernel.org/linux-fsdevel/YS+csMTV2tTXKg3s@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+fb0d60a179096e8c2731@syzkaller.appspotmail.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Co-authored-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-09-07 15:56:16 -04:00
Linus Torvalds
14726903c8 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03 10:08:28 -07:00
Hugh Dickins
51cc3a6620 fs, mm: fix race in unlinking swapfile
We had a recurring situation in which admin procedures setting up
swapfiles would race with test preparation clearing away swapfiles; and
just occasionally that got stuck on a swapfile "(deleted)" which could
never be swapped off.  That is not supposed to be possible.

2.6.28 commit f9454548e1 ("don't unlink an active swapfile") admitted
that it was leaving a race window open: now close it.

may_delete() makes the IS_SWAPFILE check (amongst many others) before
inode_lock has been taken on target: now repeat just that simple check in
vfs_unlink() and vfs_rename(), after taking inode_lock.

Which goes most of the way to fixing the race, but swapon() must also
check after it acquires inode_lock, that the file just opened has not
already been unlinked.

Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com
Fixes: f9454548e1 ("don't unlink an active swapfile")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:11 -07:00
Linus Torvalds
87045e6546 for-5.15-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEs2NIACgkQxWXV+ddt
 WDsJMQ/+PJ/yXfI85mAeAzTJLWQ0zD6YO3iBhf3wOeyychWC4on435pj+zW8zR/U
 /bix25ygoWF4MvGF6p0uyv4Z5mnvkZXE5lapUcJu6wXG7se1QRPH0broTh05IBXK
 SnT93Eb9RexaiNFk7DVma9XkviqZ/ZISPtkJ9wYrfIba7j/U/wa+PtEFS7wk58hP
 rFQXgV64xm/pcP28YYHfOkCjdyUMdJrnBUvfKOlX6d94lmYbP5lyiTL+XJEXExzN
 wPakD0UsnXPr4TRvf+YRTPeFHPPUgyORII7otVUOKmGywWtcJrELX8rXFoW+6GwB
 dzZIcSYXHUxU5UrtMbZgiztVBJ+bQY5juYMIrj13eYOMYkijxAqPP84iDO15+TSV
 zNqyAVjUglHCGUGjhSpAxnAmtp+IJTZfVAWcvIKq3VqvJtb8tssQsk9bqFjH1xlH
 qNJLE57CYe3tjw05K9y0keMh2iJWRWkXZYkgI/zjwo5nreemobpN+3fO4yneVLh7
 ecdBmSl/JVSzAB1NamLOCZNGZLUqiiuTvZlJtI6ZsekrN1+4A6QzVcU/MGjSYL1v
 C7W0hK0LF+e3xIBkxTKVq8noolsgbmlWacxJq8fZq9HwZy5IVJOVm9STDlCuLaIo
 gPr0V0itkclcsMU0CHTyCjMsfuHYUwJZXwg93wKfJf5UCzS4OWU=
 =ALO9
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "The highlights of this round are integrations with fs-verity and
  idmapped mounts, the rest is usual mix of minor improvements, speedups
  and cleanups.

  There are some patches outside of btrfs, namely updating some VFS
  interfaces, all straightforward and acked.

  Features:

   - fs-verity support, using standard ioctls, backward compatible with
     read-only limitation on inodes with previously enabled fs-verity

   - idmapped mount support

   - make mount with rescue=ibadroots more tolerant to partially damaged
     trees

   - allow raid0 on a single device and raid10 on two devices,
     degenerate cases but might be useful as an intermediate step during
     conversion to other profiles

   - zoned mode block group auto reclaim can be disabled via sysfs knob

  Performance improvements:

   - continue readahead of node siblings even if target node is in
     memory, could speed up full send (on sample test +11%)

   - batching of delayed items can speed up creating many files

   - fsync/tree-log speedups
       - avoid unnecessary work (gains +2% throughput, -2% run time on
         sample load)
       - reduced lock contention on renames (on dbench +4% throughput,
         up to -30% latency)

  Fixes:

   - various zoned mode fixes

   - preemptive flushing threshold tuning, avoid excessive work on
     almost full filesystems

  Core:

   - continued subpage support, preparation for implementing remaining
     features like compression and defragmentation; with some
     limitations, write is now enabled on 64K page systems with 4K
     sectors, still considered experimental
       - no readahead on compressed reads
       - inline extents disabled
       - disabled raid56 profile conversion and mount

   - improved flushing logic, fixing early ENOSPC on some workloads

   - inode flags have been internally split to read-only and read-write
     incompat bit parts, used by fs-verity

   - new tree items for fs-verity
       - descriptor item
       - Merkle tree item

   - inode operations extended to be namespace-aware

   - cleanups and refactoring

  Generic code changes:

   - fs: new export filemap_fdatawrite_wbc

   - fs: removed sync_inode

   - block: bio_trim argument type fixups

   - vfs: add namespace-aware lookup"

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
  btrfs: reset replace target device to allocation state on close
  btrfs: zoned: fix ordered extent boundary calculation
  btrfs: do not do preemptive flushing if the majority is global rsv
  btrfs: reduce the preemptive flushing threshold to 90%
  btrfs: tree-log: check btrfs_lookup_data_extent return value
  btrfs: avoid unnecessarily logging directories that had no changes
  btrfs: allow idmapped mount
  btrfs: handle ACLs on idmapped mounts
  btrfs: allow idmapped INO_LOOKUP_USER ioctl
  btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
  btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
  btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
  btrfs: allow idmapped SNAP_DESTROY ioctls
  btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
  btrfs: check whether fsgid/fsuid are mapped during subvolume creation
  btrfs: allow idmapped permission inode op
  btrfs: allow idmapped setattr inode op
  btrfs: allow idmapped tmpfile inode op
  btrfs: allow idmapped symlink inode op
  btrfs: allow idmapped mkdir inode op
  ...
2021-08-31 09:41:22 -07:00
Linus Torvalds
b91db6a0b5 for-5.15/io_uring-vfs-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs8fUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpio4D/9cGrHIbbZsuDIHzhaK2JIUrSG7G4GkcaG/
 NAqbOp7KvF+1elMY08DWLT0nnFqHM7REHIS4Lv55KCNtktTFfdYmxso4lPrRu67o
 MNbMJcEAglgIDw0xP4MfP/vZ0ftXJv8+OXSfL51pD4U40nWIZVpqn8WbWKRqjhGf
 nQhiANbl2mO2Ec7I/UgAIqwczQnF5HveCkX5106dAppma8yEH+v2TkvZyZp/TCU3
 h0ec26hLi+4QRBFm4O0yrVWj1gMS7yfHuEFSGw+jhp/WNTpH9A5pXFQjn7pIyJNi
 uqrwM7knrod9ZH2pE1825w0TrbqkOdcZCo+/NvJHOAy03LUBJ/9qDc+JJUWsEmLZ
 cpd8auaCfuAFx6ForHmKd+Pw1bANebWBMsClyQSh38+fsJ9myci3c3tkkzmO+dSW
 G+rZZochiG4nFSl+CvlUoFfztuu8rdbOLKI/9usPMHNcDiY4yAAmz80B9uQdtQp7
 tRLqegplsDODefLNvl0/Uj7WFJl6w5furchTXPmc+GSPFc+mpW08Olh7ScaCyD8c
 a8YXaQi5hwuUR1N7uW65Df/HGMbIDvxOStcurIakP0mOSvRKrojZgQhbJ8zuCG4y
 cRCwRUzvreNIoKK2ZxEvhLjhE5POaWgy6AtN/UI9k9BeVGQdboKVBGvub5Mv+ZKE
 HpchbANk8Q==
 =T7Zv
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block

Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe:
 "This adds io_uring support for mkdirat, symlinkat, and linkat"

* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
  io_uring: add support for IORING_OP_LINKAT
  io_uring: add support for IORING_OP_SYMLINKAT
  io_uring: add support for IORING_OP_MKDIRAT
  namei: update do_*() helpers to return ints
  namei: make do_linkat() take struct filename
  namei: add getname_uflags()
  namei: make do_symlinkat() take struct filename
  namei: make do_mknodat() take struct filename
  namei: make do_mkdirat() take struct filename
  namei: change filename_parentat() calling conventions
  namei: ignore ERR/NULL names in putname()
2021-08-30 19:39:59 -07:00
Dmitry Kadashev
cf30da90bc io_uring: add support for IORING_OP_LINKAT
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments.

In some internal places 'hardlink' is used instead of 'link' to avoid
confusion with the SQE links. Name 'link' conflicts with the existing
'link' member of io_kiocb.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:48:52 -06:00
Dmitry Kadashev
7a8721f84f io_uring: add support for IORING_OP_SYMLINKAT
IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags
and arguments.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:48:33 -06:00
Dmitry Kadashev
45f30dab39 namei: update do_*() helpers to return ints
Update the following to return int rather than long, for uniformity with
the rest of the do_* helpers in namei.c:

* do_rmdir()
* do_unlinkat()
* do_mkdirat()
* do_mknodat()
* do_symlinkat()

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Dmitry Kadashev
020250f31c namei: make do_linkat() take struct filename
Pass in the struct filename pointers instead of the user string, for
uniformity with do_renameat2, do_unlinkat, do_mknodat, etc.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-8-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Dmitry Kadashev
8228e2c313 namei: add getname_uflags()
There are a couple of places where we already open-code the (flags &
AT_EMPTY_PATH) check and io_uring will likely add another one in the
future.  Let's just add a simple helper getname_uflags() that handles
this directly and use it.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-7-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Dmitry Kadashev
da2d0cede3 namei: make do_symlinkat() take struct filename
Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_mkdnodat(), do_unlinkat(),
do_renameat(), do_mkdirat().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-6-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Dmitry Kadashev
7797251bb5 namei: make do_mknodat() take struct filename
Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_unlinkat(), do_renameat(),
do_mkdirat().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-5-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Dmitry Kadashev
584d3226d6 namei: make do_mkdirat() take struct filename
Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same. This is heavily based on
commit dbea8d345177 ("fs: make do_renameat2() take struct filename").

This behaves like do_unlinkat() and do_renameat2().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Dmitry Kadashev
0ee50b4753 namei: change filename_parentat() calling conventions
Since commit 5c31b6cedb ("namei: saner calling conventions for
filename_parentat()") filename_parentat() had the following behavior WRT
the passed in struct filename *:

* On error the name is consumed (putname() is called on it);
* On success the name is returned back as the return value;

Now there is a need for filename_create() and filename_lookup() variants
that do not consume the passed filename, and following the same "consume
the name only on error" semantics is proven to be hard to reason about
and result in confusing code.

Hence this preparation change splits filename_parentat() into two: one
that always consumes the name and another that never consumes the name.
This will allow to implement two filename_create() variants in the same
way, and is a consistent and hopefully easier to reason about approach.

Link: https://lore.kernel.org/io-uring/CAOKbgA7MiqZAq3t-HDCpSGUFfco4hMA9ArAE-74fTpU+EkvKPw@mail.gmail.com/
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-3-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Dmitry Kadashev
91ef658fb8 namei: ignore ERR/NULL names in putname()
Supporting ERR/NULL names in putname() makes callers code cleaner, and
is what some other path walking functions already support for the same
reason.

This also removes a few existing IS_ERR checks before putname().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/CAHk-=wgCac9hBsYzKMpHk0EbLgQaXR=OUAjHaBtaY+G8A9KhFg@mail.gmail.com/
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Dmitry Kadashev <dkadashev@gmail.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-2-dkadashev@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 13:41:26 -06:00
Christian Brauner
c2fd68b6b2 namei: add mapping aware lookup helper
Various filesystems rely on the lookup_one_len() helper to lookup a
single path component relative to a well-known starting point. Allow
such filesystems to support idmapped mounts by adding a version of this
helper to take the idmap into account when calling inode_permission().
This change is a required to let btrfs (and other filesystems) support
idmapped mounts.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:12 +02:00
Jeff Layton
f7e33bdbd6 fs: remove mandatory file locking support
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
off in Fedora and RHEL8. Several other distros have followed suit.

I've heard of one problem in all that time: Someone migrated from an
older distro that supported "-o mand" to one that didn't, and the host
had a fstab entry with "mand" in it which broke on reboot. They didn't
actually _use_ mandatory locking so they just removed the mount option
and moved on.

This patch rips out mandatory locking support wholesale from the kernel,
along with the Kconfig option and the Documentation file. It also
changes the mount code to ignore the "mand" mount option instead of
erroring out, and to throw a big, ugly warning.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
2021-08-23 06:15:36 -04:00
Linus Torvalds
58ec9059b3 Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs name lookup updates from Al Viro:
 "Small namei.c patch series, mostly to simplify the rules for nameidata
  state. It's actually from the previous cycle - but I didn't post it
  for review in time...

  Changes visible outside of fs/namei.c: file_open_root() calling
  conventions change, some freed bits in LOOKUP_... space"

* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  namei: make sure nd->depth is always valid
  teach set_nameidata() to handle setting the root as well
  take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
  switch file_open_root() to struct path
2021-07-03 11:41:14 -07:00
Linus Torvalds
34a456eb1f fs.idmapped.helpers.v5.13
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYIfiiwAKCRCRxhvAZXjc
 ogtMAQC+MtgJZdcH5iDHNEyI36JaWUccKRV7PdvfF1YgnXO45gD+IYxR1c/EQQyD
 kh2AmqhET6jVhe9Nsob5yxduksI+ygo=
 =oh/d
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs mapping helper updates from Christian Brauner:
 "This adds kernel-doc to all new idmapping helpers and improves their
  naming which was triggered by a discussion with some fs developers.
  Some of the names are based on suggestions by Vivek and Al.

  Also remove the open-coded permission checking in a few places with
  simple helpers. Overall this should lead to more clarity and make it
  easier to maintain"

* tag 'fs.idmapped.helpers.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: introduce two inode i_{u,g}id initialization helpers
  fs: introduce fsuidgid_has_mapping() helper
  fs: document and rename fsid helpers
  fs: document mapping helpers
2021-04-27 12:49:42 -07:00
Linus Torvalds
cc15422c1f fs.idmapped.docs.v5.13
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYIfiFwAKCRCRxhvAZXjc
 oswFAP4sL0oA7mBGDzoxktIMWKY+f7KKDjb9gXc8fDQV9bbcNwD6A9QPJCahfab9
 cndByav/xcB/7n/NXLecNYr8NcfTgg8=
 =mdyh
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.docs.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs helper kernel-doc updates from Christian Brauner:
 "In the last cycles we forgot to update the kernel-docs in some places
  that were changed during the idmapped mount work. Lukas and Randy took
  the chance to not just fixup those places but also fixup and expand
  kernel-docs for some additional helpers.

  No functional changes"

* tag 'fs.idmapped.docs.v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: update kernel-doc for vfs_rename()
  fs: turn some comments into kernel-doc
  xattr: fix kernel-doc for mnt_userns and vfs xattr helpers
  namei: fix kernel-doc for struct renamedata and more
  libfs: fix kernel-doc for mnt_userns
2021-04-27 12:42:03 -07:00
Al Viro
7962c7d196 namei: make sure nd->depth is always valid
Zero it in set_nameidata() rather than in path_init().  That way
it always matches the number of valid nd->stack[] entries.
Since terminate_walk() does zero it (after having emptied the
stack), we don't need to reinitialize it in subsequent path_init().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-07 13:57:17 -04:00
Al Viro
06422964c8 teach set_nameidata() to handle setting the root as well
That way we don't need the callers to mess with manually setting any fields
of nameidata instances.  Old set_nameidata() gets renamed (__set_nameidata()),
new becomes an inlined helper that takes a struct path pointer and deals
with setting nd->root and putting ND_ROOT_PRESET in nd->state when new
argument is non-NULL.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-07 13:57:16 -04:00
Al Viro
bcba1e7d0d take LOOKUP_{ROOT,ROOT_GRABBED,JUMPED} out of LOOKUP_... space
Separate field in nameidata (nd->state) holding the flags that
should be internal-only - that way we both get some spare bits
in LOOKUP_... and get simpler rules for nd->root lifetime rules,
since we can set the replacement of LOOKUP_ROOT (ND_ROOT_PRESET)
at the same time we set nd->root.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-07 13:57:13 -04:00
Al Viro
ffb37ca3bd switch file_open_root() to struct path
... and provide file_open_root_mnt(), using the root of given mount.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-07 13:56:43 -04:00
Al Viro
4f0ed93fb9 LOOKUP_MOUNTPOINT: we are cleaning "jumped" flag too late
That (and traversals in case of umount .) should be done before
complete_walk().  Either a braino or mismerge damage on queue
reorders - either way, I should've spotted that much earlier.

Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk>
X-Paperbag: Brown
Fixes: 161aff1d93 "LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat()"
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-06 20:33:00 -04:00
Al Viro
7d01ef7585 Make sure nd->path.mnt and nd->path.dentry are always valid pointers
Initialize them in set_nameidata() and make sure that terminate_walk() clears them
once the pointers become potentially invalid (i.e. we leave RCU mode or drop them
in non-RCU one).  Currently we have "path_init() always initializes them and nobody
accesses them outside of path_init()/terminate_walk() segments", which is asking
for trouble.

With that change we would have nd->path.{mnt,dentry}
	1) always valid - NULL or pointing to currently allocated objects.
	2) non-NULL while we are successfully walking
	3) NULL when we are not walking at all
	4) contributing to refcounts whenever non-NULL outside of RCU mode.

Fixes: 6c6ec2b0a3 ("fs: add support for LOOKUP_CACHED")
Reported-by: syzbot+c88a7030da47945a3cc3@syzkaller.appspotmail.com
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-06 12:33:07 -04:00
Randy Dunlap
2111c3c012
namei: fix kernel-doc for struct renamedata and more
Fix kernel-doc warnings in namei.c:

../fs/namei.c:1149: warning: Excess function parameter 'dir_mode' description in 'may_create_in_sticky'
../fs/namei.c:1149: warning: Excess function parameter 'dir_uid' description in 'may_create_in_sticky'
../fs/namei.c:3396: warning: Function parameter or member 'open_flag' not described in 'vfs_tmpfile'
../fs/namei.c:3396: warning: Excess function parameter 'open_flags' description in 'vfs_tmpfile'
../fs/namei.c:4460: warning: Function parameter or member 'rd' not described in 'vfs_rename'
../fs/namei.c:4460: warning: Excess function parameter 'old_mnt_userns' description in 'vfs_rename'
../fs/namei.c:4460: warning: Excess function parameter 'old_dir' description in 'vfs_rename'
../fs/namei.c:4460: warning: Excess function parameter 'old_dentry' description in 'vfs_rename'
../fs/namei.c:4460: warning: Excess function parameter 'new_mnt_userns' description in 'vfs_rename'
../fs/namei.c:4460: warning: Excess function parameter 'new_dir' description in 'vfs_rename'
../fs/namei.c:4460: warning: Excess function parameter 'new_dentry' description in 'vfs_rename'
../fs/namei.c:4460: warning: Excess function parameter 'delegated_inode' description in 'vfs_rename'
../fs/namei.c:4460: warning: Excess function parameter 'flags' description in 'vfs_rename'

Link: https://lore.kernel.org/r/20210216042929.8931-3-rdunlap@infradead.org
Fixes: 9fe6145097 ("namei: introduce struct renamedata")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-03-23 11:20:25 +01:00
Christian Brauner
8e5389132a
fs: introduce fsuidgid_has_mapping() helper
Don't open-code the checks and instead move them into a clean little
helper we can call. This also reduces the risk that if we ever change
something we forget to change all locations.

Link: https://lore.kernel.org/r/20210320122623.599086-4-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-03-23 11:15:24 +01:00
Christian Brauner
a65e58e791
fs: document and rename fsid helpers
Vivek pointed out that the fs{g,u}id_into_mnt() naming scheme can be
misleading as it could be understood as implying they do the exact same
thing as i_{g,u}id_into_mnt(). The original motivation for this naming
scheme was to signal to callers that the helpers will always take care
to map the k{g,u}id such that the ownership is expressed in terms of the
mnt_users.
Get rid of the confusion by renaming those helpers to something more
sensible. Al suggested mapped_fs{g,u}id() which seems a really good fit.
Usually filesystems don't need to bother with these helpers directly
only in some cases where they allocate objects that carry {g,u}ids which
are either filesystem specific (e.g. xfs quota objects) or don't have a
clean set of helpers as inodes have.

Link: https://lore.kernel.org/r/20210320122623.599086-3-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-03-23 11:13:32 +01:00
Linus Torvalds
7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Al Viro
eacd9aa8ce fix handling of nd->depth on LOOKUP_CACHED failures in try_to_unlazy*
After switching to non-RCU mode, we want nd->depth to match the number
of entries in nd->stack[] that need eventual path_put().
legitimize_links() takes care of that on failures; unfortunately,
failure exits added for LOOKUP_CACHED do not.

We could add the logics for that into those failure exits, both in
try_to_unlazy() and in try_to_unlazy_next(), but since both checks
are immediately followed by legitimize_links() and there's no calls
of legitimize_links() other than those two...  It's easier to
move the check (and required handling of nd->depth on failure) into
legitimize_links() itself.

[caught by Jens: ... and since we are zeroing ->depth here, we need
to do drop_links() first]

Fixes: 6c6ec2b0a3 "fs: add support for LOOKUP_CACHED"
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-02-20 12:33:12 -05:00
Christian Brauner
a2d2329e30
ima: handle idmapped mounts
IMA does sometimes access the inode's i_uid and compares it against the
rules' fowner. Enable IMA to handle idmapped mounts by passing down the
mount's user namespace. We simply make use of the helpers we introduced
before. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-27-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
549c729771
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
643fe55a06
open: handle idmapped mounts in do_truncate()
When truncating files the vfs will verify that the caller is privileged
over the inode. Extend it to handle idmapped mounts. If the inode is
accessed through an idmapped mount it is mapped according to the mount's
user namespace. Afterwards the permissions checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner
6521f89170
namei: prepare for idmapped mounts
The various vfs_*() helpers are called by filesystems or by the vfs
itself to perform core operations such as create, link, mkdir, mknod, rename,
rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
inode is accessed through an idmapped mount map it into the
mount's user namespace and pass it down. Afterwards the checks and
operations are identical to non-idmapped mounts. If the initial user
namespace is passed nothing changes so non-idmapped mounts will see
identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner
9fe6145097
namei: introduce struct renamedata
In order to handle idmapped mounts we will extend the vfs rename helper
to take two new arguments in follow up patches. Since this operations
already takes a bunch of arguments add a simple struct renamedata and
make the current helper use it before we extend it.

Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:18 +01:00
Christian Brauner
ba73d98745
namei: handle idmapped mounts in may_*() helpers
The may_follow_link(), may_linkat(), may_lookup(), may_open(),
may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
helpers determine whether the caller is privileged enough to perform the
associated operations. Let them handle idmapped mounts by mapping the
inode or fsids according to the mount's user namespace. Afterwards the
checks are identical to non-idmapped inodes. The patch takes care to
retrieve the mount's user namespace right before performing permission
checks and passing it down into the fileystem so the user namespace
can't change in between by someone idmapping a mount that is currently
not idmapped. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:17 +01:00
Christian Brauner
21cb47be6f
inode: make init and permission helpers idmapped mount aware
The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Allow it to
handle idmapped mounts. If the inode is accessed through an idmapped
mount it according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Similarly, allow the inode_init_owner() helper to handle idmapped
mounts. It initializes a new inode on idmapped mounts by mapping the
fsuid and fsgid of the caller from the mount's user namespace. If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
47291baa8d
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner
0558c1bf5a
capability: handle idmapped mounts
In order to determine whether a caller holds privilege over a given
inode the capability framework exposes the two helpers
privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
verifies that the inode has a mapping in the caller's user namespace and
the latter additionally verifies that the caller has the requested
capability in their current user namespace.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped inodes. If the initial user namespace is passed all
operations are a nop so non-idmapped mounts will not see a change in
behavior.

Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Jens Axboe
6c6ec2b0a3 fs: add support for LOOKUP_CACHED
io_uring always punts opens to async context, since there's no control
over whether the lookup blocks or not. Add LOOKUP_CACHED to support
just doing the fast RCU based lookups, which we know will not block. If
we can do a cached path resolution of the filename, then we don't have
to always punt lookups for a worker.

During path resolution, we always do LOOKUP_RCU first. If that fails and
we terminate LOOKUP_RCU, then fail a LOOKUP_CACHED attempt as well.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 11:42:21 -05:00
Al Viro
ae66db45fd saner calling conventions for unlazy_child()
same as for the previous commit - instead of 0/-ECHILD make
it return true/false, rename to try_to_unlazy_child().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 11:41:40 -05:00
Jens Axboe
e36cffed20 fs: make unlazy_walk() error handling consistent
Most callers check for non-zero return, and assume it's -ECHILD (which
it always will be). One caller uses the actual error return. Clean this
up and make it fully consistent, by having unlazy_walk() return a bool
instead. Rename it to try_to_unlazy() and return true on success, and
failure on error. That's easier to read.

No functional changes in this patch.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 11:10:49 -05:00
Steven Rostedt (VMware)
26ddb45e13 fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast()
Running my yearly branch profiling code, it detected a 100% wrong branch
condition in name.c for lookup_fast(). The code in question has:

		status = d_revalidate(dentry, nd->flags);
		if (likely(status > 0))
			return dentry;
		if (unlazy_child(nd, dentry, seq))
			return ERR_PTR(-ECHILD);
		if (unlikely(status == -ECHILD))
			/* we'd been told to redo it in non-rcu mode */
			status = d_revalidate(dentry, nd->flags);

If the status of the d_revalidate() is greater than zero, then the function
finishes. Otherwise, if it is an "unlazy_child" it returns with -ECHILD.
After the above two checks, the status is compared to -ECHILD, as that is
what is returned if the original d_revalidate() needed to be done in a
non-rcu mode.

Especially this path is called in a condition of:

	if (nd->flags & LOOKUP_RCU) {

And most of the d_revalidate() functions have:

	if (flags & LOOKUP_RCU)
		return -ECHILD;

It appears that that is the only case that this if statement is triggered
on two of my machines, running in production.

As it is dependent on what filesystem mix is configured in the running
kernel, simply remove the unlikely() from the if statement.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-04 00:12:33 -05:00
Al Viro
1e8f44f159 do_tmpfile(): don't mess with finish_open()
use vfs_open() instead

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-01-03 21:06:02 -05:00
Linus Torvalds
7bb5226c8a Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted patches from previous cycle(s)..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix hostfs_open() use of ->f_path.dentry
  Make sure that make_create_in_sticky() never sees uninitialized value of dir_mode
  fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set
  fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode()
  fs/namespace.c: WARN if mnt_count has become negative
2020-12-25 10:54:29 -08:00
Al Viro
1a97d899ec Make sure that make_create_in_sticky() never sees uninitialized value of dir_mode
make sure nd->dir_mode is always initialized after success exit from
link_path_walk(); in case of empty path it did not happen.

Reported-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Tested-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-10 17:33:17 -05:00
Jens Axboe
e886663cfd fs: make do_renameat2() take struct filename
Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same.

This behaves like do_unlinkat(), which also takes a filename struct and
puts it when it is done. Converting callers is then trivial.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:03:59 -07:00
Linus Torvalds
0eac1102e9 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff all over the place (the largest group here is
  Christoph's stat cleanups)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: remove KSTAT_QUERY_FLAGS
  fs: remove vfs_stat_set_lookup_flags
  fs: move vfs_fstatat out of line
  fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
  fs: remove vfs_statx_fd
  fs: omfs: use kmemdup() rather than kmalloc+memcpy
  [PATCH] reduce boilerplate in fsid handling
  fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
  selftests: mount: add nosymfollow tests
  Add a "nosymfollow" mount option.
2020-10-24 12:26:05 -07:00
Christoph Hellwig
402dd2cf46 fs: remove the unused SB_I_MULTIROOT flag
The last user of SB_I_MULTIROOT is disappeared with commit f2aedb713c
("NFS: Add fs_context support.")

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:38 -06:00
Mattias Nissler
dab741e0e0 Add a "nosymfollow" mount option.
For mounts that have the new "nosymfollow" option, don't follow symlinks
when resolving paths. The new option is similar in spirit to the
existing "nodev", "noexec", and "nosuid" options, as well as to the
LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD
variants have been supporting the "nosymfollow" mount option for a long
time with equivalent implementations.

Note that symlinks may still be created on file systems mounted with
the "nosymfollow" option present. readlink() remains functional, so
user space code that is aware of symlinks can still choose to follow
them explicitly.

Setting the "nosymfollow" mount option helps prevent privileged
writers from modifying files unintentionally in case there is an
unexpected link along the accessed path. The "nosymfollow" option is
thus useful as a defensive measure for systems that need to deal with
untrusted file systems in privileged contexts.

More information on the history and motivation for this patch can be
found here:

https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal

Signed-off-by: Mattias Nissler <mnissler@chromium.org>
Signed-off-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-27 16:06:47 -04:00
Kees Cook
fc4177be96 exec: restore EACCES of S_ISDIR execve()
Patch series "Fix S_ISDIR execve() errno".

Fix an errno change for execve() of directories, noticed by Marc Zyngier.
Along with the fix, include a regression test to avoid seeing this return
in the future.

This patch (of 2):

The return code for attempting to execute a directory has always been
EACCES.  Adjust the S_ISDIR exec test to reflect the old errno instead of
the general EISDIR for other kinds of "open" attempts on directories.

Fixes: 633fb6ac39 ("exec: move S_ISREG() check earlier")
Reported-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Greg Kroah-Hartman <gregkh@android.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@google.com>
Link: http://lkml.kernel.org/r/20200813231723.2725102-2-keescook@chromium.org
Link: https://lore.kernel.org/lkml/20200813151305.6191993b@why
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14 19:56:56 -07:00
Linus Torvalds
9ad57f6dfc Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
   mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
   memory-hotplug, cleanups, uaccess, migration, gup, pagemap),

 - various other subsystems (alpha, misc, sparse, bitmap, lib, bitops,
   checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump,
   exec, kdump, rapidio, panic, kcov, kgdb, ipc).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
  mm/gup: remove task_struct pointer for all gup code
  mm: clean up the last pieces of page fault accountings
  mm/xtensa: use general page fault accounting
  mm/x86: use general page fault accounting
  mm/sparc64: use general page fault accounting
  mm/sparc32: use general page fault accounting
  mm/sh: use general page fault accounting
  mm/s390: use general page fault accounting
  mm/riscv: use general page fault accounting
  mm/powerpc: use general page fault accounting
  mm/parisc: use general page fault accounting
  mm/openrisc: use general page fault accounting
  mm/nios2: use general page fault accounting
  mm/nds32: use general page fault accounting
  mm/mips: use general page fault accounting
  mm/microblaze: use general page fault accounting
  mm/m68k: use general page fault accounting
  mm/ia64: use general page fault accounting
  mm/hexagon: use general page fault accounting
  mm/csky: use general page fault accounting
  ...
2020-08-12 11:24:12 -07:00
Kees Cook
0fd338b2d2 exec: move path_noexec() check earlier
The path_noexec() check, like the regular file check, was happening too
late, letting LSMs see impossible execve()s.  Check it earlier as well in
may_open() and collect the redundant fs/exec.c path_noexec() test under
the same robustness comment as the S_ISREG() check.

My notes on the call path, and related arguments, checks, etc:

do_open_execat()
    struct open_flags open_exec_flags = {
        .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
        .acc_mode = MAY_EXEC,
        ...
    do_filp_open(dfd, filename, open_flags)
        path_openat(nameidata, open_flags, flags)
            file = alloc_empty_file(open_flags, current_cred());
            do_open(nameidata, file, open_flags)
                may_open(path, acc_mode, open_flag)
                    /* new location of MAY_EXEC vs path_noexec() test */
                    inode_permission(inode, MAY_OPEN | acc_mode)
                        security_inode_permission(inode, acc_mode)
                vfs_open(path, file)
                    do_dentry_open(file, path->dentry->d_inode, open)
                        security_file_open(f)
                        open()
    /* old location of path_noexec() test */

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:01 -07:00
Kees Cook
633fb6ac39 exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late.  This was
fixed in commit 73601ea5b7 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.

Move the test into the other inode type checks (which already look for
other pathological conditions[1]).  Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.

Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.

My notes on the call path, and related arguments, checks, etc:

do_open_execat()
    struct open_flags open_exec_flags = {
        .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
        .acc_mode = MAY_EXEC,
        ...
    do_filp_open(dfd, filename, open_flags)
        path_openat(nameidata, open_flags, flags)
            file = alloc_empty_file(open_flags, current_cred());
            do_open(nameidata, file, open_flags)
                may_open(path, acc_mode, open_flag)
		    /* new location of MAY_EXEC vs S_ISREG() test */
                    inode_permission(inode, MAY_OPEN | acc_mode)
                        security_inode_permission(inode, acc_mode)
                vfs_open(path, file)
                    do_dentry_open(file, path->dentry->d_inode, open)
                        /* old location of FMODE_EXEC vs S_ISREG() test */
                        security_file_open(f)
                        open()

[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:01 -07:00
Al Viro
24fb33d40d fix breakage in do_rmdir()
syzbot reported and bisected a use-after-free due to the recent init
cleanups.

The putname() should happen only after we'd *not* branched to retry,
same as it's done in do_unlinkat().

Reported-by: syzbot+bbeb1c88016c7db4aa24@syzkaller.appspotmail.com
Fixes: e24ab0ef68 "fs: push the getname from do_rmdir into the callers"
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:22:39 -07:00
Christoph Hellwig
5fee64fcde init: add an init_mknod helper
Add a simple helper to mknod with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_mknod.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:54 +02:00
Christoph Hellwig
83ff98c3e9 init: add an init_mkdir helper
Add a simple helper to mkdir with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_mkdir.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
cd3acb6a79 init: add an init_symlink helper
Add a simple helper to symlink with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_symlink.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
812931d693 init: add an init_link helper
Add a simple helper to link with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_link.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
e24ab0ef68 fs: push the getname from do_rmdir into the callers
This mirrors do_unlinkat and will make life a little easier for
the init code to reuse the whole function with a kernel filename.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:50 +02:00
Linus Torvalds
63d72b93f2 vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK
posix_acl_permission() does not care about MAY_NOT_BLOCK, and in fact
the permission logic internally must not check that bit (it's only for
upper layers to decide whether they can block to do IO to look up the
acl information or not).

But the way the code was written, it _looked_ like it cared, since the
function explicitly did not mask that bit off.

But it has exactly two callers: one for when that bit is set, which
first clears the bit before calling posix_acl_permission(), and the
other call site when that bit was clear.

So stop the silly games "saving" the MAY_NOT_BLOCK bit that must not be
used for the actual permission test, and that currently is pointlessly
cleared by the callers when the function itself should just not care.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:04:19 -07:00
Linus Torvalds
5fc475b749 vfs: do not do group lookup when not necessary
Rasmus Villemoes points out that the 'in_group_p()' tests can be a
noticeable expense, and often completely unnecessary.  A common
situation is that the 'group' bits are the same as the 'other' bits
wrt the permissions we want to test.

So rewrite 'acl_permission_check()' to not bother checking for group
ownership when the permission check doesn't care.

For example, if we're asking for read permissions, and both 'group' and
'other' allow reading, there's really no reason to check if we're part
of the group or not: either way, we'll allow it.

Rasmus says:
 "On a bog-standard Ubuntu 20.04 install, a workload consisting of
  compiling lots of userspace programs (i.e., calling lots of
  short-lived programs that all need to get their shared libs mapped in,
  and the compilers poking around looking for system headers - lots of
  /usr/lib, /usr/bin, /usr/include/ accesses) puts in_group_p around
  0.1% according to perf top.

  System-installed files are almost always 0755 (directories and
  binaries) or 0644, so in most cases, we can avoid the binary search
  and the cost of pulling the cred->groups array and in_group_p() .text
  into the cpu cache"

Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:04:19 -07:00
Miklos Szeredi
a3c751a50f vfs: allow unprivileged whiteout creation
Whiteouts, unlike real device node should not require privileges to create.

The general concern with device nodes is that opening them can have side
effects.  The kernel already avoids zero major (see
Documentation/admin-guide/devices.txt).  To be on the safe side the patch
explicitly forbids registering a char device with 0/0 number (see
cdev_add()).

This guarantees that a non-O_PATH open on a whiteout will fail with ENODEV;
i.e. it won't have any side effect.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:23 +02:00
Al Viro
5bd73286d5 fix a braino in legitimize_path()
brown paperbag time... wrong order of arguments ended up confusing
the values to check dentry and mount_lock seqcounts against.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: 2aa3847085 ("non-RCU analogue of the previous commit")
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-06 10:38:59 -04:00