linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-16 17:12:06 +00:00

Author	SHA1	Message	Date
Vasily Averin	30acd0bdfb	memcg: enable accounting for new namesapces and struct nsproxy Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associated structures. Net and uts namespaces have enabled accounting for such allocations. It makes sense to account for rest ones to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Vasily Averin	79f6540ba8	memcg: enable accounting for mnt_cache entries Patch series "memcg accounting from OpenVZ", v7. OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels. Initially we used our own accounting subsystem, then partially committed it to upstream, and a few years ago switched to cgroups v1. Now we're rebasing again, revising our old patches and trying to push them upstream. We try to protect the host system from any misuse of kernel memory allocation triggered by untrusted users inside the containers. Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing list, though I would be very grateful for any comments from maintainersi of affected subsystems or other people added in cc: Compared to the upstream, we additionally account the following kernel objects: - network devices and its Tx/Rx queues - ipv4/v6 addresses and routing-related objects - inet_bind_bucket cache objects - VLAN group arrays - ipv6/sit: ip_tunnel_prl - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets - nsproxy and namespace objects itself - IPC objects: semaphores, message queues and share memory segments - mounts - pollfd and select bits arrays - signals and posix timers - file lock - fasync_struct used by the file lease code and driver's fasync queues - tty objects - per-mm LDT We have an incorrect/incomplete/obsoleted accounting for few other kernel objects: sk_filter, af_packets, netlink and xt_counters for iptables. They require rework and probably will be dropped at all. Also we're going to add an accounting for nft, however it is not ready yet. We have not tested performance on upstream, however, our performance team compares our current RHEL7-based production kernel and reports that they are at least not worse as the according original RHEL7 kernel. This patch (of 10): The kernel allocates ~400 bytes of 'struct mount' for any new mount. Creating a new mount namespace clones most of the parent mounts, and this can be repeated many times. Additionally, each mount allocates up to PATH_MAX=4096 bytes for mnt->mnt_devname. It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Linus Torvalds	15517c724c	File locking change for v5.14 -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmEg5JwTHGpsYXl0b25A a2VybmVsLm9yZwAKCRAADmhBGVaCFfX3D/94AsPqwbnIyLgP2KjoltTKtnSsOZba tzTKwrA7vPB9b0aFg0i42omwRtGQk5jPaB8J+tBx+oc/JjO+Dpj5FOXOI9PjFr// 2zXBuMdvYWaPlsl9nw5dFzhybdPnv+HgNXA8PRqy/DifU7/oLWxHzQ4sGvr1XXNc vow5KzLe1K4R+hQBwqKLh/9VVE4+foVTAwXWcFBi/RrDf5X97BF7s98JYsNxfpbn EYSyLr818TCUlivAUzx+y0JD+qUknZLuNYWZ2HkYvI29y5h1gUCVmn1orlCDdDq/ G3SCF8M66l4xGW+bpzrEabqIDl+0IWhX+grU+UVdiMH4YuXmY4HPdNgKYIpJolyy zdg8cCO3OAHTJiwkj5jSfxFHQnzPr58LSQGDfrIrNjcDKlQbx3c+8R5yMy7Ar7jZ XAM6h88PAuBknTDWBQogZtuSqKbV8D2LsAUVgRKA7iKSwYXXUmZdW+UDiOavsHmg n2fbi1sPP7woQKyXzFddwxG3+2Nzby8BE7xuyHTXdOrQJNE1PvICH+WVo2m8+FCe uLKLXNf4UECO2MaSyd6k90v3AVty4u1EDTbitgcERztWGzazZdTVlmRdpwy35V+B wskKALzjGgqFoSwkEAnXQdk4+Gk9EtnrXD+HYfmxGYKk87fOBCgJtqDWYw3RHK7n giBfvji9HK3R8A== =xZLO -----END PGP SIGNATURE----- Merge tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull mandatory file locking deprecation warning from Jeff Layton: "As discussed on the list, this patch just adds a new warning for folks who still have mandatory locking enabled and actually mount with '-o mand'. I'd like to get this in for v5.14 so we can push this out into stable kernels and hopefully reach folks who have mounts with -o mand. For now, I'm operating under the assumption that we'll fully remove this support in v5.15, but we can move that out if any legitimate users of this facility speak up between now and then" * tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: warn about impending deprecation of mandatory locks	2021-08-21 10:50:22 -07:00
Jeff Layton	fdd92b64d1	fs: warn about impending deprecation of mandatory locks We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros have disabled it. Warn the stragglers that still use "-o mand" that we'll be dropping support for that mount option. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org>	2021-08-21 07:32:45 -04:00
Miklos Szeredi	427215d85e	ovl: prevent private clone if bind mount is not allowed Add the following checks from __do_loopback() to clone_private_mount() as well: - verify that the mount is in the current namespace - verify that there are no locked children Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de> Fixes: `c771d683a6` ("vfs: introduce clone_private_mount()") Cc: <stable@vger.kernel.org> # v3.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2021-08-10 10:21:31 +02:00
Christian Brauner	dd8b477f9a	mount: Support "nosymfollow" in new mount api Commit `dab741e0e0` ("Add a "nosymfollow" mount option.") added support for the "nosymfollow" mount option allowing to block following symlinks when resolving paths. The mount option so far was only available in the old mount api. Make it available in the new mount api as well. Bonus is that it can be applied to a whole subtree not just a single mount. Cc: Christoph Hellwig <hch@lst.de> Cc: Mattias Nissler <mnissler@chromium.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <zwisler@google.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-06-01 12:09:27 +02:00
Christian Brauner	2ca4dcc490	fs/mount_setattr: tighten permission checks We currently don't have any filesystems that support idmapped mounts which are mountable inside a user namespace. That was a deliberate decision for now as a userns root can just mount the filesystem themselves. So enforce this restriction explicitly until there's a real use-case for this. This way we can notice it and will have a chance to adapt and audit our translation helpers and fstests appropriately if we need to support such filesystems. Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org CC: linux-fsdevel@vger.kernel.org Suggested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-05-12 14:13:16 +02:00
Randy Dunlap	1f287bc4e2	fs/namespace: correct/improve kernel-doc notation Fix kernel-doc warnings in fs/namespace.c: ./fs/namespace.c:1379: warning: Function parameter or member 'm' not described in 'may_umount_tree' ./fs/namespace.c:1379: warning: Excess function parameter 'mnt' description in 'may_umount_tree' ./fs/namespace.c:1950: warning: Function parameter or member 'path' not described in 'clone_private_mount' Also convert path_is_mountpoint() comments to kernel-doc. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Allegedly-acked-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20210318025227.4162-1-rdunlap@infradead.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2021-03-31 14:22:55 -06:00
Linus Torvalds	5ceabb6078	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff pile - no common topic here" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: whack-a-mole: don't open-code iminor/imajor 9p: fix misuse of sscanf() in v9fs_stat2inode() audit_alloc_mark(): don't open-code ERR_CAST() fs/inode.c: make inode_init_always() initialize i_ino to 0 vfs: don't unnecessarily clone write access for writable fds	2021-02-27 08:07:12 -08:00
Christian Brauner	9caccd4154	fs: introduce MOUNT_ATTR_IDMAP Introduce a new mount bind mount property to allow idmapping mounts. The MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall together with a file descriptor referring to a user namespace. The user namespace referenced by the namespace file descriptor will be attached to the bind mount. All interactions with the filesystem going through that mount will be mapped according to the mapping specified in the user namespace attached to it. Using user namespaces to mark mounts means we can reuse all the existing infrastructure in the kernel that already exists to handle idmappings and can also use this for permission checking to allow unprivileged user to create idmapped mounts in the future. Idmapping a mount is decoupled from the caller's user and mount namespace. This means idmapped mounts can be created in the initial user namespace which is an important use-case for systemd-homed, portable usb-sticks between systems, sharing data between the initial user namespace and unprivileged containers, and other use-cases that have been brought up. For example, assume a home directory where all files are owned by uid and gid 1000 and the home directory is brought to a new laptop where the user has id 12345. The system administrator can simply create a mount of this home directory with a mapping of 1000:12345:1 and other mappings to indicate the ids should be kept. (With this it is e.g. also possible to create idmapped mounts on the host with an identity mapping 1:1:100000 where the root user is not mapped. A user with root access that e.g. has been pivot rooted into such a mount on the host will be not be able to execute, read, write, or create files as root.) Given that mapping a mount is decoupled from the caller's user namespace a sufficiently privileged process such as a container manager can set up an idmapped mount for the container and the container can simply pivot root to it. There's no need for the container to do anything. The mount will appear correctly mapped independent of the user namespace the container uses. This means we don't need to mark a mount as idmappable. In order to create an idmapped mount the caller must currently be privileged in the user namespace of the superblock the mount belongs to. Once a mount has been idmapped we don't allow it to change its mapping. This keeps permission checking and life-cycle management simple. Users wanting to change the idmapped can always create a new detached mount with a different idmapping. Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Mauricio Vásquez Bernal <mauricio@kinvolk.io> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:43:45 +01:00
Christian Brauner	2a1867219c	fs: add mount_setattr() This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC \| MS_REMOUNT \| MS_BIND \| MS_RDONLY \| MS_NOEXEC \| [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions. The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent: int mount_setattr(int dfd, const char path, unsigned flags, struct mount_attr uattr, size_t usize); Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future. The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; }; The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr. Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr. The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is. The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch. [1]: commit `2e4b7fcd92` ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:42:45 +01:00
Christian Brauner	5b490500f9	fs: add attr_flags_to_mnt_flags helper Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags which we will use in follow-up patches too. Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:29:34 +01:00
Christian Brauner	fbdc2f6c40	fs: split out functions to hold writers When a mount is marked read-only we set MNT_WRITE_HOLD on it if there aren't currently any active writers. Split this logic out into simple helpers that we can use in follow-up patches. Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:29:34 +01:00
Christian Brauner	e58ace1a0f	namespace: only take read lock in do_reconfigure_mnt() do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock which seems unnecessary since we're not changing the superblock. We're only checking whether it is already read-only. Setting other mount attributes is protected by lock_mount_hash() afaict and not by s_umount. The history of down_write(&sb->s_umount) lock being taken when setting mount attributes dates back to the introduction of MNT_READONLY in [2]. This introduced the concept of having read-only mounts in contrast to just having a read-only superblock. When it got introduced it was simply plumbed into do_remount() which already took down_write(&sb->s_umount) because it was only used to actually change the superblock before [2]. Afaict, it would've already been possible back then to only use down_read(&sb->s_umount) for MS_BIND \| MS_REMOUNT since actual mount options were protected by the vfsmount lock already. But that would've meant special casing the locking for MS_BIND \| MS_REMOUNT in do_remount() which people might not have considered worth it. Then in [1] MS_BIND \| MS_REMOUNT mount option changes were split out of do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount) lock was simply copied over. Now that we have this be a separate helper only take the down_read(&sb->s_umount) lock since we're only interested in checking whether the super block is currently read-only and blocking any writers from changing it. Essentially, checking that the super block is read-only has the advantage that we can avoid having to go into the slowpath and through MNT_WRITE_HOLD and can simply set the read-only flag on the mount in set_mount_attributes(). [1]: commit `43f5e655ef` ("vfs: Separate changing mount flags full remount") [2]: commit `2e4b7fcd92` ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:29:34 +01:00
Christian Brauner	d033cb6784	mount: make {lock,unlock}_mount_hash() static The lock_mount_hash() and unlock_mount_hash() helpers are never called outside a single file. Remove them from the header and make them static to reflect this fact. There's no need to have them callable from other places right now, as Christoph observed. Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:29:34 +01:00
Christian Brauner	68847c9417	namespace: take lock_mount_hash() directly when changing flags Changing mount options always ends up taking lock_mount_hash() but when MNT_READONLY is requested and neither the mount nor the superblock are MNT_READONLY we end up taking the lock, dropping it, and retaking it to change the other mount attributes. Instead, let's acquire the lock once when changing the mount attributes. This simplifies the locking in these codepath, makes them easier to reason about and avoids having to reacquire the lock right after dropping it. Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:29:34 +01:00
Christian Brauner	a6435940b6	mount: attach mappings to mounts In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. Later patches enforce that once a mount has been idmapped it can't be remapped. This keeps permission checking and life-cycle management simple. Users wanting to change the idmapped can always create a new detached mount with a different idmapping. Add a new mnt_userns member to vfsmount and two simple helpers to retrieve the mnt_userns from vfsmounts and files. The idea to attach user namespaces to vfsmounts has been floated around in various forms at Linux Plumbers in ~2018 with the original idea tracing back to a discussion in 2017 at a conference in St. Petersburg between Christoph, Tycho, and myself. Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:15 +01:00
Al Viro	a0a6df9afc	umount(2): move the flag validity checks first Unfortunately, there's userland code that used to rely upon these checks being done before anything else to check for UMOUNT_NOFOLLOW support. That broke in `41525f56e2` ("fs: refactor ksys_umount"). Separate those from the rest of checks and move them to ksys_umount(); unlike everything else in there, this can be sanely done there. Reported-by: Sargun Dhillon <sargun@sargun.me> Fixes: `41525f56e2` ("fs: refactor ksys_umount") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2021-01-04 15:31:58 -05:00
Eric Biggers	14e43bf435	vfs: don't unnecessarily clone write access for writable fds There's no need for mnt_want_write_file() to increment mnt_writers when the file is already open for writing, provided that mnt_drop_write_file() is changed to conditionally decrement it. We seem to have ended up in the current situation because mnt_want_write_file() used to be paired with mnt_drop_write(), due to mnt_drop_write_file() not having been added yet. So originally mnt_want_write_file() had to always increment mnt_writers. But later mnt_drop_write_file() was added, and all callers of mnt_want_write_file() were paired with it. This makes the compatibility between mnt_want_write_file() and mnt_drop_write() no longer necessary. Therefore, make __mnt_want_write_file() and __mnt_drop_write_file() skip incrementing mnt_writers on files already open for writing. This removes the only caller of mnt_clone_write(), so remove that too. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2021-01-04 14:02:08 -05:00
Linus Torvalds	7bb5226c8a	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted patches from previous cycle(s)..." * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix hostfs_open() use of ->f_path.dentry Make sure that make_create_in_sticky() never sees uninitialized value of dir_mode fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode() fs/namespace.c: WARN if mnt_count has become negative	2020-12-25 10:54:29 -08:00
Linus Torvalds	f9b4240b07	fixes-v5.11 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCX9daOgAKCRCRxhvAZXjc ohPkAQChXUB2BAjtIzXlCkZoDBbzHHblm5DZ37oy/4xYFmAcEwEA5sw6dQqyGHnF GEP9def51HvXLpBV2BzNUGggo1SoGgQ= =w/cO -----END PGP SIGNATURE----- Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull misc fixes from Christian Brauner: "This contains several fixes which felt worth being combined into a single branch: - Use put_nsproxy() instead of open-coding it switch_task_namespaces() - Kirill's work to unify lifecycle management for all namespaces. The lifetime counters are used identically for all namespaces types. Namespaces may of course have additional unrelated counters and these are not altered. This work allows us to unify the type of the counters and reduces maintenance cost by moving the counter in one place and indicating that basic lifetime management is identical for all namespaces. - Peilin's fix adding three byte padding to Dmitry's PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak. - Two smal patches to convert from the /* fall through / comment annotation to the fallthrough keyword annotation which I had taken into my branch and into -next before `df561f6688` ("treewide: Use fallthrough pseudo-keyword") made it upstream which fixed this tree-wide. Since I didn't want to invalidate all testing for other commits I didn't rebase and kept them" tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: nsproxy: use put_nsproxy() in switch_task_namespaces() sys: Convert to the new fallthrough notation signal: Convert to the new fallthrough notation time: Use generic ns_common::count cgroup: Use generic ns_common::count mnt: Use generic ns_common::count user: Use generic ns_common::count pid: Use generic ns_common::count ipc: Use generic ns_common::count uts: Use generic ns_common::count net: Use generic ns_common::count ns: Add a common refcount into ns_common ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()	2020-12-14 16:40:27 -08:00
Eric Biggers	edf7ddbf1c	fs/namespace.c: WARN if mnt_count has become negative Missing calls to mntget() (or equivalently, too many calls to mntput()) are hard to detect because mntput() delays freeing mounts using task_work_add(), then again using call_rcu(). As a result, mnt_count can often be decremented to -1 without getting a KASAN use-after-free report. Such cases are still bugs though, and they point to real use-after-frees being possible. For an example of this, see the bug fixed by commit `1b0b9cc8d3` ("vfs: fsmount: add missing mntget()"), discussed at https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u. This bug should have been trivial to find. But actually, it wasn't found until syzkaller happened to use fchdir() to manipulate the reference count just right for the bug to be noticeable. Address this by making mntput_no_expire() issue a WARN if mnt_count has become negative. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-12-10 17:33:17 -05:00
Linus Torvalds	0eac1102e9	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff all over the place (the largest group here is Christoph's stat cleanups)" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove KSTAT_QUERY_FLAGS fs: remove vfs_stat_set_lookup_flags fs: move vfs_fstatat out of line fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat fs: remove vfs_statx_fd fs: omfs: use kmemdup() rather than kmalloc+memcpy [PATCH] reduce boilerplate in fsid handling fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS selftests: mount: add nosymfollow tests Add a "nosymfollow" mount option.	2020-10-24 12:26:05 -07:00
Jens Axboe	91989c7078	task_work: cleanup notification modes A previous commit changed the notification mode from true/false to an int, allowing notify-no, notify-yes, or signal-notify. This was backwards compatible in the sense that any existing true/false user would translate to either 0 (on notification sent) or 1, the latter which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2. Clean this up properly, and define a proper enum for the notification mode. Now we have: - TWA_NONE. This is 0, same as before the original change, meaning no notification requested. - TWA_RESUME. This is 1, same as before the original change, meaning that we use TIF_NOTIFY_RESUME. - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the notification. Clean up all the callers, switching their 0/1/false/true to using the appropriate TWA_* mode for notifications. Fixes: `e91b481623` ("task_work: teach task_work_add() to do signal_wake_up()") Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-10-17 15:05:30 -06:00
Linus Torvalds	22230cd2c5	Merge branch 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat mount cleanups from Al Viro: "The last remnants of mount(2) compat buried by Christoph. Buried into NFS, that is. Generally I'm less enthusiastic about "let's use in_compat_syscall() deep in call chain" kind of approach than Christoph seems to be, but in this case it's warranted - that had been an NFS-specific wart, hopefully not to be repeated in any other filesystems (read: any new filesystem introducing non-text mount options will get NAKed even if it doesn't mess the layout up). IOW, not worth trying to grow an infrastructure that would avoid that use of in_compat_syscall()..." * 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove compat_sys_mount fs,nfs: lift compat nfs4 mount data handling into the nfs code nfs: simplify nfs4_parse_monolithic	2020-10-12 16:44:57 -07:00
Christoph Hellwig	028abd9222	fs: remove compat_sys_mount compat_sys_mount is identical to the regular sys_mount now, so remove it and use the native version everywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-09-22 23:45:57 -04:00
Catalin Marinas	d563d678aa	fs: Handle intra-page faults in copy_mount_options() The copy_mount_options() function takes a user pointer argument but no size and it tries to read up to a PAGE_SIZE. However, copy_from_user() is not guaranteed to return all the accessible bytes if, for example, the access crosses a page boundary and gets a fault on the second page. To work around this, the current copy_mount_options() implementation performs two copy_from_user() passes, first to the end of the current page and the second to what's left in the subsequent page. On arm64 with MTE enabled, access to a user page may trigger a fault after part of the buffer in a page has been copied (when the user pointer tag, bits 56-59, no longer matches the allocation tag stored in memory). Allow copy_mount_options() to handle such intra-page faults by resorting to byte at a time copy in case of copy_from_user() failure. Note that copy_from_user() handles the zeroing of the kernel buffer in case of error. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk>	2020-09-04 12:46:07 +01:00
Mattias Nissler	dab741e0e0	Add a "nosymfollow" mount option. For mounts that have the new "nosymfollow" option, don't follow symlinks when resolving paths. The new option is similar in spirit to the existing "nodev", "noexec", and "nosuid" options, as well as to the LOOKUP_NO_SYMLINKS resolve flag in the openat2(2) syscall. Various BSD variants have been supporting the "nosymfollow" mount option for a long time with equivalent implementations. Note that symlinks may still be created on file systems mounted with the "nosymfollow" option present. readlink() remains functional, so user space code that is aware of symlinks can still choose to follow them explicitly. Setting the "nosymfollow" mount option helps prevent privileged writers from modifying files unintentionally in case there is an unexpected link along the accessed path. The "nosymfollow" option is thus useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts. More information on the history and motivation for this patch can be found here: https://sites.google.com/a/chromium.org/dev/chromium-os/chromiumos-design-docs/hardening-against-malicious-stateful-data#TOC-Restricting-symlink-traversal Signed-off-by: Mattias Nissler <mnissler@chromium.org> Signed-off-by: Ross Zwisler <zwisler@google.com> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-08-27 16:06:47 -04:00
Kirill Tkhai	1a7b8969e6	mnt: Use generic ns_common::count Switch over mount namespaces to use the newly introduced common lifetime counter. Currently every namespace type has its own lifetime counter which is stored in the specific namespace struct. The lifetime counters are used identically for all namespaces types. Namespaces may of course have additional unrelated counters and these are not altered. This introduces a common lifetime counter into struct ns_common. The ns_common struct encompasses information that all namespaces share. That should include the lifetime counter since its common for all of them. It also allows us to unify the type of the counters across all namespaces. Most of them use refcount_t but one uses atomic_t and at least one uses kref. Especially the last one doesn't make much sense since it's just a wrapper around refcount_t since 2016 and actually complicates cleanup operations by having to use container_of() to cast the correct namespace struct out of struct ns_common. Having the lifetime counter for the namespaces in one place reduces maintenance cost. Not just because after switching all namespaces over we will have removed more code than we added but also because the logic is more easily understandable and we indicate to the user that the basic lifetime requirements for all namespaces are currently identical. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/159644980287.604812.761686947449081169.stgit@localhost.localdomain Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2020-08-19 14:14:19 +02:00
Linus Torvalds	d57b2b5bc4	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount leak fix from Al Viro: "Regression fix for the syscalls-for-init series - fix a leak of a 'struct path'" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: fix a struct path leak in path_umount	2020-08-07 21:03:25 -07:00
Christoph Hellwig	25ccd24ffd	fs: fix a struct path leak in path_umount Make sure we also put the dentry and vfsmnt in the illegal flags and !may_umount cases. Fixes: `41525f56e2` ("fs: refactor ksys_umount") Reported-by: Vikas Kumar <vikas.kumar2@arm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-08-07 19:21:30 -04:00
Linus Torvalds	e1ec517e18	Merge branch 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull init and set_fs() cleanups from Al Viro: "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series" * 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits) init: add an init_dup helper init: add an init_utimes helper init: add an init_stat helper init: add an init_mknod helper init: add an init_mkdir helper init: add an init_symlink helper init: add an init_link helper init: add an init_eaccess helper init: add an init_chmod helper init: add an init_chown helper init: add an init_chroot helper init: add an init_chdir helper init: add an init_rmdir helper init: add an init_unlink helper init: add an init_umount helper init: add an init_mount helper init: mark create_dev as __init init: mark console_on_rootfs as __init init: initialize ramdisk_execute_command at compile time devtmpfs: refactor devtmpfsd() ...	2020-08-07 09:40:34 -07:00
Christoph Hellwig	09267defa3	init: add an init_umount helper Like ksys_umount, but takes a kernel pointer for the destination path. Switch over the umount in the init code, which just happen to work due to the implicit set_fs(KERNEL_DS) during early init right now. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-31 08:17:51 +02:00
Christoph Hellwig	c60166f042	init: add an init_mount helper Like do_mount, but takes a kernel pointer for the destination path. Switch over the mounts in the init code and devtmpfs to it, which just happen to work due to the implicit set_fs(KERNEL_DS) during early init right now. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-31 08:17:51 +02:00
Christoph Hellwig	41525f56e2	fs: refactor ksys_umount Factor out a path_umount helper that takes a struct path * instead of the actual file name. This will allow to convert the init and devtmpfs code to properly mount based on a kernel pointer instead of relying on the implicit set_fs(KERNEL_DS) during early init. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-31 08:17:50 +02:00
Christoph Hellwig	a1e6aaa374	fs: refactor do_mount Factor out a path_mount helper that takes a struct path * instead of the actual file name. This will allow to convert the init and devtmpfs code to properly mount based on a kernel pointer instead of relying on the implicit set_fs(KERNEL_DS) during early init. Signed-off-by: Christoph Hellwig <hch@lst.de>	2020-07-31 08:17:50 +02:00
Miklos Szeredi	b330966f79	fuse: reject options on reconfigure via fsconfig(2) Previous patch changed handling of remount/reconfigure to ignore all options, including those that are unknown to the fuse kernel fs. This was done for backward compatibility, but this likely only affects the old mount(2) API. The new fsconfig(2) based reconfiguration could possibly be improved. This would make the new API less of a drop in replacement for the old, OTOH this is a good chance to get rid of some weirdnesses in the old API. Several other behaviors might make sense: 1) unknown options are rejected, known options are ignored 2) unknown options are rejected, known options are rejected if the value is changed, allowed otherwise 3) all options are rejected Prior to the backward compatibility fix to ignore all options all known options were accepted (1), even if they change the value of a mount parameter; fuse_reconfigure() does not look at the config values set by fuse_parse_param(). To fix that we'd need to verify that the value provided is the same as set in the initial configuration (2). The major drawback is that this is much more complex than just rejecting all attempts at changing options (3); i.e. all options signify initial configuration values and don't make sense on reconfigure. This patch opts for (3) with the rationale that no mount options are reconfigurable in fuse. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-07-14 14:45:41 +02:00
Linus Torvalds	4dbb29fe9d	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of trivial patches that fell through the cracks last cycle" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: fix indentation in deactivate_super() vfs: Remove duplicated d_mountpoint check in __is_local_mountpoint	2020-06-10 16:09:11 -07:00
Linus Torvalds	52435c86bf	overlayfs update for 5.8 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXt9klAAKCRDh3BK/laaZ PBeeAP9GRI0yajPzBzz2ZK9KkDc6A7wPiaAec+86Q+c02VncVwEAvq5Pi4um5RTZ 7SVv56ggKO3Cqx779zVyZTRYDs3+YA4= =bpKI -----END PGP SIGNATURE----- Merge tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "Fixes: - Resolve mount option conflicts consistently - Sync before remount R/O - Fix file handle encoding corner cases - Fix metacopy related issues - Fix an unintialized return value - Add missing permission checks for underlying layers Optimizations: - Allow multipe whiteouts to share an inode - Optimize small writes by inheriting SB_NOSEC from upper layer - Do not call ->syncfs() multiple times for sync(2) - Do not cache negative lookups on upper layer - Make private internal mounts longterm" * tag 'ovl-update-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (27 commits) ovl: remove unnecessary lock check ovl: make oip->index bool ovl: only pass ->ki_flags to ovl_iocb_to_rwf() ovl: make private mounts longterm ovl: get rid of redundant members in struct ovl_fs ovl: add accessor for ofs->upper_mnt ovl: initialize error in ovl_copy_xattr ovl: drop negative dentry in upper layer ovl: check permission to open real file ovl: call secutiry hook in ovl_real_ioctl() ovl: verify permissions in ovl_path_open() ovl: switch to mounter creds in readdir ovl: pass correct flags for opening real directory ovl: fix redirect traversal on metacopy dentries ovl: initialize OVL_UPPERDATA in ovl_lookup() ovl: use only uppermetacopy state in ovl_lookup() ovl: simplify setting of origin for index lookup ovl: fix out of bounds access warning in ovl_check_fb_len() ovl: return required buffer size for file handles ovl: sync dirty data when remounting to ro mode ...	2020-06-09 15:40:50 -07:00
Miklos Szeredi	df820f8de4	ovl: make private mounts longterm Overlayfs is using clone_private_mount() to create internal mounts for underlying layers. These are used for operations requiring a path, such as dentry_open(). Since these private mounts are not in any namespace they are treated as short term, "detached" mounts and mntput() involves taking the global mount_lock, which can result in serious cacheline pingpong. Make these private mounts longterm instead, which trade the penalty on mntput() for a slightly longer shutdown time due to an added RCU grace period when putting these mounts. Introduce a new helper kern_unmount_many() that can take care of multiple longterm mounts with a single RCU grace period. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-06-04 10:48:19 +02:00
Linus Torvalds	e7c93cbfe9	threads-v5.8 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXtYhfgAKCRCRxhvAZXjc oghSAP9uVX3vxYtEtNvu9WtEn1uYZcSKZoF1YrcgY7UfSmna0gEAruzyZcai4CJL WKv+4aRq2oYk+hsqZDycAxIsEgWvNg8= =ZWj3 -----END PGP SIGNATURE----- Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread updates from Christian Brauner: "We have been discussing using pidfds to attach to namespaces for quite a while and the patches have in one form or another already existed for about a year. But I wanted to wait to see how the general api would be received and adopted. This contains the changes to make it possible to use pidfds to attach to the namespaces of a process, i.e. they can be passed as the first argument to the setns() syscall. When only a single namespace type is specified the semantics are equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, when a pidfd is passed, multiple namespace flags can be specified in the second setns() argument and setns() will attach the caller to all the specified namespaces all at once or to none of them. Specifying 0 is not valid together with a pidfd. Here are just two obvious examples: setns(pidfd, CLONE_NEWPID \| CLONE_NEWNS \| CLONE_NEWNET); setns(pidfd, CLONE_NEWUSER); Allowing to also attach subsets of namespaces supports various use-cases where callers setns to a subset of namespaces to retain privilege, perform an action and then re-attach another subset of namespaces. Apart from significantly reducing the number of syscalls needed to attach to all currently supported namespaces (eight "open+setns" sequences vs just a single "setns()"), this also allows atomic setns to a set of namespaces, i.e. either attaching to all namespaces succeeds or we fail without having changed anything. This is centered around a new internal struct nsset which holds all information necessary for a task to switch to a new set of namespaces atomically. Fwiw, with this change a pidfd becomes the only token needed to interact with a container. I'm expecting this to be picked-up by util-linux for nsenter rather soon. Associated with this change is a shiny new test-suite dedicated to setns() (for pidfds and nsfds alike)" * tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: selftests/pidfd: add pidfd setns tests nsproxy: attach to namespaces via pidfds nsproxy: add struct nsset	2020-06-03 13:12:57 -07:00
Linus Torvalds	f359287765	Merge branch 'from-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted patches from Miklos. An interesting part here is /proc/mounts stuff..." The "/proc/mounts stuff" is using a cursor for keeeping the location data while traversing the mount listing. Also probably worth noting is the addition of faccessat2(), which takes an additional set of flags to specify how the lookup is done (AT_EACCESS, AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH). * 'from-miklos' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: add faccessat2 syscall vfs: don't parse "silent" option vfs: don't parse "posixacl" option vfs: don't parse forbidden flags statx: add mount_root statx: add mount ID statx: don't clear STATX_ATIME on SB_RDONLY uapi: deprecate STATX_ALL utimensat: AT_EMPTY_PATH support vfs: split out access_override_creds() proc/mounts: add cursor aio: fix async fsync creds vfs: allow unprivileged whiteout creation	2020-06-01 16:44:06 -07:00
Nikolay Borisov	5ad05cc8e0	vfs: Remove duplicated d_mountpoint check in __is_local_mountpoint This function acts as an out-of-line helper for is_local_mountpoint is only called after the latter verifies the dentry is not a mountpoint. There's no semantic changes and the resulting object code is smaller: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-26 (-26) Function old new delta __is_local_mountpoint 147 121 -26 Total: Before=34161, After=34135, chg -0.08% Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-05-29 10:35:24 -04:00
Miklos Szeredi	9f6c61f96f	proc/mounts: add cursor If mounts are deleted after a read(2) call on /proc/self/mounts (or its kin), the subsequent read(2) could miss a mount that comes after the deleted one in the list. This is because the file position is interpreted as the number mount entries from the start of the list. E.g. first read gets entries #0 to #9; the seq file index will be 10. Then entry #5 is deleted, resulting in #10 becoming #9 and #11 becoming #10, etc... The next read will continue from entry #10, and #9 is missed. Solve this by adding a cursor entry for each open instance. Taking the global namespace_sem for write seems excessive, since we are only dealing with a per-namespace list. Instead add a per-namespace spinlock and use that together with namespace_sem taken for read to protect against concurrent modification of the mount list. This may reduce parallelism of is_local_mountpoint(), but it's hardly a big contention point. We could also use RCU freeing of cursors to make traversal not need additional locks, if that turns out to be neceesary. Only move the cursor once for each read (cursor is not added on open) to minimize cacheline invalidation. When EOF is reached, the cursor is taken off the list, in order to prevent an excessive number of cursors due to inactive open file descriptors. Reported-by: Karel Zak <kzak@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-05-14 16:44:24 +02:00
Christian Brauner	303cc571d1	nsproxy: attach to namespaces via pidfds For quite a while we have been thinking about using pidfds to attach to namespaces. This patchset has existed for about a year already but we've wanted to wait to see how the general api would be received and adopted. Now that more and more programs in userspace have started using pidfds for process management it's time to send this one out. This patch makes it possible to use pidfds to attach to the namespaces of another process, i.e. they can be passed as the first argument to the setns() syscall. When only a single namespace type is specified the semantics are equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, when a pidfd is passed, multiple namespace flags can be specified in the second setns() argument and setns() will attach the caller to all the specified namespaces all at once or to none of them. Specifying 0 is not valid together with a pidfd. Here are just two obvious examples: setns(pidfd, CLONE_NEWPID \| CLONE_NEWNS \| CLONE_NEWNET); setns(pidfd, CLONE_NEWUSER); Allowing to also attach subsets of namespaces supports various use-cases where callers setns to a subset of namespaces to retain privilege, perform an action and then re-attach another subset of namespaces. If the need arises, as Eric suggested, we can extend this patchset to assume even more context than just attaching all namespaces. His suggestion specifically was about assuming the process' root directory when setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just keep it flexible in terms of supporting subsets of namespaces but let's wait until we have users asking for even more context to be assumed. At that point we can add an extension. The obvious example where this is useful is a standard container manager interacting with a running container: pushing and pulling files or directories, injecting mounts, attaching/execing any kind of process, managing network devices all these operations require attaching to all or at least multiple namespaces at the same time. Given that nowadays most containers are spawned with all namespaces enabled we're currently looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns> nsfds, another 7 to actually perform the namespace switch. With time namespaces we're looking at about 16 syscalls. (We could amortize the first 7 or 8 syscalls for opening the nsfds by stashing them in each container's monitor process but that would mean we need to send around those file descriptors through unix sockets everytime we want to interact with the container or keep on-disk state. Even in scenarios where a caller wants to join a particular namespace in a particular order callers still profit from batching other namespaces. That mostly applies to the user namespace but all container runtimes I found join the user namespace first no matter if it privileges or deprivileges the container similar to how unshare behaves.) With pidfds this becomes a single syscall no matter how many namespaces are supposed to be attached to. A decently designed, large-scale container manager usually isn't the parent of any of the containers it spawns so the containers don't die when it crashes or needs to update or reinitialize. This means that for the manager to interact with containers through pids is inherently racy especially on systems where the maximum pid number is not significicantly bumped. This is even more problematic since we often spawn and manage thousands or ten-thousands of containers. Interacting with a container through a pid thus can become risky quite quickly. Especially since we allow for an administrator to enable advanced features such as syscall interception where we're performing syscalls in lieu of the container. In all of those cases we use pidfds if they are available and we pass them around as stable references. Using them to setns() to the target process' namespaces is as reliable as using nsfds. Either the target process is already dead and we get ESRCH or we manage to attach to its namespaces but we can't accidently attach to another process' namespaces. So pidfds lend themselves to be used with this api. The other main advantage is that with this change the pidfd becomes the only relevant token for most container interactions and it's the only token we need to create and send around. Apart from significiantly reducing the number of syscalls from double digit to single digit which is a decent reason post-spectre/meltdown this also allows to switch to a set of namespaces atomically, i.e. either attaching to all the specified namespaces succeeds or we fail. If we fail we haven't changed a single namespace. There are currently three namespaces that can fail (other than for ENOMEM which really is not very interesting since we then have other problems anyway) for non-trivial reasons, user, mount, and pid namespaces. We can fail to attach to a pid namespace if it is not our current active pid namespace or a descendant of it. We can fail to attach to a user namespace because we are multi-threaded or because our current mount namespace shares filesystem state with other tasks, or because we're trying to setns() to the same user namespace, i.e. the target task has the same user namespace as we do. We can fail to attach to a mount namespace because it shares filesystem state with other tasks or because we fail to lookup the new root for the new mount namespace. In most non-pathological scenarios these issues can be somewhat mitigated. But there are cases where we're half-attached to some namespace and failing to attach to another one. I've talked about some of these problem during the hallway track (something only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles in 2018(?). Even if all these issues could be avoided with super careful userspace coding it would be nicer to have this done in-kernel. Pidfds seem to lend themselves nicely for this. The other neat thing about this is that setns() becomes an actual counterpart to the namespace bits of unshare(). Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Jann Horn <jannh@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com	2020-05-13 11:41:22 +02:00
Christian Brauner	f2a8d52e0a	nsproxy: add struct nsset Add a simple struct nsset. It holds all necessary pieces to switch to a new set of namespaces without leaving a task in a half-switched state which we will make use of in the next patch. This patch switches the existing setns logic over without causing a change in setns() behavior. This brings setns() closer to how unshare() works(). The prepare_ns() function is responsible to prepare all necessary information. This has two reasons. First it minimizes dependencies between individual namespaces, i.e. all install handler can expect that all fields are properly initialized independent in what order they are called in. Second, this makes the code easier to maintain and easier to follow if it needs to be changed. The prepare_ns() helper will only be switched over to use a flags argument in the next patch. Here it will still use nstype as a simple integer argument which was argued would be clearer. I'm not particularly opinionated about this if it really helps or not. The struct nsset itself already contains the flags field since its name already indicates that it can contain information required by different namespaces. None of this should have functional consequences. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Jann Horn <jannh@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com	2020-05-09 13:57:12 +02:00
Mauro Carvalho Chehab	0c1bc6b845	docs: filesystems: fix renamed references Some filesystem references got broken by a previous patch series I submitted. Address those. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Acked-by: David Sterba <dsterba@suse.com> # fs/affs/Kconfig Link: https://lore.kernel.org/r/57318c53008dbda7f6f4a5a9e5787f4d37e8565a.1586881715.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2020-04-20 15:45:22 -06:00
Al Viro	161aff1d93	LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() New LOOKUP flag, telling path_lookupat() to act as path_mountpointat(). IOW, traverse mounts at the final point and skip revalidation of the location where it ends up. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-03-13 21:08:17 -04:00
Al Viro	25e195aa1e	follow_automount(): get rid of dead^Wstillborn code 1) no instances of ->d_automount() have ever made use of the "return ERR_PTR(-EISDIR) if you don't feel like mounting anything" - that's a rudiment of plans that got superseded before the thing went into the tree. Despite the comment in follow_automount(), autofs has never done that. 2) if there's no ->d_automount() in dentry_operations, filesystems should not set DCACHE_NEED_AUTOMOUNT in the first place. None have ever done so... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-02-27 14:43:55 -05:00
Al Viro	26df6034fd	fix automount/automount race properly Protection against automount/automount races (two threads hitting the same referral point at the same time) is based upon do_add_mount() prevention of identical overmounts - trying to overmount the root of mounted tree with the same tree fails with -EBUSY. It's unreliable (the other thread might've mounted something on top of the automount it has triggered) and causes no end of headache for follow_automount() and its caller, since finish_automount() behaves like do_new_mount() - if the mountpoint to be is overmounted, it mounts on top what's overmounting it. It's not only wrong (we want to go into what's overmounting the automount point and quietly discard what we planned to mount there), it introduces the possibility of original parent mount getting dropped. That's what `8aef188452` (VFS: Fix vfsmount overput on simultaneous automount) deals with, but it can't do anything about the reliability of conflict detection - if something had been overmounted the other thread's automount (e.g. that other thread having stepped into automount in mount(2)), we don't get that -EBUSY and the result is referral point under automounted NFS under explicit overmount under another copy of automounted NFS What we need is finish_automount() NOT digging into overmounts - if it finds one, it should just quietly discard the thing it was asked to mount. And don't bother with actually crossing into the results of finish_automount() - the same loop that calls follow_automount() will do that just fine on the next iteration. IOW, instead of calling lock_mount() have finish_automount() do it manually, _without_ the "move into overmount and retry" part. And leave crossing into the results to the caller of follow_automount(), which simplifies it a lot. Moral: if you end up with a lot of glue working around the calling conventions of something, perhaps these calling conventions are simply wrong... Fixes: `8aef188452` (VFS: Fix vfsmount overput on simultaneous automount) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-02-27 14:40:43 -05:00

1 2 3 4 5 ...

609 Commits