mirror of
https://github.com/torvalds/linux.git
synced 2024-11-18 10:01:43 +00:00
6aee4badd8
Pull openat2 support from Al Viro:

"This is the openat2() series from Aleksa Sarai.

I'm afraid that the rest of namei stuff will have to wait - it got zero review the last time I'd posted #work.namei, and there had been a leak in the posted series I'd caught only last weekend. I was going to repost it on Monday, but the window opened and the odds of getting any review during that... Oh, well.

Anyway, openat2 part should be ready; that _did_ get sane amount of review and public testing, so here it comes"

From Aleksa's description of the series:

"For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1].

This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2).

Furthermore, the need for some sort of control over VFS's path resolution (to avoid malicious paths resulting in inadvertent breakouts) has been a very long-standing desire of many userspace applications.

This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset (which was a variant of David Drysdale's O_BENEATH patchset[4] which was a spin-off of the Capsicum project[5]) with a few additions and changes made based on the previous discussion within [6] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS, the flag has been split up into separate flags. However, instead of being an openat(2) flag it is provided through a new syscall openat2(2) which provides several other improvements to the openat(2) interface (see the patch description for more details). The following new LOOKUP_* flags are added:

LOOKUP_NO_XDEV: Blocks all mountpoint crossings (upwards, downwards, or through absolute links). Absolute pathnames alone in openat(2) do not trigger this. Magic-link traversal which implies a vfsmount jump is also blocked (though magic-link jumps on the same vfsmount are permitted).

LOOKUP_NO_MAGICLINKS: Blocks resolution through /proc/$pid/fd-style links. This is done by blocking the usage of nd_jump_link() during resolution in a filesystem. The term "magic-links" is used to match with the only reference to these links in Documentation/, but I'm happy to change the name. It should be noted that this is different to the scope of ~LOOKUP_FOLLOW in that it applies to all path components. However, you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it will *not* fail (assuming that no parent component was a magic-link), and you will have an fd for the magic-link. In order to correctly detect magic-links, the introduction of a new LOOKUP_MAGICLINK_JUMPED state flag was required.

LOOKUP_BENEATH: Disallows escapes to outside the starting dirfd's tree, using techniques such as ".." or absolute links. Absolute paths in openat(2) are also disallowed. Conceptually this flag is to ensure you "stay below" a certain point in the filesystem tree -- but this requires some additional work to protect against various races that would allow escape using "..". Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it can trivially beam you around the filesystem (breaking the protection). In future, there might be similar safety checks done as in LOOKUP_IN_ROOT, but that requires more discussion.

In addition, two new flags are added that expand on the above ideas:

LOOKUP_NO_SYMLINKS: Does what it says on the tin. No symlink resolution is allowed at all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an fd for the symlink as long as no parent path had a symlink component.

LOOKUP_IN_ROOT: This is an extension of LOOKUP_BENEATH that, rather than blocking attempts to move past the root, forces all such movements to be scoped to the starting point. This provides chroot(2)-like protection but without the cost of a chroot(2) for each filesystem operation, as well as being safe against race attacks that chroot(2) is not. If a race is detected (as with LOOKUP_BENEATH) then an error is generated, and similar to LOOKUP_BENEATH it is not permitted to cross magic-links with LOOKUP_IN_ROOT.

The primary need for this is from container runtimes, which currently need to do symlink scoping in userspace[7] when opening paths in a potentially malicious container. There is a long list of CVEs that could have been mitigated by having RESOLVE_THIS_ROOT (such as CVE-2017-1002101, CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a few).

In order to make all of the above more usable, I'm working on libpathrs[8] which is a C-friendly library for safe path resolution. It features a userspace-emulated backend if the kernel doesn't support openat2(2). Hopefully we can get userspace to switch to using it, and thus get openat2(2) support for free once it's ready.

Future work would include implementing things like RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow programs to be sure they don't hit DoSes through stale NFS handles)"

* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Documentation: path-lookup: include new LOOKUP flags
  selftests: add openat2(2) selftests
  open: introduce openat2(2) syscall
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: allow set_root() to produce errors
  namei: allow nd_jump_link() to produce errors
  nsfs: clean-up ns_get_path() signature to return int
  namei: only return -ECHILD from follow_dotdot_rcu()
288 lines
6.1 KiB
C
// SPDX-License-Identifier: GPL-2.0
#include <linux/mount.h>
#include <linux/pseudo_fs.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/proc_fs.h>
#include <linux/proc_ns.h>
#include <linux/magic.h>
#include <linux/ktime.h>
#include <linux/seq_file.h>
#include <linux/user_namespace.h>
#include <linux/nsfs.h>
#include <linux/uaccess.h>

#include "internal.h"

static struct vfsmount *nsfs_mnt;

static long ns_ioctl(struct file *filp, unsigned int ioctl,
			unsigned long arg);
static const struct file_operations ns_file_operations = {
	.llseek		= no_llseek,
	.unlocked_ioctl = ns_ioctl,
};

static char *ns_dname(struct dentry *dentry, char *buffer, int buflen)
{
	struct inode *inode = d_inode(dentry);
	const struct proc_ns_operations *ns_ops = dentry->d_fsdata;

	return dynamic_dname(dentry, buffer, buflen, "%s:[%lu]",
		ns_ops->name, inode->i_ino);
}

static void ns_prune_dentry(struct dentry *dentry)
{
	struct inode *inode = d_inode(dentry);
	if (inode) {
		struct ns_common *ns = inode->i_private;
		atomic_long_set(&ns->stashed, 0);
	}
}

const struct dentry_operations ns_dentry_operations =
{
	.d_prune	= ns_prune_dentry,
	.d_delete	= always_delete_dentry,
	.d_dname	= ns_dname,
};

static void nsfs_evict(struct inode *inode)
{
	struct ns_common *ns = inode->i_private;
	clear_inode(inode);
	ns->ops->put(ns);
}

static int __ns_get_path(struct path *path, struct ns_common *ns)
{
	struct vfsmount *mnt = nsfs_mnt;
	struct dentry *dentry;
	struct inode *inode;
	unsigned long d;

	rcu_read_lock();
	d = atomic_long_read(&ns->stashed);
	if (!d)
		goto slow;
	dentry = (struct dentry *)d;
	if (!lockref_get_not_dead(&dentry->d_lockref))
		goto slow;
	rcu_read_unlock();
	ns->ops->put(ns);
got_it:
	path->mnt = mntget(mnt);
	path->dentry = dentry;
	return 0;
slow:
	rcu_read_unlock();
	inode = new_inode_pseudo(mnt->mnt_sb);
	if (!inode) {
		ns->ops->put(ns);
		return -ENOMEM;
	}
	inode->i_ino = ns->inum;
	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
	inode->i_flags |= S_IMMUTABLE;
	inode->i_mode = S_IFREG | S_IRUGO;
	inode->i_fop = &ns_file_operations;
	inode->i_private = ns;

	dentry = d_alloc_anon(mnt->mnt_sb);
	if (!dentry) {
		iput(inode);
		return -ENOMEM;
	}
	d_instantiate(dentry, inode);
	dentry->d_fsdata = (void *)ns->ops;
	d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
	if (d) {
		d_delete(dentry);	/* make sure ->d_prune() does nothing */
		dput(dentry);
		cpu_relax();
		return -EAGAIN;
	}
	goto got_it;
}

int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb,
		     void *private_data)
{
	int ret;

	do {
		struct ns_common *ns = ns_get_cb(private_data);
		if (!ns)
			return -ENOENT;
		ret = __ns_get_path(path, ns);
	} while (ret == -EAGAIN);

	return ret;
}

struct ns_get_path_task_args {
	const struct proc_ns_operations *ns_ops;
	struct task_struct *task;
};

static struct ns_common *ns_get_path_task(void *private_data)
{
	struct ns_get_path_task_args *args = private_data;

	return args->ns_ops->get(args->task);
}

int ns_get_path(struct path *path, struct task_struct *task,
		  const struct proc_ns_operations *ns_ops)
{
	struct ns_get_path_task_args args = {
		.ns_ops	= ns_ops,
		.task	= task,
	};

	return ns_get_path_cb(path, ns_get_path_task, &args);
}

int open_related_ns(struct ns_common *ns,
		   struct ns_common *(*get_ns)(struct ns_common *ns))
{
	struct path path = {};
	struct file *f;
	int err;
	int fd;

	fd = get_unused_fd_flags(O_CLOEXEC);
	if (fd < 0)
		return fd;

	do {
		struct ns_common *relative;

		relative = get_ns(ns);
		if (IS_ERR(relative)) {
			put_unused_fd(fd);
			return PTR_ERR(relative);
		}

		err = __ns_get_path(&path, relative);
	} while (err == -EAGAIN);

	if (err) {
		put_unused_fd(fd);
		return err;
	}

	f = dentry_open(&path, O_RDONLY, current_cred());
	path_put(&path);
	if (IS_ERR(f)) {
		put_unused_fd(fd);
		fd = PTR_ERR(f);
	} else
		fd_install(fd, f);

	return fd;
}
EXPORT_SYMBOL_GPL(open_related_ns);

static long ns_ioctl(struct file *filp, unsigned int ioctl,
			unsigned long arg)
{
	struct user_namespace *user_ns;
	struct ns_common *ns = get_proc_ns(file_inode(filp));
	uid_t __user *argp;
	uid_t uid;

	switch (ioctl) {
	case NS_GET_USERNS:
		return open_related_ns(ns, ns_get_owner);
	case NS_GET_PARENT:
		if (!ns->ops->get_parent)
			return -EINVAL;
		return open_related_ns(ns, ns->ops->get_parent);
	case NS_GET_NSTYPE:
		return ns->ops->type;
	case NS_GET_OWNER_UID:
		if (ns->ops->type != CLONE_NEWUSER)
			return -EINVAL;
		user_ns = container_of(ns, struct user_namespace, ns);
		argp = (uid_t __user *) arg;
		uid = from_kuid_munged(current_user_ns(), user_ns->owner);
		return put_user(uid, argp);
	default:
		return -ENOTTY;
	}
}

int ns_get_name(char *buf, size_t size, struct task_struct *task,
			const struct proc_ns_operations *ns_ops)
{
	struct ns_common *ns;
	int res = -ENOENT;
	const char *name;
	ns = ns_ops->get(task);
	if (ns) {
		name = ns_ops->real_ns_name ? : ns_ops->name;
		res = snprintf(buf, size, "%s:[%u]", name, ns->inum);
		ns_ops->put(ns);
	}
	return res;
}

struct file *proc_ns_fget(int fd)
{
	struct file *file;

	file = fget(fd);
	if (!file)
		return ERR_PTR(-EBADF);

	if (file->f_op != &ns_file_operations)
		goto out_invalid;

	return file;

out_invalid:
	fput(file);
	return ERR_PTR(-EINVAL);
}

static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
{
	struct inode *inode = d_inode(dentry);
	const struct proc_ns_operations *ns_ops = dentry->d_fsdata;

	seq_printf(seq, "%s:[%lu]", ns_ops->name, inode->i_ino);
	return 0;
}

static const struct super_operations nsfs_ops = {
	.statfs = simple_statfs,
	.evict_inode = nsfs_evict,
	.show_path = nsfs_show_path,
};

static int nsfs_init_fs_context(struct fs_context *fc)
{
	struct pseudo_fs_context *ctx = init_pseudo(fc, NSFS_MAGIC);
	if (!ctx)
		return -ENOMEM;
	ctx->ops = &nsfs_ops;
	ctx->dops = &ns_dentry_operations;
	return 0;
}

static struct file_system_type nsfs = {
	.name = "nsfs",
	.init_fs_context = nsfs_init_fs_context,
	.kill_sb = kill_anon_super,
};

void __init nsfs_init(void)
{
	nsfs_mnt = kern_mount(&nsfs);
	if (IS_ERR(nsfs_mnt))
		panic("can't set nsfs up\n");
	nsfs_mnt->mnt_sb->s_flags &= ~SB_NOUSER;
}