The security_task_getsecid_subj() LSM hook invites misuse by allowing
callers to specify a task even though the hook is only safe when the
current task is referenced. Fix this by removing the task_struct
argument to the hook, requiring LSM implementations to use the
current task. While we are changing the hook declaration we also
rename the function to security_current_getsecid_subj() in an effort
to reinforce that the hook captures the subjective credentials of the
current task and not an arbitrary task on the system.
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmGANbAUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNaMBAAg+9gZr0F7xiafu8JFZqZfx/AQdJ2
G2cn3le+/tXGZmF8m/+82lOaR6LeQLatgSDJNSkXWkKr0nRwseQJDbtRfvYJdn0t
Ax05/Fmz6OGxQ2wgRYgaFiSrKpE5p3NhDtiLFVdkCJaQNe/8DZOc7NhBl6EjZf3x
ubhl2hUiJ4AmiXGwcYhr4uKgP4nhW8OM1/OkskVi+bBMmLA8KTY9kslmIDP5E3BW
29W4qhqeLNQupY5dGMEMVcyxY9ZUWpO39q4uOaQVZrUGE7xABkj/jhnxT5gFTSlI
pu8VhsYXm9KuRVveIsv0L5SZfadwoM9YAl7ki1wD3W5rHqOAte3rBTm6VmNlQwfU
MqxP65Jiyxudxet5Be3/dCRH/+MDQuwBxivgmZXbeVxor2SeznVb0GDaEUC5FSHu
CJIgWtQzsPJMxgAEGXN4F3QGP0htTTJni56GUPOsrf4TIBW02TT+oLTLFRIokQQL
INNOfwVSRXElnCsvxsHR4oB+JZ9pJyBaAmeupcQ6jmcKiWlbLj4s+W0U0pM5h91v
hmMpz7KMxrX6gVL4gB2Jj4aN3r5YRbq26NBu6D+wdwwBTeTTocaHSpAqkv4buClf
uNk3cG8Hkp8TTg9cM8jYgpxMyzKH/AI/Uw3VhEa1xCiq2Ck3DgfnZvnvcRRaZevU
FPgmwgqePJXGi60=
=sb8J
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
- Add LSM/SELinux/Smack controls and auditing for io-uring.
As usual, the individual commit descriptions have more detail, but we
were basically missing two things which we're adding here:
+ establishment of a proper audit context so that auditing of
io-uring ops works similarly to how it does for syscalls (with
some io-uring additions because io-uring ops are *not* syscalls)
+ additional LSM hooks to enable access control points for some of
the more unusual io-uring features, e.g. credential overrides.
The additional audit callouts and LSM hooks were done in conjunction
with the io-uring folks, based on conversations and RFC patches
earlier in the year.
- Fixup the binder credential handling so that the proper credentials
are used in the LSM hooks; the commit description and the code
comment which is removed in these patches are helpful to understand
the background and why this is the proper fix.
- Enable SELinux genfscon policy support for securityfs, allowing
improved SELinux filesystem labeling for other subsystems which make
use of securityfs, e.g. IMA.
* tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
security: Return xattr name from security_dentry_init_security()
selinux: fix a sock regression in selinux_ip_postroute_compat()
binder: use cred instead of task for getsecid
binder: use cred instead of task for selinux checks
binder: use euid from cred instead of using task
LSM: Avoid warnings about potentially unused hook variables
selinux: fix all of the W=1 build warnings
selinux: make better use of the nf_hook_state passed to the NF hooks
selinux: fix race condition when computing ocontext SIDs
selinux: remove unneeded ipv6 hook wrappers
selinux: remove the SELinux lockdown implementation
selinux: enable genfscon labeling for securityfs
Smack: Brutalist io_uring support
selinux: add support for the io_uring access controls
lsm,io_uring: add LSM hooks to io_uring
io_uring: convert io_uring to the secure anon inode interface
fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure()
audit: add filtering for io_uring records
audit,io_uring,io-wq: add some basic audit support to io_uring
audit: prepare audit_context for use in calling contexts beyond syscalls
Multiple corrections to smackfs.
W=1 fixes
Fix for overlayfs.
-----BEGIN PGP SIGNATURE-----
iQJLBAABCAA1FiEEC+9tH1YyUwIQzUIeOKUVfIxDyBEFAmGALSwXHGNhc2V5QHNj
aGF1Zmxlci1jYS5jb20ACgkQOKUVfIxDyBECcRAAuFN3uiptwLHC0P/Yy5VtANF4
kvRV7egXeIp8tYl6zVb+VIa3AgQMTB0BfdpSVMq8YyTtp5olczlM2b00zN4cVZq0
VO7XlTUfoljdsi4IysaA5rZZtjT5entoNl6eiCrMVwpvC4ZrlTqebOJteXDAnEQP
zI+smzc5mUFn1FUkGaQ+ciLHv2wyT39Wmk78BDKI/UAl09xV6Kxfci15Q9UPpZ94
CnYMfrjsinUCzQ+gbj9FIe5vXvcGwVoO7jJUNda5kuCSM3N4TTxD/fDkVxWEwOD2
eNjemub5RkxVWSlzRtVQgEFsssRsd6VYEKop34jyojDOGDj+JVQ03ntnFUfWScUz
8BkwyYsE8I+f878f0wAaz+8xrefjnwnRUHFzkF5hd6wLFCvQlCKULL/naRtig7pJ
x1V6Q+AC/Qyu0rNrSH5UCDgsvDQ3YzKocWhnvgCqJa8bd/QlfMKu8sxIwetNlctz
+TG+GwBLKaVmdiwWoI/CF3PkM4xYo4DtJtDnvlzAiEjGEYosEXilgDBq6IAD8vLa
cuSXtWCIpBk5VKkglvAvsIbXxnWa0W45j7PXyf8b7YXRWF511I8zBHjpDx6XP/Ko
FywGEaRDeNO3KZJxw9e39FUdyl1MT+s+gN3sERomUTig9RaPhp87pC/kWMWxVj+Y
fU/iIgrRTqa2spQgYNg=
=lPZg
-----END PGP SIGNATURE-----
Merge tag 'Smack-for-5.16' of https://github.com/cschaufler/smack-next
Pull smack updates from Casey Schaufler:
"Multiple corrections to smackfs:
- a change for overlayfs support that corrects the initial attributes
on created files
- code clean-up for netlabel processing
- several fixes in smackfs for a variety of reasons
- Errors reported by W=1 have been addressed
All told, nothing challenging"
* tag 'Smack-for-5.16' of https://github.com/cschaufler/smack-next:
smackfs: use netlbl_cfg_cipsov4_del() for deleting cipso_v4_doi
smackfs: use __GFP_NOFAIL for smk_cipso_doi()
Smack: fix W=1 build warnings
smack: remove duplicated hook function
Smack:- Use overlay inode label in smack_inode_copy_up()
smack: Guard smack_ipv6_lock definition within a SMACK_IPV6_PORT_LABELING block
smackfs: Fix use-after-free in netlbl_catmap_walk()
syzbot is reporting UAF at cipso_v4_doi_search() [1], for smk_cipso_doi()
is calling kfree() without removing from the cipso_v4_doi_list list after
netlbl_cfg_cipsov4_map_add() returned an error. We need to use
netlbl_cfg_cipsov4_del() in order to remove from the list and wait for
RCU grace period before kfree().
Link: https://syzkaller.appspot.com/bug?extid=93dba5b91f0fed312cbd [1]
Reported-by: syzbot <syzbot+93dba5b91f0fed312cbd@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 6c2e8ac095 ("netlabel: Update kernel configuration API")
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
syzbot is reporting kernel panic at smk_cipso_doi() due to memory
allocation fault injection [1]. The reason for need to use panic() was
not explained. But since no fix was proposed for 18 months, for now
let's use __GFP_NOFAIL for utilizing syzbot resource on other bugs.
Link: https://syzkaller.appspot.com/bug?extid=89731ccb6fec15ce1c22 [1]
Reported-by: syzbot <syzbot+89731ccb6fec15ce1c22@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
A couple of functions had malformed comment blocks.
Namespace parameters were added without updating the
comment blocks. These are all repaired in the Smack code,
so "% make W=1 security/smack" is warning free.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Currently in "smack_inode_copy_up()" function, process label is
changed with the label on parent inode. Due to which,
process is assigned directory label and whatever file or directory
created by the process are also getting directory label
which is wrong label.
Changes has been done to use label of overlay inode instead
of parent inode.
Signed-off-by: Vishal Goel <vishal.goel@samsung.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
The mutex smack_ipv6_lock is only used with the SMACK_IPV6_PORT_LABELING
block but its definition is outside of the block. This leads to a
defined-but-not-used warning on PREEMPT_RT.
Moving smack_ipv6_lock down to the block where it is used where it used
raises the question why is smk_ipv6_port_list read if nothing is added
to it.
Turns out, only smk_ipv6_port_check() is using it outside of an ifdef
SMACK_IPV6_PORT_LABELING block. However two of three caller invoke
smk_ipv6_port_check() from a ifdef block and only one is using
__is_defined() macro which requires the function and smk_ipv6_port_list
to be around.
Put the lock and list inside an ifdef SMACK_IPV6_PORT_LABELING block to
avoid the warning regarding unused mutex. Extend the ifdef-block to also
cover smk_ipv6_port_check(). Make smack_socket_connect() use ifdef
instead of __is_defined() to avoid complains about missing function.
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Jann Horn reported a problem with commit eb1231f73c ("selinux:
clarify task subjective and objective credentials") where some LSM
hooks were attempting to access the subjective credentials of a task
other than the current task. Generally speaking, it is not safe to
access another task's subjective credentials and doing so can cause
a number of problems.
Further, while looking into the problem, I realized that Smack was
suffering from a similar problem brought about by a similar commit
1fb057dcde ("smack: differentiate between subjective and objective
task credentials").
This patch addresses this problem by restoring the use of the task's
objective credentials in those cases where the task is other than the
current executing task. Not only does this resolve the problem
reported by Jann, it is arguably the correct thing to do in these
cases.
Cc: stable@vger.kernel.org
Fixes: eb1231f73c ("selinux: clarify task subjective and objective credentials")
Fixes: 1fb057dcde ("smack: differentiate between subjective and objective task credentials")
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Add Smack privilege checks for io_uring. Use CAP_MAC_OVERRIDE
for the override_creds case and CAP_MAC_ADMIN for creating a
polling thread. These choices are based on conjecture regarding
the intent of the surrounding code.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: make the smack_uring_* funcs static, remove debug code]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Syzkaller reported use-after-free bug as described in [1]. The bug is
triggered when smk_set_cipso() tries to free stale category bitmaps
while there are concurrent reader(s) using the same bitmaps.
Wait for RCU grace period to finish before freeing the category bitmaps
in smk_set_cipso(). This makes sure that there are no more readers using
the stale bitmaps and freeing them should be safe.
[1] https://lore.kernel.org/netdev/000000000000a814c505ca657a4e@google.com/
Reported-by: syzbot+3f91de0b813cc3d19a80@syzkaller.appspotmail.com
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Mark 'smack_enabled' as __initdata
since it is only used during initialization code.
Signed-off-by: Austin Kim <austin.kim@lge.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
In the smk_access_entry() function, if no matching rule is found
in the rust_list, a negative error code will be used to perform bit
operations with the MAY_ enumeration value. This is semantically
wrong. This patch fixes this issue.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Fix gcc W=1 warning:
security/smack/smack_access.c:342: warning: Function parameter or member 'ad' not described in 'smack_log'
security/smack/smack_access.c:403: warning: Function parameter or member 'skp' not described in 'smk_insert_entry'
security/smack/smack_access.c:487: warning: Function parameter or member 'level' not described in 'smk_netlbl_mls'
security/smack/smack_access.c:487: warning: Function parameter or member 'len' not described in 'smk_netlbl_mls'
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
This reverts commit 942cb357ae.
The io_uring PF_IO_WORKER threads no longer have PF_KTHREAD set, so no
need to special case them for credential checks.
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
A typo is found out by codespell tool in 383th line of smackfs.c:
$ codespell ./security/smack/
./smackfs.c:383: numer ==> number
Fix a typo found by codespell.
Signed-off-by: Xiong Zhenwu <xiong.zhenwu@zte.com.cn>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEgycj0O+d1G2aycA8rZhLv9lQBTwFAmCInP4ACgkQrZhLv9lQ
BTza0g//dTeb9woC9H7qlEhK4l9yk62lTss60Q8X7m7ZSNfdL4tiEbi64SgK+iOW
OOegbrOEb8Kzh4KJJYmVlVZ5YUWyH4szgmee1wnylBdsWiWaPLPF3Cflz77apy6T
TiiBsJd7rRE29FKheaMt34B41BMh8QHESN+DzjzJWsFoi/uNxjgSs2W16XuSupKu
bpRmB1pYNXMlrkzz7taL05jndZYE5arVriqlxgAsuLOFOp/ER7zecrjImdCM/4kL
W6ej0R1fz2Geh6CsLBJVE+bKWSQ82q5a4xZEkSYuQHXgZV5eywE5UKu8ssQcRgQA
VmGUY5k73rfY9Ofupf2gCaf/JSJNXKO/8Xjg0zAdklKtmgFjtna5Tyg9I90j7zn+
5swSpKuRpilN8MQH+6GWAnfqQlNoviTOpFeq3LwBtNVVOh08cOg6lko/bmebBC+R
TeQPACKS0Q0gCDPm9RYoU1pMUuYgfOwVfVRZK1prgi2Co7ZBUMOvYbNoKYoPIydr
ENBYljlU1OYwbzgR2nE+24fvhU8xdNOVG1xXYPAEHShu+p7dLIWRLhl8UCtRQpSR
1ofeVaJjgjrp29O+1OIQjB2kwCaRdfv/Gq1mztE/VlMU/r++E62OEzcH0aS+mnrg
yzfyUdI8IFv1q6FGT9yNSifWUWxQPmOKuC8kXsKYfqfJsFwKmHM=
=uCN4
-----END PGP SIGNATURE-----
Merge tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull Landlock LSM from James Morris:
"Add Landlock, a new LSM from Mickaël Salaün.
Briefly, Landlock provides for unprivileged application sandboxing.
From Mickaël's cover letter:
"The goal of Landlock is to enable to restrict ambient rights (e.g.
global filesystem access) for a set of processes. Because Landlock
is a stackable LSM [1], it makes possible to create safe security
sandboxes as new security layers in addition to the existing
system-wide access-controls. This kind of sandbox is expected to
help mitigate the security impact of bugs or unexpected/malicious
behaviors in user-space applications. Landlock empowers any
process, including unprivileged ones, to securely restrict
themselves.
Landlock is inspired by seccomp-bpf but instead of filtering
syscalls and their raw arguments, a Landlock rule can restrict the
use of kernel objects like file hierarchies, according to the
kernel semantic. Landlock also takes inspiration from other OS
sandbox mechanisms: XNU Sandbox, FreeBSD Capsicum or OpenBSD
Pledge/Unveil.
In this current form, Landlock misses some access-control features.
This enables to minimize this patch series and ease review. This
series still addresses multiple use cases, especially with the
combined use of seccomp-bpf: applications with built-in sandboxing,
init systems, security sandbox tools and security-oriented APIs [2]"
The cover letter and v34 posting is here:
https://lore.kernel.org/linux-security-module/20210422154123.13086-1-mic@digikod.net/
See also:
https://landlock.io/
This code has had extensive design discussion and review over several
years"
Link: https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/ [1]
Link: https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.net/ [2]
* tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
landlock: Enable user space to infer supported features
landlock: Add user and kernel documentation
samples/landlock: Add a sandbox manager example
selftests/landlock: Add user space tests
landlock: Add syscall implementations
arch: Wire up Landlock syscalls
fs,security: Add sb_delete hook
landlock: Support filesystem access-control
LSM: Infrastructure management of the superblock
landlock: Add ptrace restrictions
landlock: Set up the security framework and manage credentials
landlock: Add ruleset and domain management
landlock: Add object management
Move management of the superblock->sb_security blob out of the
individual security modules and into the security infrastructure.
Instead of allocating the blobs from within the modules, the modules
tell the infrastructure how much space is required, and the space is
allocated there.
Cc: John Johansen <john.johansen@canonical.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210422154123.13086-6-mic@digikod.net
Signed-off-by: James Morris <jamorris@linux.microsoft.com>
With the split of the security_task_getsecid() into subjective and
objective variants it's time to update Smack to ensure it is using
the correct task creds.
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Of the three LSMs that implement the security_task_getsecid() LSM
hook, all three LSMs provide the task's objective security
credentials. This turns out to be unfortunate as most of the hook's
callers seem to expect the task's subjective credentials, although
a small handful of callers do correctly expect the objective
credentials.
This patch is the first step towards fixing the problem: it splits
the existing security_task_getsecid() hook into two variants, one
for the subjective creds, one for the objective creds.
void security_task_getsecid_subj(struct task_struct *p,
u32 *secid);
void security_task_getsecid_obj(struct task_struct *p,
u32 *secid);
While this patch does fix all of the callers to use the correct
variant, in order to keep this patch focused on the callers and to
ease review, the LSMs continue to use the same implementation for
both hooks. The net effect is that this patch should not change
the behavior of the kernel in any way, it will be up to the latter
LSM specific patches in this series to change the hook
implementations and return the correct credentials.
Acked-by: Mimi Zohar <zohar@linux.ibm.com> (IMA)
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdfhttps://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
1d7b902e28
In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystem are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that port has been done correctly.
The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.
Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.
While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.
With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.
There is a simple tool available at
https://github.com/brauner/mount-idmapped
that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:
Here's an example to a simple idmapped mount of another user's home
directory:
u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"
* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...
syzbot found WARNINGs in several smackfs write operations where
bytes count is passed to memdup_user_nul which exceeds
GFP MAX_ORDER. Check count size if bigger than PAGE_SIZE.
Per smackfs doc, smk_write_net4addr accepts any label or -CIPSO,
smk_write_net6addr accepts any label or -DELETE. I couldn't find
any general rule for other label lengths except SMK_LABELLEN,
SMK_LONGLABEL, SMK_CIPSOMAX which are documented.
Let's constrain, in general, smackfs label lengths for PAGE_SIZE.
Although fuzzer crashes write to smackfs/netlabel on 0x400000 length.
Here is a quick way to reproduce the WARNING:
python -c "print('A' * 0x400000)" > /sys/fs/smackfs/netlabel
Reported-by: syzbot+a71a442385a0b2815497@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
When interacting with user namespace and non-user namespace aware
filesystem capabilities the vfs will perform various security checks to
determine whether or not the filesystem capabilities can be used by the
caller, whether they need to be removed and so on. The main
infrastructure for this resides in the capability codepaths but they are
called through the LSM security infrastructure even though they are not
technically an LSM or optional. This extends the existing security hooks
security_inode_removexattr(), security_inode_killpriv(),
security_inode_getsecurity() to pass down the mount's user namespace and
makes them aware of idmapped mounts.
In order to actually get filesystem capabilities from disk the
capability infrastructure exposes the get_vfs_caps_from_disk() helper.
For user namespace aware filesystem capabilities a root uid is stored
alongside the capabilities.
In order to determine whether the caller can make use of the filesystem
capability or whether it needs to be ignored it is translated according
to the superblock's user namespace. If it can be translated to uid 0
according to that id mapping the caller can use the filesystem
capabilities stored on disk. If we are accessing the inode that holds
the filesystem capabilities through an idmapped mount we map the root
uid according to the mount's user namespace. Afterwards the checks are
identical to non-idmapped mounts: reading filesystem caps from disk
enforces that the root uid associated with the filesystem capability
must have a mapping in the superblock's user namespace and that the
caller is either in the same user namespace or is a descendant of the
superblock's user namespace. For filesystems that are mountable inside
user namespace the caller can just mount the filesystem and won't
usually need to idmap it. If they do want to idmap it they can create an
idmapped mount and mark it with a user namespace they created and which
is thus a descendant of s_user_ns. For filesystems that are not
mountable inside user namespaces the descendant rule is trivially true
because the s_user_ns will be the initial user namespace.
If the initial user namespace is passed nothing changes so non-idmapped
mounts will see identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When interacting with extended attributes the vfs verifies that the
caller is privileged over the inode with which the extended attribute is
associated. For posix access and posix default extended attributes a uid
or gid can be stored on-disk. Let the functions handle posix extended
attributes on idmapped mounts. If the inode is accessed through an
idmapped mount we need to map it according to the mount's user
namespace. Afterwards the checks are identical to non-idmapped mounts.
This has no effect for e.g. security xattrs since they don't store uids
or gids and don't perform permission checks on them like posix acls do.
Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
in the face of io_uring's use of kernel threads.
-----BEGIN PGP SIGNATURE-----
iQJLBAABCAA1FiEEC+9tH1YyUwIQzUIeOKUVfIxDyBEFAl/ig/gXHGNhc2V5QHNj
aGF1Zmxlci1jYS5jb20ACgkQOKUVfIxDyBGqvRAArz1+Ki7ipJVOusXAMtBLimUc
4rQnk6vCLZllojY5K0swfN7Y8zDMwrOiGLO9zXFeWFT3lc9vf0HVDDAerIZ+go8n
g6gII+OWPM2E0ywBgzXgzt5eSOkJ7npcn5zYAHz5sZGlYjWfVu7IAtR6GAcN0aeC
C75NTJSTbx2uyxUL/5rRJBAiNKHRbmzsNX103SKqAEuhEmHwb1yXE2i8xrxKe/un
vQv7CJQYZRf6tFEpC8zYPDZb1/EXifxQkzxGhmJCJcl+ZTWW8RplNk8ij/BIyb/V
Bupj+cF/waq6lXKGQB/lxrUiOYrUzk09RBaawH5ZTVnQ+kBgXPJjc3gjrOdjuVji
DVl4oImgOGrTpq36PdldWIlB1tY1hrGgP4qjaRn0/6zJocLmuu3fVlVbHb/xMgMT
Nq5/Ny1ymfH3w/hsg9Usze8nAL8Sj3pt6j0hdmgqS31Fth7aR16eLxZY5rK0jDWu
d+eEobZx/ww0jJB8s+cVMGxhFNlq1Qmu0+nCPGHKtfnBwAWdMfu2M25lg3s2CBb+
8+1edxG9Vfcq7laNUUrV9z8RAJpog+22DjEZpq2pCKIHgGqEYsvrxFcdLX/ymlTi
RDtIi0+SF2Vt8B6M+iZ8gbTAGzcGZi1uHXZCha95A1nwLq6X7ZwzIiIzL0CicvZJ
QPG3O4zvys3LthxZFoA=
=zXSS
-----END PGP SIGNATURE-----
Merge tag 'Smack-for-5.11-io_uring-fix' of git://github.com/cschaufler/smack-next
Pull smack fix from Casey Schaufler:
"Provide a fix for the incorrect handling of privilege in the face of
io_uring's use of kernel threads. That invalidated an long standing
assumption regarding the privilege of kernel threads.
The fix is simple and safe. It was provided by Jens Axboe and has been
tested"
* tag 'Smack-for-5.11-io_uring-fix' of git://github.com/cschaufler/smack-next:
Smack: Handle io_uring kernel thread privileges
Smack assumes that kernel threads are privileged for smackfs
operations. This was necessary because the credential of the
kernel thread was not related to a user operation. With io_uring
the credential does reflect a user's rights and can be used.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
set of corrections in function header comments.
-----BEGIN PGP SIGNATURE-----
iQJLBAABCAA1FiEEC+9tH1YyUwIQzUIeOKUVfIxDyBEFAl/ZAUgXHGNhc2V5QHNj
aGF1Zmxlci1jYS5jb20ACgkQOKUVfIxDyBFgFhAAgxDoN1EV1LMbt5YHHWVCRkuN
a9anGIoKQz++B+XMXJ4p5aouJvm5Bg7XATgHagUIK8oT3i4EW/8t1mD2tSw1kvrn
WARe0T+DrH+4NL1S53hcsfNVYdfUL4oSaQP5S/D7OezAILs8VDz108a97AWmrTnA
0Ed3q+C8YzqQfPP3KjcNhXYOBpx2EwSwmWNeyAL9QU4lGYUbeO9gN7TVXQnRK4OH
/cjhPRhe9D7Abr2QtHuVjzjFvpLx6hlkNOCI9o1kI9FM7JlC2m2nAEUerm1/DRxC
z7vB0CP1nc8Q6MlbeI+fM3CQvWaDXV6/fKcHfUQ5sqIL2utyZzKZwg1xxfDvFYZ8
WmF3v0XKUwW30rBa7Zbq+IBPSZ7ITWnIrIig2HwhdGxEBa6GHcy6MLMrOVKmQPsx
2cNiFd8+FizP82tSzHpfb8hS2834Qz1RS/z7rpXFhkq71EqgLTl9Taghtv6gOYiX
WGRiz4zOCqtOIzdQWfz+npLwfmr7ILfw1B5ftjMJv5U6Fq0JPowGswl999OYMvUr
Ct/iyAcT64M5TTac8T7qFjzmFpjjGF8Ev3NPPnqYLTBFYA13/3Wrjdj8VDM92RXW
HaqVv3bMaYwOzREI1OjiDyfJq+IwfUQNLaABSW+3lwuw3BTAabttkheMLnYAAY7O
KXNhqO97Vp8Q83JBbtk=
=LBD6
-----END PGP SIGNATURE-----
Merge tag 'Smack-for-5.11' of git://github.com/cschaufler/smack-next
Pull smack updates from Casey Schaufler:
"There are no functional changes. Just one minor code clean-up and a
set of corrections in function header comments"
* tag 'Smack-for-5.11' of git://github.com/cschaufler/smack-next:
security/smack: remove unused varible 'rc'
Smack: fix kernel-doc interface on functions
A followup change to tcp_request_sock_op would have to drop the 'const'
qualifier from the 'route_req' function as the
'security_inet_conn_request' call is moved there - and that function
expects a 'struct sock *'.
However, it turns out its also possible to add a const qualifier to
security_inet_conn_request instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This varible isn't used and can be removed to avoid a gcc warning:
security/smack/smack_lsm.c:3873:6: warning: variable ‘rc’ set but not
used [-Wunused-but-set-variable]
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: linux-security-module@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
The are some kernel-doc interface issues:
security/smack/smackfs.c:1950: warning: Function parameter or member
'list' not described in 'smk_parse_label_list'
security/smack/smackfs.c:1950: warning: Excess function parameter
'private' description in 'smk_parse_label_list'
security/smack/smackfs.c:1979: warning: Function parameter or member
'list' not described in 'smk_destroy_label_list'
security/smack/smackfs.c:1979: warning: Excess function parameter 'head'
description in 'smk_destroy_label_list'
security/smack/smackfs.c:2141: warning: Function parameter or member
'count' not described in 'smk_read_logging'
security/smack/smackfs.c:2141: warning: Excess function parameter 'cn'
description in 'smk_read_logging'
security/smack/smackfs.c:2278: warning: Function parameter or member
'format' not described in 'smk_user_access'
Correct them in this patch.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: linux-security-module@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Two kernel test robot suggested clean-ups.
Teach Smack to use the IPv4 netlabel cache.
This results in a 12-14% improvement on TCP benchmarks.
-----BEGIN PGP SIGNATURE-----
iQJLBAABCAA1FiEEC+9tH1YyUwIQzUIeOKUVfIxDyBEFAl+EkO0XHGNhc2V5QHNj
aGF1Zmxlci1jYS5jb20ACgkQOKUVfIxDyBFmWBAAqJzkYd1Pd3g0vewZXGe5mRxs
Qqt6i3HMJ7yAPIJN3445DyHUB6Sad2jmGXGullROfUtJYFkTbxqIj0XT+RUrJFap
qf54Sgzt/ucNTvZ45s5mV6iYdL2ol7NSB9kQOMVMV0BZf2GXknYv0/rP23wI/upo
Ylk41UYZNE5RDwoH90sy9aUUJ6VHhMYJOj33aUXz6rwaO1Ck6xvu1tCDudJpERkU
kUmgpYzkNVGmg+GQIOT/xr7C7LRFxfnbSX0UL5TYPEKC23tYfkSBoClgCpNSSYSh
3+7+YNRjMjxuKYJEPTUliDeDMQLxQB/3tLMD2c92nADtTJyRkTZelSgFhDFiOW0L
ln5otQ0aGuDvkmLmBLk/jcq3eKctztqJIQdxQG9K3kJdohz1t/onJJt3cwuWZIo5
T8dKz/mhga11u0ii6Zk+ecDdpagcxrQ8FDKcjI1tgXtTKAEvjfhVwAbj/9X55BGY
T3M52b6RuE/ZRPD/BKXMaAUTNI0jwYbJyZYm+5GY/fk6lg0CRNaPBTVr0hj1Ia4P
7akSD1gYUSTblFvjGa+hBsnmUDHo7Htl+kXJ3v6U8bSPFgFsO7hLbZJR6nRbxDvV
k/KejR7RsORkbTFczWGngTKGI2UUNgMJf3Km9WU5bLqU3PpM32Hlal3CLMm3dQTB
/3Gfl6aCy7Ct72tNilw=
=EgBn
-----END PGP SIGNATURE-----
Merge tag 'Smack-for-5.10' of git://github.com/cschaufler/smack-next
Pull smack updates from Casey Schaufler:
"Two minor fixes and one performance enhancement to Smack. The
performance improvement is significant and the new code is more like
its counterpart in SELinux.
- Two kernel test robot suggested clean-ups.
- Teach Smack to use the IPv4 netlabel cache. This results in a
12-14% improvement on TCP benchmarks"
* tag 'Smack-for-5.10' of git://github.com/cschaufler/smack-next:
Smack: Remove unnecessary variable initialization
Smack: Fix build when NETWORK_SECMARK is not set
Smack: Use the netlabel cache
Smack: Set socket labels only once
Smack: Consolidate uses of secmark into a function
The initialization of rc in smack_from_netlbl() is pointless.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Use proper conditional compilation for the secmark field in
the network skb.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Utilize the Netlabel cache mechanism for incoming packet matching.
Refactor the initialization of secattr structures, as it was being
done in two places.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Refactor the IP send checks so that the netlabel value
is set only when necessary, not on every send. Some functions
get renamed as the changes made the old name misleading.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Add a function smack_from_skb() that returns the Smack label
identified by a network secmark. Replace the explicit uses of
the secmark with this function.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
We have an upper bound on "maplevel" but forgot to check for negative
values.
Fixes: e114e47377 ("Smack: Simplified Mandatory Access Control Kernel")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
This is similar to commit 84e99e58e8 ("Smack: slab-out-of-bounds in
vsscanf") where we added a bounds check on "rule".
Reported-by: syzbot+a22c6092d003d6fe1122@syzkaller.appspotmail.com
Fixes: f7112e6c9a ("Smack: allow for significantly longer Smack labels v4")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
smk_write_relabel_self() frees memory from the task's credentials with
no locking, which can easily cause a use-after-free because multiple
tasks can share the same credentials structure.
Fix this by using prepare_creds() and commit_creds() to correctly modify
the task's credentials.
Reproducer for "BUG: KASAN: use-after-free in smk_write_relabel_self":
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
static void *thrproc(void *arg)
{
int fd = open("/sys/fs/smackfs/relabel-self", O_WRONLY);
for (;;) write(fd, "foo", 3);
}
int main()
{
pthread_t t;
pthread_create(&t, NULL, thrproc, NULL);
thrproc(NULL);
}
Reported-by: syzbot+e6416dabb497a650da40@syzkaller.appspotmail.com
Fixes: 38416e5393 ("Smack: limited capability for changing process label")
Cc: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL
C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw
f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ
JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0
Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK
KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP
E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12
72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK
C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7
GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq
eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU
hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops=
=YTGf
-----END PGP SIGNATURE-----
Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull notification queue from David Howells:
"This adds a general notification queue concept and adds an event
source for keys/keyrings, such as linking and unlinking keys and
changing their attributes.
Thanks to Debarshi Ray, we do have a pull request to use this to fix a
problem with gnome-online-accounts - as mentioned last time:
https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47
Without this, g-o-a has to constantly poll a keyring-based kerberos
cache to find out if kinit has changed anything.
[ There are other notification pending: mount/sb fsinfo notifications
for libmount that Karel Zak and Ian Kent have been working on, and
Christian Brauner would like to use them in lxc, but let's see how
this one works first ]
LSM hooks are included:
- A set of hooks are provided that allow an LSM to rule on whether or
not a watch may be set. Each of these hooks takes a different
"watched object" parameter, so they're not really shareable. The
LSM should use current's credentials. [Wanted by SELinux & Smack]
- A hook is provided to allow an LSM to rule on whether or not a
particular message may be posted to a particular queue. This is
given the credentials from the event generator (which may be the
system) and the watch setter. [Wanted by Smack]
I've provided SELinux and Smack with implementations of some of these
hooks.
WHY
===
Key/keyring notifications are desirable because if you have your
kerberos tickets in a file/directory, your Gnome desktop will monitor
that using something like fanotify and tell you if your credentials
cache changes.
However, we also have the ability to cache your kerberos tickets in
the session, user or persistent keyring so that it isn't left around
on disk across a reboot or logout. Keyrings, however, cannot currently
be monitored asynchronously, so the desktop has to poll for it - not
so good on a laptop. This facility will allow the desktop to avoid the
need to poll.
DESIGN DECISIONS
================
- The notification queue is built on top of a standard pipe. Messages
are effectively spliced in. The pipe is opened with a special flag:
pipe2(fds, O_NOTIFICATION_PIPE);
The special flag has the same value as O_EXCL (which doesn't seem
like it will ever be applicable in this context)[?]. It is given up
front to make it a lot easier to prohibit splice&co from accessing
the pipe.
[?] Should this be done some other way? I'd rather not use up a new
O_* flag if I can avoid it - should I add a pipe3() system call
instead?
The pipe is then configured::
ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);
Messages are then read out of the pipe using read().
- It should be possible to allow write() to insert data into the
notification pipes too, but this is currently disabled as the
kernel has to be able to insert messages into the pipe *without*
holding pipe->mutex and the code to make this work needs careful
auditing.
- sendfile(), splice() and vmsplice() are disabled on notification
pipes because of the pipe->mutex issue and also because they
sometimes want to revert what they just did - but one or more
notification messages might've been interleaved in the ring.
- The kernel inserts messages with the wait queue spinlock held. This
means that pipe_read() and pipe_write() have to take the spinlock
to update the queue pointers.
- Records in the buffer are binary, typed and have a length so that
they can be of varying size.
This allows multiple heterogeneous sources to share a common
buffer; there are 16 million types available, of which I've used
just a few, so there is scope for others to be used. Tags may be
specified when a watchpoint is created to help distinguish the
sources.
- Records are filterable as types have up to 256 subtypes that can be
individually filtered. Other filtration is also available.
- Notification pipes don't interfere with each other; each may be
bound to a different set of watches. Any particular notification
will be copied to all the queues that are currently watching for it
- and only those that are watching for it.
- When recording a notification, the kernel will not sleep, but will
rather mark a queue as having lost a message if there's
insufficient space. read() will fabricate a loss notification
message at an appropriate point later.
- The notification pipe is created and then watchpoints are attached
to it, using one of:
keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);
where in both cases, fd indicates the queue and the number after is
a tag between 0 and 255.
- Watches are removed if either the notification pipe is destroyed or
the watched object is destroyed. In the latter case, a message will
be generated indicating the enforced watch removal.
Things I want to avoid:
- Introducing features that make the core VFS dependent on the
network stack or networking namespaces (ie. usage of netlink).
- Dumping all this stuff into dmesg and having a daemon that sits
there parsing the output and distributing it as this then puts the
responsibility for security into userspace and makes handling
namespaces tricky. Further, dmesg might not exist or might be
inaccessible inside a container.
- Letting users see events they shouldn't be able to see.
TESTING AND MANPAGES
====================
- The keyutils tree has a pipe-watch branch that has keyctl commands
for making use of notifications. Proposed manual pages can also be
found on this branch, though a couple of them really need to go to
the main manpages repository instead.
If the kernel supports the watching of keys, then running "make
test" on that branch will cause the testing infrastructure to spawn
a monitoring process on the side that monitors a notifications pipe
for all the key/keyring changes induced by the tests and they'll
all be checked off to make sure they happened.
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch
- A test program is provided (samples/watch_queue/watch_test) that
can be used to monitor for keyrings, mount and superblock events.
Information on the notifications is simply logged to stdout"
* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
smack: Implement the watch_key and post_notification hooks
selinux: Implement the watch_key security hook
keys: Make the KEY_NEED_* perms an enum rather than a mask
pipe: Add notification lossage handling
pipe: Allow buffers to be marked read-whole-or-error for notifications
Add sample notification program
watch_queue: Add a key/keyring notification facility
security: Add hooks to rule on setting a watch
pipe: Add general notification queue support
pipe: Add O_NOTIFICATION_PIPE
security: Add a hook for the point of notification insertion
uapi: General notification queue definitions
Pull execve updates from Eric Biederman:
"Last cycle for the Nth time I ran into bugs and quality of
implementation issues related to exec that could not be easily be
fixed because of the way exec is implemented. So I have been digging
into exec and cleanup up what I can.
I don't think I have exec sorted out enough to fix the issues I
started with but I have made some headway this cycle with 4 sets of
changes.
- promised cleanups after introducing exec_update_mutex
- trivial cleanups for exec
- control flow simplifications
- remove the recomputation of bprm->cred
The net result is code that is a bit easier to understand and work
with and a decrease in the number of lines of code (if you don't count
the added tests)"
* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
exec: Compute file based creds only once
exec: Add a per bprm->file version of per_clear
binfmt_elf_fdpic: fix execfd build regression
selftests/exec: Add binfmt_script regression test
exec: Remove recursion from search_binary_handler
exec: Generic execfd support
exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
exec: Move the call of prepare_binprm into search_binary_handler
exec: Allow load_misc_binary to call prepare_binprm unconditionally
exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
exec: Teach prepare_exec_creds how exec treats uids & gids
exec: Set the point of no return sooner
exec: Move handling of the point of no return to the top level
exec: Run sync_mm_rss before taking exec_update_mutex
exec: Fix spelling of search_binary_handler in a comment
exec: Move the comment from above de_thread to above unshare_sighand
exec: Rename flush_old_exec begin_new_exec
exec: Move most of setup_new_exec into flush_old_exec
exec: In setup_new_exec cache current in the local variable me
...
Today security_bprm_set_creds has several implementations:
apparmor_bprm_set_creds, cap_bprm_set_creds, selinux_bprm_set_creds,
smack_bprm_set_creds, and tomoyo_bprm_set_creds.
Except for cap_bprm_set_creds they all test bprm->called_set_creds and
return immediately if it is true. The function cap_bprm_set_creds
ignores bprm->calld_sed_creds entirely.
Create a new LSM hook security_bprm_creds_for_exec that is called just
before prepare_binprm in __do_execve_file, resulting in a LSM hook
that is called exactly once for the entire of exec. Modify the bits
of security_bprm_set_creds that only want to be called once per exec
into security_bprm_creds_for_exec, leaving only cap_bprm_set_creds
behind.
Remove bprm->called_set_creds all of it's former users have been moved
to security_bprm_creds_for_exec.
Add or upate comments a appropriate to bring them up to date and
to reflect this change.
Link: https://lkml.kernel.org/r/87v9kszrzh.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com> # For the LSM and Smack bits
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Implement the watch_key security hook in Smack to make sure that a key
grants the caller Read permission in order to set a watch on a key.
Also implement the post_notification security hook to make sure that the
notification source is granted Write permission by the watch queue.
For the moment, the watch_devices security hook is left unimplemented as
it's not obvious what the object should be since the queue is global and
didn't previously exist.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Since the meaning of combining the KEY_NEED_* constants is undefined, make
it so that you can't do that by turning them into an enum.
The enum is also given some extra values to represent special
circumstances, such as:
(1) The '0' value is reserved and causes a warning to trap the parameter
being unset.
(2) The key is to be unlinked and we require no permissions on it, only
the keyring, (this replaces the KEY_LOOKUP_FOR_UNLINK flag).
(3) An override due to CAP_SYS_ADMIN.
(4) An override due to an instantiation token being present.
(5) The permissions check is being deferred to later key_permission()
calls.
The extra values give the opportunity for LSMs to audit these situations.
[Note: This really needs overhauling so that lookup_user_key() tells
key_task_permission() and the LSM what operation is being done and leaves
it to those functions to decide how to map that onto the available
permits. However, I don't really want to make these change in the middle
of the notifications patchset.]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
cc: Paul Moore <paul@paul-moore.com>
cc: Stephen Smalley <stephen.smalley.work@gmail.com>
cc: Casey Schaufler <casey@schaufler-ca.com>
cc: keyrings@vger.kernel.org
cc: selinux@vger.kernel.org
commit a269434d2f ("LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH")
left behind this, remove it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
The inode_smack cache is no longer used.
Remove it.
Signed-off-by: Vishal Goel <vishal.goel@samsung.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
"smk_lock" mutex is used during inode instantiation in
smack_d_instantiate()function. It has been used to avoid
simultaneous access on same inode security structure.
Since smack related initialization is done only once i.e during
inode creation. If the inode has already been instantiated then
smack_d_instantiate() function just returns without doing
anything.
So it means mutex lock is required only during inode creation.
But since 2 processes can't create same inodes or files
simultaneously. Also linking or some other file operation can't
be done simultaneously when the file is getting created since
file lookup will fail before dentry inode linkup which is done
after smack initialization.
So no mutex lock is required in inode_smack structure.
It will save memory as well as improve some performance.
If 40000 inodes are created in system, it will save 1.5 MB on
32-bit systems & 2.8 MB on 64-bit systems.
Signed-off-by: Vishal Goel <vishal.goel@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>