Overview
This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
with the help of Roland McGrath, the owner and original writer of this code.
The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads. It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.
This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."
Code Changes
This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine. (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.) To do this, at each tick we now update fields in
signal_struct as well as task_struct. The run_posix_cpu_timers() function
uses those fields to make its decisions.
We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:
struct task_cputime {
cputime_t utime;
cputime_t stime;
unsigned long long sum_exec_runtime;
};
This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:
struct thread_group_cputime {
struct task_cputime totals;
};
struct thread_group_cputime {
struct task_cputime *totals;
};
We also add a new task_cputime substructure directly to signal_struct, to
cache the earliest expiration of process-wide timers, and task_cputime also
replaces the it_*_expires fields of task_struct (used for earliest expiration
of thread timers). The "thread_group_cputime" structure contains process-wide
timers that are updated via account_user_time() and friends. In the non-SMP
case the structure is a simple aggregator; unfortunately in the SMP case that
simplicity was not achievable due to cache-line contention between CPUs (in
one measured case performance was actually _worse_ on a 16-cpu system than
the same test on a 4-cpu system, due to this contention). For SMP, the
thread_group_cputime counters are maintained as a per-cpu structure allocated
using alloc_percpu(). The timer functions update only the timer field in
the structure corresponding to the running CPU, obtained using per_cpu_ptr().
We define a set of inline functions in sched.h that we use to maintain the
thread_group_cputime structure and hide the differences between UP and SMP
implementations from the rest of the kernel. The thread_group_cputime_init()
function initializes the thread_group_cputime structure for the given task.
The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
in the per-cpu structures and fields. The thread_group_cputime_free()
function, also a no-op for UP, in SMP frees the per-cpu structures. The
thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
thread_group_cputime_alloc() if the per-cpu structures haven't yet been
allocated. The thread_group_cputime() function fills the task_cputime
structure it is passed with the contents of the thread_group_cputime fields;
in UP it's that simple but in SMP it must also safely check that tsk->signal
is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
if so, sums the per-cpu values for each online CPU. Finally, the three
functions account_group_user_time(), account_group_system_time() and
account_group_exec_runtime() are used by timer functions to update the
respective fields of the thread_group_cputime structure.
Non-SMP operation is trivial and will not be mentioned further.
The per-cpu structure is always allocated when a task creates its first new
thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
It is freed at process exit via a call to thread_group_cputime_free() from
cleanup_signal().
All functions that formerly summed utime/stime/sum_sched_runtime values from
from all threads in the thread group now use thread_group_cputime() to
snapshot the values in the thread_group_cputime structure or the values in
the task structure itself if the per-cpu structure hasn't been allocated.
Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
The run_posix_cpu_timers() function has been split into a fast path and a
slow path; the former safely checks whether there are any expired thread
timers and, if not, just returns, while the slow path does the heavy lifting.
With the dedicated thread group fields, timers are no longer "rebalanced" and
the process_timer_rebalance() function and related code has gone away. All
summing loops are gone and all code that used them now uses the
thread_group_cputime() inline. When process-wide timers are set, the new
task_cputime structure in signal_struct is used to cache the earliest
expiration; this is checked in the fast path.
Performance
The fix appears not to add significant overhead to existing operations. It
generally performs the same as the current code except in two cases, one in
which it performs slightly worse (Case 5 below) and one in which it performs
very significantly better (Case 2 below). Overall it's a wash except in those
two cases.
I've since done somewhat more involved testing on a dual-core Opteron system.
Case 1: With no itimer running, for a test with 100,000 threads, the fixed
kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
all of which was spent in the system. There were twice as many
voluntary context switches with the fix as without it.
Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
an unmodified kernel can handle), the fixed kernel ran the test in
eight percent of the time (5.8 seconds as opposed to 70 seconds) and
had better tick accuracy (.012 seconds per tick as opposed to .023
seconds per tick).
Case 3: A 4000-thread test with an initial timer tick of .01 second and an
interval of 10,000 seconds (i.e. a timer that ticks only once) had
very nearly the same performance in both cases: 6.3 seconds elapsed
for the fixed kernel versus 5.5 seconds for the unfixed kernel.
With fewer threads (eight in these tests), the Case 1 test ran in essentially
the same time on both the modified and unmodified kernels (5.2 seconds versus
5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
tick versus .025 seconds per tick for the unmodified kernel.
Since the fix affected the rlimit code, I also tested soft and hard CPU limits.
Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
running), the modified kernel was very slightly favored in that while
it killed the process in 19.997 seconds of CPU time (5.002 seconds of
wall time), only .003 seconds of that was system time, the rest was
user time. The unmodified kernel killed the process in 20.001 seconds
of CPU (5.014 seconds of wall time) of which .016 seconds was system
time. Really, though, the results were too close to call. The results
were essentially the same with no itimer running.
Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
(where the hard limit would never be reached) and an itimer running,
the modified kernel exhibited worse tick accuracy than the unmodified
kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
performance was almost indistinguishable. With no itimer running this
test exhibited virtually identical behavior and times in both cases.
In times past I did some limited performance testing. those results are below.
On a four-cpu Opteron system without this fix, a sixteen-thread test executed
in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
the same system with the fix, user and elapsed time were about the same, but
system time dropped to 0.007 seconds. Performance with eight, four and one
thread were comparable. Interestingly, the timer ticks with the fix seemed
more accurate: The sixteen-thread test with the fix received 149543 ticks
for 0.024 seconds per tick, while the same test without the fix received 58720
for 0.061 seconds per tick. Both cases were configured for an interval of
0.01 seconds. Again, the other tests were comparable. Each thread in this
test computed the primes up to 25,000,000.
I also did a test with a large number of threads, 100,000 threads, which is
impossible without the fix. In this case each thread computed the primes only
up to 10,000 (to make the runtime manageable). System time dominated, at
1546.968 seconds out of a total 2176.906 seconds (giving a user time of
629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
accurate. There is obviously no comparable test without the fix.
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a bug and a philosophical decision about who handles errors.
security_context_to_sid_core() was leaking a context in the common case.
This was causing problems on fedora systems which recently have started
making extensive use of this function.
In discussion it was decided that if string_to_context_struct() had an
error it was its own responsibility to clean up any mess it created
along the way.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
During the use of a dev_cgroup, we should guarantee the corresponding
cgroup won't be deleted (i.e. via rmdir). This can be done through
css_get(&dev_cgroup->css), but here we can just get and use the dev_cgroup
under rcu_read_lock.
And also remove checking NULL dev_cgroup, it won't be NULL since a task
always belongs to a cgroup.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags
the target process if that is not the current process and it is trying to
change its own flags in a different way at the same time.
__capable() is using neither atomic ops nor locking to protect t->flags. This
patch removes __capable() and introduces has_capability() that doesn't set
PF_SUPERPRIV on the process being queried.
This patch further splits security_ptrace() in two:
(1) security_ptrace_may_access(). This passes judgement on whether one
process may access another only (PTRACE_MODE_ATTACH for ptrace() and
PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
current is the parent.
(2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only,
and takes only a pointer to the parent process. current is the child.
In Smack and commoncap, this uses has_capability() to determine whether
the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
This does not set PF_SUPERPRIV.
Two of the instances of __capable() actually only act on current, and so have
been changed to calls to capable().
Of the places that were using __capable():
(1) The OOM killer calls __capable() thrice when weighing the killability of a
process. All of these now use has_capability().
(2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
whether the parent was allowed to trace any process. As mentioned above,
these have been split. For PTRACE_ATTACH and /proc, capable() is now
used, and for PTRACE_TRACEME, has_capability() is used.
(3) cap_safe_nice() only ever saw current, so now uses capable().
(4) smack_setprocattr() rejected accesses to tasks other than current just
after calling __capable(), so the order of these two tests have been
switched and capable() is used instead.
(5) In smack_file_send_sigiotask(), we need to allow privileged processes to
receive SIGIO on files they're manipulating.
(6) In smack_task_wait(), we let a process wait for a privileged process,
whether or not the process doing the waiting is privileged.
I've tested this with the LTP SELinux and syscalls testscripts.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
Given a hosed SELinux config in which a system never loads policy or
disables SELinux we currently just return -EINVAL for anyone trying to
read /proc/mounts. This is a configuration problem but we can certainly
be more graceful. This patch just ignores -EINVAL when displaying LSM
options and causes /proc/mounts display everything else it can. If
policy isn't loaded the obviously there are no options, so we aren't
really loosing any information here.
This is safe as the only other return of EINVAL comes from
security_sid_to_context_core() in the case of an invalid sid. Even if a
FS was mounted with a now invalidated context that sid should have been
remapped to unlabeled and so we won't hit the EINVAL and will work like
we should. (yes, I tested to make sure it worked like I thought)
Signed-off-by: Eric Paris <eparis@redhat.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: James Morris <jmorris@namei.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits)
[PATCH] fix RLIM_NOFILE handling
[PATCH] get rid of corner case in dup3() entirely
[PATCH] remove remaining namei_{32,64}.h crap
[PATCH] get rid of indirect users of namei.h
[PATCH] get rid of __user_path_lookup_open
[PATCH] f_count may wrap around
[PATCH] dup3 fix
[PATCH] don't pass nameidata to __ncp_lookup_validate()
[PATCH] don't pass nameidata to gfs2_lookupi()
[PATCH] new (local) helper: user_path_parent()
[PATCH] sanitize __user_walk_fd() et.al.
[PATCH] preparation to __user_walk_fd cleanup
[PATCH] kill nameidata passing to permission(), rename to inode_permission()
[PATCH] take noexec checks to very few callers that care
Re: [PATCH 3/6] vfs: open_exec cleanup
[patch 4/4] vfs: immutable inode checking cleanup
[patch 3/4] fat: dont call notify_change
[patch 2/4] vfs: utimes cleanup
[patch 1/4] vfs: utimes: move owner check into inode_change_ok()
[PATCH] vfs: use kstrdup() and check failing allocation
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
netns: fix ip_rt_frag_needed rt_is_expired
netfilter: nf_conntrack_extend: avoid unnecessary "ct->ext" dereferences
netfilter: fix double-free and use-after free
netfilter: arptables in netns for real
netfilter: ip{,6}tables_security: fix future section mismatch
selinux: use nf_register_hooks()
netfilter: ebtables: use nf_register_hooks()
Revert "pkt_sched: sch_sfq: dump a real number of flows"
qeth: use dev->ml_priv instead of dev->priv
syncookies: Make sure ECN is disabled
net: drop unused BUG_TRAP()
net: convert BUG_TRAP to generic WARN_ON
drivers/net: convert BUG_TRAP to generic WARN_ON
The FAT_IOCTL_SET_ATTRIBUTES ioctl() calls notify_change() to change
the file mode before changing the inode attributes. Replace with
explicit calls to security_inode_setattr(), fat_setattr() and
fsnotify_change().
This is equivalent to the original. The reason it is needed, is that
later in the series we move the immutable check into notify_change().
That would break the FAT_IOCTL_SET_ATTRIBUTES ioctl, as it needs to
perform the mode change regardless of the immutability of the file.
[Fix error if fat is built as a module. Thanks to OGAWA Hirofumi for
noticing.]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the tracehook_tracer_task() hook to consolidate all forms of
"Who is using ptrace on me?" logic. This is used for "TracerPid:" in
/proc and for permission checks. We also clean up the selinux code the
called an identical accessor.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently this list is protected with a simple spinlock, even for reading
from one. This is OK, but can be better.
Actually I want it to be better very much, since after replacing the
OpenVZ device permissions engine with the cgroup-based one I noticed, that
we set 12 default device permissions for each newly created container (for
/dev/null, full, terminals, ect devices), and people sometimes have up to
20 perms more, so traversing the ~30-40 elements list under a spinlock
doesn't seem very good.
Here's the RCU protection for white-list - dev_whitelist_item-s are added
and removed under the devcg->lock, but are looked up in permissions
checking under the rcu_read_lock.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch converts devcgroup_access_write() from a raw file handler
into a handler for the cgroup write_string() method. This allows some
boilerplate copying/locking/checking to be removed and simplifies the
cleanup path, since these functions are performed by the cgroups
framework before calling the handler.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filesystem capabilities have come of age. Remove the experimental tag for
configuring filesystem capabilities.
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When cap_bset suppresses some of the forced (fP) capabilities of a file,
it is generally only safe to execute the program if it understands how to
recognize it doesn't have enough privilege to work correctly. For legacy
applications (fE!=0), which have no non-destructive way to determine that
they are missing privilege, we fail to execute (EPERM) any executable that
requires fP capabilities, but would otherwise get pP' < fP. This is a
fail-safe permission check.
For some discussion of why it is problematic for (legacy) privileged
applications to run with less than the set of capabilities requested for
them, see:
http://userweb.kernel.org/~morgan/sendmail-capabilities-war-story.html
With this iteration of this support, we do not include setuid-0 based
privilege protection from the bounding set. That is, the admin can still
(ab)use the bounding set to suppress the privileges of a setuid-0 program.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 811f379927.
From Eric Paris:
"Please drop this patch for now. It deadlocks on ntfs-3g. I need to
rework it to handle fuse filesystems better. (casey was right)"
The register security hook is no longer required, as the capability
module is always registered. LSMs wishing to stack capability as
a secondary module should do so explicitly.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Remove the dummy module and make the "capability" module the default.
Compile and boot tested.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
The sb_get_mnt_opts() hook is unused, and is superseded by the
sb_show_options() hook.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: James Morris <jmorris@namei.org>
This patch causes SELinux mount options to show up in /proc/mounts. As
with other code in the area seq_put errors are ignored. Other LSM's
will not have their mount options displayed until they fill in their own
security_sb_show_options() function.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: James Morris <jmorris@namei.org>
Currently if a FS is mounted for which SELinux policy does not define an
fs_use_* that FS will either be genfs labeled or not labeled at all.
This decision is based on the existence of a genfscon rule in policy and
is irrespective of the capabilities of the filesystem itself. This
patch allows the kernel to check if the filesystem supports security
xattrs and if so will use those if there is no fs_use_* rule in policy.
An fstype with a no fs_use_* rule but with a genfs rule will use xattrs
if available and will follow the genfs rule.
This can be particularly interesting for things like ecryptfs which
actually overlays a real underlying FS. If we define excryptfs in
policy to use xattrs we will likely get this wrong at times, so with
this path we just don't need to define it!
Overlay ecryptfs on top of NFS with no xattr support:
SELinux: initialized (dev ecryptfs, type ecryptfs), uses genfs_contexts
Overlay ecryptfs on top of ext4 with xattr support:
SELinux: initialized (dev ecryptfs, type ecryptfs), uses xattr
It is also useful as the kernel adds new FS we don't need to add them in
policy if they support xattrs and that is how we want to handle them.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Fix several warnings generated by sparse of the form
"returning void-valued expression".
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Use do_each_thread as a proper do/while block. Sparse complained.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Remove unused and shadowed addrlen variable. Picked up by sparse.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Paul Moore <paul.moore@hp.com>
I've gotten complaints and reports about people not understanding the
meaning of the current unknown class/perm handling the kernel emits on
every policy load. Hopefully this will make make it clear to everyone
the meaning of the message and won't waste a printk the user won't care
about anyway on systems where the kernel and the policy agree on
everything.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
On Mon, 2008-06-09 at 01:24 -0700, Andrew Morton wrote:
> Getting a few of these with FC5:
>
> SELinux: context_struct_compute_av: unrecognized class 69
> SELinux: context_struct_compute_av: unrecognized class 69
>
> one came out when I logged in.
>
> No other symptoms, yet.
Change handling of invalid classes by SELinux, reporting class values
unknown to the kernel as errors (w/ ratelimit applied) and handling
class values unknown to policy as normal denials.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
We used to protect against races of policy load in security_load_policy
by using the load_mutex. Since then we have added a new mutex,
sel_mutex, in sel_write_load() which is always held across all calls to
security_load_policy we are covered and can safely just drop this one.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
The class_to_string array is referenced by tclass. My code mistakenly
was using tclass - 1. If the proceeding class is a userspace class
rather than kernel class this may cause a denial/EINVAL even if unknown
handling is set to allow. The bug shouldn't be allowing excess
privileges since those are given based on the contents of another array
which should be correctly referenced.
At this point in time its pretty unlikely this is going to cause
problems. The most recently added kernel classes which could be
affected are association, dccp_socket, and peer. Its pretty unlikely
any policy with handle_unknown=allow doesn't have association and
dccp_socket undefined (they've been around longer than unknown handling)
and peer is conditionalized on a policy cap which should only be defined
if that class exists in policy.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Fix an endianness bug in the handling of network node addresses by
SELinux. This yields no change on little endian hardware but fixes
the incorrect handling on big endian hardware. The network node
addresses are stored in network order in memory by checkpolicy, not in
cpu/host order, and thus should not have cpu_to_le32/le32_to_cpu
conversions applied upon policy write/read unlike other data in the
policy.
Bug reported by John Weeks of Sun, who noticed that binary policy
files built from the same policy source on x86 and sparc differed and
tracked it down to the ipv4 address handling in checkpolicy.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Simplify and improve the robustness of the SELinux ioctl checking by
using the "access mode" bits of the ioctl command to determine the
permission check rather than dealing with individual command values.
This removes any knowledge of specific ioctl commands from SELinux
and follows the same guidance we gave to Smack earlier.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Enable processes with CAP_MAC_ADMIN + mac_admin permission in policy
to get undefined contexts on inodes. This extends the support for
deferred mapping of security contexts in order to permit restorecon
and similar programs to see the raw file contexts unknown to the
system policy in order to check them.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Enable security modules to distinguish reading of process state via
proc from full ptrace access by renaming ptrace_may_attach to
ptrace_may_access and adding a mode argument indicating whether only
read access or full attach access is requested. This allows security
modules to permit access to reading process state without granting
full ptrace access. The base DAC/capability checking remains unchanged.
Read access to /proc/pid/mem continues to apply a full ptrace attach
check since check_mem_permission() already requires the current task
to already be ptracing the target. The other ptrace checks within
proc for elements like environ, maps, and fds are changed to pass the
read mode instead of attach.
In the SELinux case, we model such reading of process state as a
reading of a proc file labeled with the target process' label. This
enables SELinux policy to permit such reading of process state without
permitting control or manipulation of the target process, as there are
a number of cases where programs probe for such information via proc
but do not need to be able to control the target (e.g. procps,
lsof, PolicyKit, ConsoleKit). At present we have to choose between
allowing full ptrace in policy (more permissive than required/desired)
or breaking functionality (or in some cases just silencing the denials
via dontaudit rules but this can hide genuine attacks).
This version of the patch incorporates comments from Casey Schaufler
(change/replace existing ptrace_may_attach interface, pass access
mode), and Chris Wright (provide greater consistency in the checking).
Note that like their predecessors __ptrace_may_attach and
ptrace_may_attach, the __ptrace_may_access and ptrace_may_access
interfaces use different return value conventions from each other (0
or -errno vs. 1 or 0). I retained this difference to avoid any
changes to the caller logic but made the difference clearer by
changing the latter interface to return a bool rather than an int and
by adding a comment about it to ptrace.h for any future callers.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: James Morris <jmorris@namei.org>
Remove inherit field from inode_security_struct, per Stephen Smalley:
"Let's just drop inherit altogether - dead field."
Signed-off-by: James Morris <jmorris@namei.org>
reorder inode_security_struct to remove padding on 64 bit builds
size reduced from 72 to 64 bytes increasing objects per slab to 64.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: James Morris <jmorris@namei.org>
Formatting and syntax changes
whitespace, tabs to spaces, trailing space
put open { on same line as struct def
remove unneeded {} after if statements
change printk("Lu") to printk("llu")
convert asm/uaccess.h to linux/uaacess.h includes
remove unnecessary asm/bug.h includes
convert all users of simple_strtol to strict_strtol
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Fix a sleeping function called from invalid context bug by moving allocation
to the callers prior to taking the policy rdlock.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Introduce SELinux support for deferred mapping of security contexts in
the SID table upon policy reload, and use this support for inode
security contexts when the context is not yet valid under the current
policy. Only processes with CAP_MAC_ADMIN + mac_admin permission in
policy can set undefined security contexts on inodes. Inodes with
such undefined contexts are treated as having the unlabeled context
until the context becomes valid upon a policy reload that defines the
context. Context invalidation upon policy reload also uses this
support to save the context information in the SID table and later
recover it upon a subsequent policy reload that defines the context
again.
This support is to enable package managers and similar programs to set
down file contexts unknown to the system policy at the time the file
is created in order to better support placing loadable policy modules
in packages and to support build systems that need to create images of
different distro releases with different policies w/o requiring all of
the contexts to be defined or legal in the build host policy.
With this patch applied, the following sequence is possible, although
in practice it is recommended that this permission only be allowed to
specific program domains such as the package manager.
# rmdir baz
# rm bar
# touch bar
# chcon -t foo_exec_t bar # foo_exec_t is not yet defined
chcon: failed to change context of `bar' to `system_u:object_r:foo_exec_t': Invalid argument
# mkdir -Z system_u:object_r:foo_exec_t baz
mkdir: failed to set default file creation context to `system_u:object_r:foo_exec_t': Invalid argument
# cat setundefined.te
policy_module(setundefined, 1.0)
require {
type unconfined_t;
type unlabeled_t;
}
files_type(unlabeled_t)
allow unconfined_t self:capability2 mac_admin;
# make -f /usr/share/selinux/devel/Makefile setundefined.pp
# semodule -i setundefined.pp
# chcon -t foo_exec_t bar # foo_exec_t is not yet defined
# mkdir -Z system_u:object_r:foo_exec_t baz
# ls -Zd bar baz
-rw-r--r-- root root system_u:object_r:unlabeled_t bar
drwxr-xr-x root root system_u:object_r:unlabeled_t baz
# cat foo.te
policy_module(foo, 1.0)
type foo_exec_t;
files_type(foo_exec_t)
# make -f /usr/share/selinux/devel/Makefile foo.pp
# semodule -i foo.pp # defines foo_exec_t
# ls -Zd bar baz
-rw-r--r-- root root user_u:object_r:foo_exec_t bar
drwxr-xr-x root root system_u:object_r:foo_exec_t baz
# semodule -r foo
# ls -Zd bar baz
-rw-r--r-- root root system_u:object_r:unlabeled_t bar
drwxr-xr-x root root system_u:object_r:unlabeled_t baz
# semodule -i foo.pp
# ls -Zd bar baz
-rw-r--r-- root root user_u:object_r:foo_exec_t bar
drwxr-xr-x root root system_u:object_r:foo_exec_t baz
# semodule -r setundefined foo
# chcon -t foo_exec_t bar # no longer defined and not allowed
chcon: failed to change context of `bar' to `system_u:object_r:foo_exec_t': Invalid argument
# rmdir baz
# mkdir -Z system_u:object_r:foo_exec_t baz
mkdir: failed to set default file creation context to `system_u:object_r:foo_exec_t': Invalid argument
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
# cat devices.list
c 1:3 r
# echo 'c 1:3 w' > sub/devices.allow
# cat sub/devices.list
c 1:3 w
As illustrated, the parent group has no write permission to /dev/null, so
it's child should not be allowed to add this write permission.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# echo "b $((0x7fffffff)):$((0x80000000)) rwm" > devices.allow
# cat devices.list
b 214748364:-21474836 rwm
though a major/minor number of 0x800000000 is meaningless, we
should not cast it to a negative value.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# cat /devcg/devices.list
a *:* rwm
# echo a > devices.allow
# cat /devcg/devices.list
a *:* rwm
a 0:0 rwm
This is odd and maybe confusing. With this patch, writing 'a' to
devices.allow will add 'a *:* rwm' to the whitelist.
Also a few fixes and updates to the document.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The filesystem capability support meaning for CAP_SETPCAP is less powerful
than the non-filesystem capability support. As such, when filesystem
capabilities are configured, we should not permit CAP_SETPCAP to 'enhance'
the current process through strace manipulation of a child process.
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dummy module is used by folk that run security conscious code(!?). A
feature of such code (for example, dhclient) is that it tries to operate
with minimum privilege (dropping unneeded capabilities). While the dummy
module doesn't restrict code execution based on capability state, the user
code expects the kernel to appear to support it. This patch adds back
faked support for the PR_SET_KEEPCAPS etc., calls - making the kernel
behave as before 2.6.26.
For details see: http://bugzilla.kernel.org/show_bug.cgi?id=10748
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This semaphore doesn't appear to be used, so remove it.
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Consider you added a 'c foo:bar r' permission to some cgroup and then (a
bit later) 'c'foo:bar w' for it. After this you'll see the
c foo:bar r
c foo:bar w
lines in a devices.list file.
Another example - consider you added 10 'c foo:bar r' permissions to some
cgroup (e.g. by mistake). After this you'll see 10 c foo:bar r lines in
a list file.
This is weird. This situation also has one more annoying consequence.
Having many items in a white list makes permissions checking slower, sine
it has to walk a longer list.
The proposal is to merge permissions for items, that correspond to the
same device.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two functions, that need to get a device_cgroup from a task (they are
devcgroup_inode_permission and devcgroup_inode_mknod) make it in a strange
way:
They get a css_set from task, then a subsys_state from css_set, then a
cgroup from the state and then a subsys_state again from the cgroup.
Besides, the devices_subsys_id is read from memory, whilst there's a
enum-ed constant for it.
Optimize this part a bit:
1. Get the subsys_stats form the task and be done - no 2 extra
dereferences,
2. Use the device_subsys_id constant, not the value from memory
(i.e. one less dereference).
Found while preparing 2.6.26 OpenVZ port.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>