This is effectively a revert of 7b9a7ec565
plus fixing it a different way...
We found, when trying to run an application from an application which
had dropped privs that the kernel does security checks on undefined
capability bits. This was ESPECIALLY difficult to debug as those
undefined bits are hidden from /proc/$PID/status.
Consider a root application which drops all capabilities from ALL 4
capability sets. We assume, since the application is going to set
eff/perm/inh from an array that it will clear not only the defined caps
less than CAP_LAST_CAP, but also the higher 28ish bits which are
undefined future capabilities.
The BSET gets cleared differently. Instead it is cleared one bit at a
time. The problem here is that in security/commoncap.c::cap_task_prctl()
we actually check the validity of a capability being read. So any task
which attempts to 'read all things set in bset' followed by 'unset all
things set in bset' will not even attempt to unset the undefined bits
higher than CAP_LAST_CAP.
So the 'parent' will look something like:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffc000000000
All of this 'should' be fine. Given that these are undefined bits that
aren't supposed to have anything to do with permissions. But they do...
So lets now consider a task which cleared the eff/perm/inh completely
and cleared all of the valid caps in the bset (but not the invalid caps
it couldn't read out of the kernel). We know that this is exactly what
the libcap-ng library does and what the go capabilities library does.
They both leave you in that above situation if you try to clear all of
you capapabilities from all 4 sets. If that root task calls execve()
the child task will pick up all caps not blocked by the bset. The bset
however does not block bits higher than CAP_LAST_CAP. So now the child
task has bits in eff which are not in the parent. These are
'meaningless' undefined bits, but still bits which the parent doesn't
have.
The problem is now in cred_cap_issubset() (or any operation which does a
subset test) as the child, while a subset for valid cap bits, is not a
subset for invalid cap bits! So now we set durring commit creds that
the child is not dumpable. Given it is 'more priv' than its parent. It
also means the parent cannot ptrace the child and other stupidity.
The solution here:
1) stop hiding capability bits in status
This makes debugging easier!
2) stop giving any task undefined capability bits. it's simple, it you
don't put those invalid bits in CAP_FULL_SET you won't get them in init
and you won't get them in any other task either.
This fixes the cap_issubset() tests and resulting fallout (which
made the init task in a docker container untraceable among other
things)
3) mask out undefined bits when sys_capset() is called as it might use
~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.
4) mask out undefined bit when we read a file capability off of disk as
again likely all bits are set in the xattr for forward/backward
compatibility.
This lets 'setcap all+pe /bin/bash; /bin/bash' run
Signed-off-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Dan Walsh <dwalsh@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.
This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.
When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.
The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.
Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.
Changes across threads to the filter pointer includes a barrier.
Based on patches by Will Drewry.
Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
This changes the mode setting helper to allow threads to change the
seccomp mode from another thread. We must maintain barriers to keep
TIF_SECCOMP synchronized with the rest of the seccomp state.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Normally, task_struct.seccomp.filter is only ever read or modified by
the task that owns it (current). This property aids in fast access
during system call filtering as read access is lockless.
Updating the pointer from another task, however, opens up race
conditions. To allow cross-thread filter pointer updates, writes to the
seccomp fields are now protected by the sighand spinlock (which is shared
by all threads in the thread group). Read access remains lockless because
pointer updates themselves are atomic. However, writes (or cloning)
often entail additional checking (like maximum instruction counts)
which require locking to perform safely.
In the case of cloning threads, the child is invisible to the system
until it enters the task list. To make sure a child can't be cloned from
a thread and left in a prior state, seccomp duplication is additionally
moved under the sighand lock. Then parent and child are certain have
the same seccomp state when they exit the lock.
Based on patches by Will Drewry and David Drysdale.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
In preparation for adding seccomp locking, move filter creation away
from where it is checked and applied. This will allow for locking where
no memory allocation is happening. The validation, filter attachment,
and seccomp mode setting can all happen under the future locks.
For extreme defensiveness, I've added a BUG_ON check for the calculated
size of the buffer allocation in case BPF_MAXINSN ever changes, which
shouldn't ever happen. The compiler should actually optimize out this
check since the test above it makes it impossible.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
This adds the new "seccomp" syscall with both an "operation" and "flags"
parameter for future expansion. The third argument is a pointer value,
used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
In addition to the TSYNC flag later in this patch series, there is a
non-zero chance that this syscall could be used for configuring a fixed
argument area for seccomp-tracer-aware processes to pass syscall arguments
in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
for this syscall. Additionally, this syscall uses operation, flags,
and user pointer for arguments because strictly passing arguments via
a user pointer would mean seccomp itself would be unable to trivially
filter the seccomp syscall itself.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Separates the two mode setting paths to make things more readable with
fewer #ifdefs within function bodies.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
To support splitting mode 1 from mode 2, extract the mode checking and
assignment logic into common functions.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
In preparation for having other callers of the seccomp mode setting
logic, split the prctl entry point away from the core logic that performs
seccomp mode setting.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Instead of allowing public keys, with certificates signed by any
key on the system trusted keyring, to be added to a trusted keyring,
this patch further restricts the certificates to those signed only by
builtin keys on the system keyring.
This patch defines a new option 'builtin' for the kernel parameter
'keys_ownerid' to allow trust validation using builtin keys.
Simplified Mimi's "KEYS: define an owner trusted keyring" patch
Changelog v7:
- rename builtin_keys to use_builtin_keys
Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Pull cgroup fixes from Tejun Heo:
"Mostly fixes for the fallouts from the recent cgroup core changes.
The decoupled nature of cgroup dynamic hierarchy management
(hierarchies are created dynamically on mount but may or may not be
reused once unmounted depending on remaining usages) led to more
ugliness being added to kernfs.
Hopefully, this is the last of it"
* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: break kernfs active protection in cpuset_write_resmask()
cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
kernfs: introduce kernfs_pin_sb()
cgroup: fix mount failure in a corner case
cpuset,mempolicy: fix sleeping function called from invalid context
cgroup: fix broken css_has_online_children()
Pull workqueue fixes from Tejun Heo:
"Two workqueue fixes. Both are one liners. One fixes missing uevent
for workqueue files on sysfs. The other one fixes missing zeroing of
NUMA cpu masks which can lead to oopses among other things"
* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: zero cpumask of wq_numa_possible_cpumask on init
workqueue: fix dev_set_uevent_suppress() imbalance
When hot-adding and onlining CPU, kernel panic occurs, showing following
call trace.
BUG: unable to handle kernel paging request at 0000000000001d08
IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
PGD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
[<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
[<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
[<ffffffff811926f1>] new_slab+0x91/0x300
[<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
[<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
[<ffffffff810a3c78>] ? load_balance+0x218/0x890
[<ffffffff8101a679>] ? sched_clock+0x9/0x10
[<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
[<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
[<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
[<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
[<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
[<ffffffff8105d0ec>] do_fork+0xbc/0x360
[<ffffffff8105d3b6>] kernel_thread+0x26/0x30
[<ffffffff81086652>] kthreadd+0x2c2/0x300
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
[<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
[<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask is allocated by
alloc_cpumask_var_node(). And these entries are used without initializing.
So these entries have wrong value.
When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follow:
3592 /* if cpumask is contained inside a NUMA node, we belong to that node */
3593 if (wq_numa_enabled) {
3594 for_each_node(node) {
3595 if (cpumask_subset(pool->attrs->cpumask,
3596 wq_numa_possible_cpumask[node])) {
3597 pool->node = node;
3598 break;
3599 }
3600 }
3601 }
But wq_numa_possible_cpumask[node] does not have correct cpumask. So, wrong
node is selected. As a result, kernel panic occurs.
By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node to initialize them. And the panic disappeared.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: bce903809a ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
Pull irq fixes from Thomas Gleixner:
"A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak
which I introduced in the last round of cleanups :("
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Fix memory leak when calling irq_free_hwirqs()
irqchip: spear_shirq: Fix interrupt offset
irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive
irqchip: armada-370-xp: Mask all interrupts during initialization.
irq_free_hwirqs() always calls irq_free_descs() with a cnt == 0
which makes it a no-op since the interrupt count to free is
decremented in itself.
Fixes: 7b6ef12625
Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1404167084-8070-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
running:
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
kills the uprobe. Along the way he found some other minor bugs and clean ups
that he fixed up making it a total of 4 patches.
Doing unrelated work, I found that the reading of the ftrace trace
file disables all function tracer callbacks. This was fine when ftrace
was the only user, but now that it's used by perf and kprobes, this
is a bug where reading trace can disable kprobes and perf. A very unexpected
side effect and should be fixed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTtMugAAoJEKQekfcNnQGuRs8H/2HhNAy0F1pAFYT5tH2o0z/Z
z6NFn83FUUoesg/Bd+1Dk7VekIZIdo1JxQc67/Y0D0oylPPr31gmVvk2llFPdJV5
xuWiUOuMbDq4Eh3Een8yaOsNsGbcX0lgw9qJEyqAvhJMi5G4dyt3r/g+vFThAyqm
O8Uv74GBGmUmmGMyZuW2r2f2vSEANSXTLbzFqj54fV7FNms1B0MpZ/2AiRcEwCzi
9yMainwrO1VPVSrSJFkW8g4sNl5X1M4tiIT8wGN75YePJVK3FAH8LmUDfpau4+ae
/QGdjNkRDWpZ3mMJPbz3sAT7USnMlSZ5w80Rf/CZuZAe2Ncycvh9AhqcFV0+fj8=
=3h4s
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code
where running:
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
kills the uprobe. Along the way he found some other minor bugs and
clean ups that he fixed up making it a total of 4 patches.
Doing unrelated work, I found that the reading of the ftrace trace
file disables all function tracer callbacks. This was fine when
ftrace was the only user, but now that it's used by perf and kprobes,
this is a bug where reading trace can disable kprobes and perf. A
very unexpected side effect and should be fixed"
* tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove ftrace_stop/start() from reading the trace file
tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
tracing/uprobes: Revert "Support mix of ftrace and perf"
Revert commit 939f04bec1 ("printk: enable interrupts before calling
console_trylock_for_printk()").
Andreas reported:
: None of the post 3.15 kernel boot for me. They all hang at the GRUB
: screen telling me it loaded and started the kernel, but the kernel
: itself stops before it prints anything (or even replaces the GRUB
: background graphics).
939f04bec1 is modest latency reduction. Revert it until we understand
the reason for these failures.
Reported-by: Andreas Bombe <aeb@debian.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Writing to either "cpuset.cpus" or "cpuset.mems" file flushes
cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up
migrating tasks off a cpuset after new resources are added to it.
As cpuset_hotplug_work calls into cgroup core via
cgroup_transfer_tasks(), this flushing adds the dependency to cgroup
core locking from cpuset_write_resmak(). This used to be okay because
cgroup interface files were protected by a different mutex; however,
8353da1f91 ("cgroup: remove cgroup_tree_mutex") simplified the
cgroup core locking and this dependency became a deadlock hazard -
cgroup file removal performed under cgroup core lock tries to drain
on-going file operation which is trying to flush cpuset_hotplug_work
blocked on the same cgroup core lock.
The locking simplification was done because kernfs added an a lot
easier way to deal with circular dependencies involving kernfs active
protection. Let's use the same strategy in cpuset and break active
protection in cpuset_write_resmask(). While it isn't the prettiest,
this is a very rare, likely unique, situation which also goes away on
the unified hierarchy.
The commands to trigger the deadlock warning without the patch and the
lockdep output follow.
localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
localhost:/ # mkdir /cpuset/tmp
localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
localhost:/ # echo 0 > cpuset/tmp/cpuset.mems
localhost:/ # echo $$ > /cpuset/tmp/tasks
localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online
======================================================
[ INFO: possible circular locking dependency detected ]
3.16.0-rc1-0.1-default+ #7 Not tainted
-------------------------------------------------------
kworker/1:0/32649 is trying to acquire lock:
(cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
but task is already holding lock:
(cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (cpuset_hotplug_work){+.+...}:
...
-> #1 (s_active#175){++++.+}:
...
-> #0 (cgroup_mutex){+.+.+.}:
...
other info that might help us debug this:
Chain exists of:
cgroup_mutex --> s_active#175 --> cpuset_hotplug_work
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpuset_hotplug_work);
lock(s_active#175);
lock(cpuset_hotplug_work);
lock(cgroup_mutex);
*** DEADLOCK ***
2 locks held by kworker/1:0/32649:
#0: ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
#1: (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
stack backtrace:
CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
...
Call Trace:
[<ffffffff815a5f78>] dump_stack+0x72/0x8a
[<ffffffff810c263f>] print_circular_bug+0x10f/0x120
[<ffffffff810c481e>] check_prev_add+0x43e/0x4b0
[<ffffffff810c4ee6>] validate_chain+0x656/0x7c0
[<ffffffff810c53d2>] __lock_acquire+0x382/0x660
[<ffffffff810c57a9>] lock_acquire+0xf9/0x170
[<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380
[<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
[<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0
[<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180
[<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630
[<ffffffff810854d4>] process_one_work+0x254/0x520
[<ffffffff810875dd>] worker_thread+0x13d/0x3d0
[<ffffffff8108e0c8>] kthread+0xf8/0x100
[<ffffffff815acaec>] ret_from_fork+0x7c/0xb0
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Li Zefan <lizefan@huawei.com>
Tested-by: Li Zefan <lizefan@huawei.com>
Disabling reading and writing to the trace file should not be able to
disable all function tracing callbacks. There's other users today
(like kprobes and perf). Reading a trace file should not stop those
from happening.
Cc: stable@vger.kernel.org # 3.0+
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The usage of uprobe_buffer_enable() added by dcad1a20 is very wrong,
1. uprobe_buffer_enable() and uprobe_buffer_disable() are not balanced,
_enable() should be called only if !enabled.
2. If uprobe_buffer_enable() fails probe_event_enable() should clear
tp.flags and free event_file_link.
3. If uprobe_register() fails it should do uprobe_buffer_disable().
Link: http://lkml.kernel.org/p/20140627170146.GA18332@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Fixes: dcad1a204f "tracing/uprobes: Fetch args before reserving a ring buffer"
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I do not know why dd9fa555d7 "tracing/uprobes: Move argument fetching
to uprobe_dispatcher()" added the UPROBE_HANDLER_REMOVE, but it looks
wrong.
OK, perhaps it makes sense to avoid store_trace_args() if the tracee is
nacked by uprobe_perf_filter(). But then we should kill the same code
in uprobe_perf_func() and unify the TRACE/PROFILE filtering (we need to
do this anyway to mix perf/ftrace). Until then this code actually adds
the pessimization because uprobe_perf_filter() will be called twice and
return T in likely case.
Link: http://lkml.kernel.org/p/20140627170143.GA18329@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add WARN_ON's into uprobe_unregister() and uprobe_apply() to ensure
that nobody tries to play with the dead uprobe/consumer. This helps
to catch the bugs like the one fixed by the previous patch.
In the longer term we should fix this poorly designed interface.
uprobe_register() should return "struct uprobe *" which should be
passed to apply/unregister. Plus other semantic changes, see the
changelog in commit 41ccba029e.
Link: http://lkml.kernel.org/p/20140627170140.GA18322@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This reverts commit 43fe98913c.
This patch is very wrong. Firstly, this change leads to unbalanced
uprobe_unregister(). Just for example,
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
after that uprobe is dead (unregistered) but the user of ftrace/perf
can't know this, and it looks as if nobody hits this probe.
This would be easy to fix, but there are other reasons why it is not
simple to mix ftrace and perf. If nothing else, they can't share the
same ->consumer.filter. This is fixable too, but probably we need to
fix the poorly designed uprobe_register() interface first. At least
"register" and "apply" should be clearly separated.
Link: http://lkml.kernel.org/p/20140627170136.GA18319@redhat.com
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org # v3.14
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.
Run two instances of this script concurrently:
for ((; ;))
{
mount -t cgroup -o cpuacct xxx /cgroup
umount /cgroup
}
After a while, I saw two mount processes were stuck at retrying, because
they were waiting for a subsystem to become free, but the root associated
with this subsystem never got freed.
This can happen, if thread A is in the process of killing superblock but
hasn't called percpu_ref_kill(), and at this time thread B is mounting
the same cgroup root and finds the root in the root list and performs
percpu_ref_try_get().
To fix this, we try to increase both the refcnt of the superblock and the
percpu refcnt of cgroup root.
v2:
- we should try to get both the superblock refcnt and cgroup_root refcnt,
because cgroup_root may have no superblock assosiated with it.
- adjust/add comments.
tj: Updated comments. Renamed @sb to @pinned_sb.
Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# cat test.sh
#! /bin/bash
mount -t cgroup -o cpu xxx /cgroup
umount /cgroup
mount -t cgroup -o cpu,cpuacct xxx /cgroup
umount /cgroup
# ./test.sh
mount: xxx already mounted or /cgroup busy
mount: according to mtab, xxx is already mounted on /cgroup
It's because the cgroupfs_root of the first mount was under destruction
asynchronously.
Fix this by delaying and then retrying mount for this case.
v3:
- put the refcnt immediately after getting it. (Tejun)
v2:
- use percpu_ref_tryget_live() rather that introducing
percpu_ref_alive(). (Tejun)
- adjust comment.
tj: Updated the comment a bit.
Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When runing with the kernel(3.15-rc7+), the follow bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66
[ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130
[ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
[ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380
[ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
[ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90
[ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
The cause is that cpuset_mems_allowed() try to take
mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(). So that we can avoid this bug.
This patch is a temporary solution that just addresses the bug
mentioned above, can not fix the long-standing issue about cpuset.mems
rebinding on fork():
"When the forker's task_struct is duplicated (which includes
->mems_allowed) and it races with an update to cpuset_being_rebound
in update_tasks_nodemask() then the task's mems_allowed doesn't get
updated. And the child task's mems_allowed can be wrong if the
cpuset's nodemask changes before the child has been added to the
cgroup's tasklist."
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable <stable@vger.kernel.org>
race condition that happens between enabling/disabling syscall tracepoints
and new process creations (the check to go into the ptrace path for a process
can be set when it shouldn't, or not set when it should). Not a major bug
but one that should be fixed and even applied to stable.
The other two patches are cleanup/fixes that are not that critical, but
for an -rc1 release would be nice to have. They both deal with syscall
tracepoints.
It also includes a patch to introduce a new macro for the TRACE_EVENT()
format called __field_struct(). Originally, __field() was used to record
any variable into a trace event, but with the addition of setting the
"is signed" attribute, the check causes anything but a primitive variable
to fail to compile. That is, structs and unions can't be used as they
once were. When the "is signed" check was introduce there were only
primitive variables being recorded. But that will change soon and it
was reported that __field() causes build failures.
To solve the __field() issue, __field_struct() is introduced to allow
trace_events to be able to record complex types too.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTpxx+AAoJEKQekfcNnQGu8dQH/RWwKuq/SDqdqFrYM3y0MGtU
GpWTzaTYfhoO7FE2zNx3JQtfu5Zg+KPzavecEQqYJGdUvd5lopo+GR4TYQ2GIDdA
O4gBgXkxJjYCQ4x9INXf/U8Afd4tfCYSWik2zzWUZaJSLd6pMV/4XbqqZItZ7aqt
bxWa7CgmSMHmfsO5a3xWzX0ybPCmYw/+G15CZDGguoazO1FU6oFAnKnr86PGhquH
eIHzGAO7ka6k+hQwM1gzUi5vIN+OGBZ4HuJCFu/jqaJHMxW0lomL93Gruifk3l4v
r2ierd0FjMSrV1BzCWRR+diWq8W0xwOUutFd1eG3fKlKHZJn4oZfiUHsZw65PFQ=
=Idsk
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing cleanups and fixes from Steven Rostedt:
"This includes three patches from Oleg Nesterov. The first is a fix to
a race condition that happens between enabling/disabling syscall
tracepoints and new process creations (the check to go into the ptrace
path for a process can be set when it shouldn't, or not set when it
should). Not a major bug but one that should be fixed and even
applied to stable.
The other two patches are cleanup/fixes that are not that critical,
but for an -rc1 release would be nice to have. They both deal with
syscall tracepoints.
It also includes a patch to introduce a new macro for the
TRACE_EVENT() format called __field_struct(). Originally, __field()
was used to record any variable into a trace event, but with the
addition of setting the "is signed" attribute, the check causes
anything but a primitive variable to fail to compile. That is,
structs and unions can't be used as they once were. When the "is
signed" check was introduce there were only primitive variables being
recorded. But that will change soon and it was reported that
__field() causes build failures.
To solve the __field() issue, __field_struct() is introduced to allow
trace_events to be able to record complex types too"
* tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add __field_struct macro for TRACE_EVENT()
tracing: syscall_regfunc() should not skip kernel threads
tracing: Change syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread()
tracing: Fix syscall_*regfunc() vs copy_process() race
A 'softlockup' is defined as a bug that causes the kernel to loop in
kernel mode for more than a predefined period to time, without giving
other tasks a chance to run.
Currently, upon detection of this condition by the per-cpu watchdog
task, debug information (including a stack trace) is sent to the system
log.
On some occasions, we have observed that the "victim" rather than the
actual "culprit" (i.e. the owner/holder of the contended resource) is
reported to the user. Often this information has proven to be
insufficient to assist debugging efforts.
To avoid loss of useful debug information, for architectures which
support NMI, this patch makes it possible to improve soft lockup
reporting. This is accomplished by issuing an NMI to each cpu to obtain
a stack trace.
If NMI is not supported we just revert back to the old method. A sysctl
and boot-time parameter is available to toggle this feature.
[dzickus@redhat.com: add CONFIG_SMP in certain areas]
[akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
[mq@suse.cz: fix warning]
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:
divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6
However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.
This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.
This successfully solves the divide by zero issue at the same time.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Wu noticed the following splat on his machine when updating
/proc/sys/kernel/watchdog_thresh:
BUG: sleeping function called from invalid context at mm/slub.c:965
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
#0: (sb_writers#3){.+.+.+}, at: [<ffffffff8117b663>] vfs_write+0x143/0x180
#1: (watchdog_proc_mutex){+.+.+.}, at: [<ffffffff810e02d3>] proc_dowatchdog+0x33/0x110
#2: (cpu_hotplug.lock){.+.+.+}, at: [<ffffffff810589c2>] get_online_cpus+0x32/0x80
Preemption disabled at:[<ffffffff810e0384>] proc_dowatchdog+0xe4/0x110
CPU: 0 PID: 1 Comm: init Not tainted 3.16.0-rc1-testing #34
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
dump_stack+0x4e/0x7a
__might_sleep+0x11d/0x190
kmem_cache_alloc_trace+0x4e/0x1e0
perf_event_alloc+0x55/0x440
perf_event_create_kernel_counter+0x26/0xe0
watchdog_nmi_enable+0x75/0x140
update_timers_all_cpus+0x53/0xa0
proc_dowatchdog+0xe4/0x110
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xad/0x180
SyS_write+0x49/0xb0
system_call_fastpath+0x16/0x1b
NMI watchdog: disabled (cpu0): hardware events not enabled
What happened is after updating the watchdog_thresh, the lockup detector
is restarted to utilize the new value. Part of this process involved
disabling preemption. Once preemption was disabled, perf tried to
allocate a new event (as part of the restart). This caused the above
BUG_ON as you can't sleep with preemption disabled.
The preemption restriction seemed agressive as we are not doing anything
on that particular cpu, but with all the online cpus (which are
protected by the get_online_cpus lock). Remove the restriction and the
BUG_ON goes away.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Peter Wu <peter@lekensteyn.nl>
Tested-by: Peter Wu <peter@lekensteyn.nl>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To allow filtering of huge pages, makedumpfile must be able to identify
them in the dump. This can be done by checking the appropriate page
flag, so communicate its value to makedumpfile through the VMCOREINFO
interface.
There's only one small catch. Depending on how many page flags are
available on a given architecture, this bit can be called PG_head or
PG_compound.
I sent a similar patch back in 2012, but Eric Biederman did not like
using an #ifdef. So, this time I'm adding a common symbol
(PG_head_mask) instead.
See https://lkml.org/lkml/2012/11/28/91 for the previous version.
Signed-off-by: Petr Tesarik <ptesarik@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race between the CPU offline code (within stop-machine) and
the smp-call-function code, which can lead to getting IPIs on the
outgoing CPU, *after* it has gone offline.
Specifically, this can happen when using
smp_call_function_single_async() to send the IPI, since this API allows
sending asynchronous IPIs from IRQ disabled contexts. The exact race
condition is described below.
During CPU offline, in stop-machine, we don't enforce any rule in the
_DISABLE_IRQ stage, regarding the order in which the outgoing CPU and
the other CPUs disable their local interrupts. Due to this, we can
encounter a situation in which an IPI is sent by one of the other CPUs
to the outgoing CPU (while it is *still* online), but the outgoing CPU
ends up noticing it only *after* it has gone offline.
CPU 1 CPU 2
(Online CPU) (CPU going offline)
Enter _PREPARE stage Enter _PREPARE stage
Enter _DISABLE_IRQ stage
=
Got a device interrupt, and | Didn't notice the IPI
the interrupt handler sent an | since interrupts were
IPI to CPU 2 using | disabled on this CPU.
smp_call_function_single_async() |
=
Enter _DISABLE_IRQ stage
Enter _RUN stage Enter _RUN stage
=
Busy loop with interrupts | Invoke take_cpu_down()
disabled. | and take CPU 2 offline
=
Enter _EXIT stage Enter _EXIT stage
Re-enable interrupts Re-enable interrupts
The pending IPI is noted
immediately, but alas,
the CPU is offline at
this point.
This of course, makes the smp-call-function IPI handler code running on
CPU 2 unhappy and it complains about "receiving an IPI on an offline
CPU".
One real example of the scenario on CPU 1 is the block layer's
complete-request call-path:
__blk_complete_request() [interrupt-handler]
raise_blk_irq()
smp_call_function_single_async()
However, if we look closely, the block layer does check that the target
CPU is online before firing the IPI. So in this case, it is actually
the unfortunate ordering/timing of events in the stop-machine phase that
leads to receiving IPIs after the target CPU has gone offline.
In reality, getting a late IPI on an offline CPU is not too bad by
itself (this can happen even due to hardware latencies in IPI
send-receive). It is a bug only if the target CPU really went offline
without executing all the callbacks queued on its list. (Note that a
CPU is free to execute its pending smp-call-function callbacks in a
batch, without waiting for the corresponding IPIs to arrive for each one
of those callbacks).
So, fixing this issue can be broken up into two parts:
1. Ensure that a CPU goes offline only after executing all the
callbacks queued on it.
2. Modify the warning condition in the smp-call-function IPI handler
code such that it warns only if an offline CPU got an IPI *and* that
CPU had gone offline with callbacks still pending in its queue.
Achieving part 1 is straight-forward - just flush (execute) all the
queued callbacks on the outgoing CPU in the CPU_DYING stage[1],
including those callbacks for which the source CPU's IPIs might not have
been received on the outgoing CPU yet. Once we do this, an IPI that
arrives late on the CPU going offline (either due to the race mentioned
above, or due to hardware latencies) will be completely harmless, since
the outgoing CPU would have executed all the queued callbacks before
going offline.
Overall, this fix (parts 1 and 2 put together) additionally guarantees
that we will see a warning only when the *IPI-sender code* is buggy -
that is, if it queues the callback _after_ the target CPU has gone
offline.
[1]. The CPU_DYING part needs a little more explanation: by the time we
execute the CPU_DYING notifier callbacks, the CPU would have already
been marked offline. But we want to flush out the pending callbacks at
this stage, ignoring the fact that the CPU is offline. So restructure
the IPI handler code so that we can by-pass the "is-cpu-offline?" check
in this particular case. (Of course, the right solution here is to fix
CPU hotplug to mark the CPU offline _after_ invoking the CPU_DYING
notifiers, but this requires a lot of audit to ensure that this change
doesn't break any existing code; hence lets go with the solution
proposed above until that is done).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sachin Kamat <sachin.kamat@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uevents are suppressed during attributes registration, but never
restored, so kobject_uevent() does nothing.
Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 226223ab3c
Pull perf fixes from Ingo Molnar:
"This is larger than usual: the main reason are the ARM symbol lookup
speedups that came in late and were hard to resist.
There's also a kprobes fix and various tooling fixes, plus the minimal
re-enablement of the mmap2 support interface"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/kprobes: Fix build errors and blacklist context_track_user
perf tests: Add test for closing dso objects on EMFILE error
perf tests: Add test for caching dso file descriptors
perf tests: Allow reuse of test_file function
perf tests: Spawn child for each test
perf tools: Add dso__data_* interface descriptons
perf tools: Allow to close dso fd in case of open failure
perf tools: Add file size check and factor dso__data_read_offset
perf tools: Cache dso data file descriptor
perf tools: Add global count of opened dso objects
perf tools: Add global list of opened dso objects
perf tools: Add data_fd into dso object
perf tools: Separate dso data related variables
perf tools: Cache register accesses for unwind processing
perf record: Fix to honor user freq/interval properly
perf timechart: Reflow documentation
perf probe: Improve error messages in --line option
perf probe: Improve an error message of perf probe --vars mode
perf probe: Show error code and description in verbose mode
perf probe: Improve error message for unknown member of data structure
...
Pull rtmutex fixes from Thomas Gleixner:
"Another three patches to make the rtmutex code more robust. That's
the last urgent fallout from the big futex/rtmutex investigation"
* 'locking-urgent-for-linus.patch' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtmutex: Plug slow unlock race
rtmutex: Detect changes in the pi lock chain
rtmutex: Handle deadlock detection smarter
syscall_regfunc() ignores the kernel threads because "it has no effect",
see cc3b13c1 "Don't trace kernel thread syscalls" which added this check.
However, this means that a user-space task spawned by call_usermodehelper()
will run without TIF_SYSCALL_TRACEPOINT if sys_tracepoint_refcount != 0.
Remove this check. The unnecessary report from ret_from_fork path mentioned
by cc3b13c1 is no longer possible, see See commit fb45550d76 "make sure
that kernel_thread() callbacks call do_exit() themselves".
A kernel_thread() callback can only return and take the int_ret_from_sys_call
path after do_execve() succeeds, otherwise the kernel will crash. But in this
case it is no longer a kernel thread and thus is needs TIF_SYSCALL_TRACEPOINT.
Link: http://lkml.kernel.org/p/20140413185938.GD20668@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
1. Remove _irqsafe from syscall_regfunc/syscall_unregfunc,
read_lock(tasklist) doesn't need to disable irqs.
2. Change this code to avoid the deprecated do_each_thread()
and use for_each_process_thread() (stolen from the patch
from Frederic).
3. Change syscall_regfunc() to check PF_KTHREAD to skip
the kernel threads, ->mm != NULL is the common mistake.
Note: probably this check should be simply removed, needs
another patch.
[fweisbec@gmail.com: s/do_each_thread/for_each_process_thread/]
Link: http://lkml.kernel.org/p/20140413185918.GC20668@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
syscall_regfunc() and syscall_unregfunc() should set/clear
TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
with copy_process() and miss the new child which was not added to
the process/thread lists yet.
Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
under tasklist.
Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com
Cc: stable@vger.kernel.org # 2.6.33
Fixes: a871bd33a6 "tracing: Add syscall tracepoints"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Fix for an ia64 regression introduced during the 3.11 cycle by a
commit that modified the hardware initialization ordering and made
device discovery fail on some systems.
- Fix for a build problem on systems where the cpufreq-cpu0 driver
is built-in and the cpu-thermal driver is modular from Arnd Bergmann.
- Fix for a recently introduced computational mistake in the
intel_pstate driver that leads to excessive rounding errors from
Doug Smythies.
- Fix for a failure code path in cpufreq_update_policy() that fails
to unlock the locks acquired previously from Aaron Plattner.
- Fix for the cpuidle mvebu driver to use shorter state names which
will prevent the sysfs interface from returning mangled strings.
From Gregory Clement.
- ACPI LPSS driver fix to make sure that the I2C controllers
included in BayTrail SoCs are not held in the reset state while
they are being probed from Mika Westerberg.
- New kernel command line arguments making it possible to build
kernel images with hibernation and kASLR included at the same
time and to select which of them will be used via the command
line (they are still functionally mutually exclusive, though).
From Kees Cook.
- ACPI battery driver quirk for Acer Aspire V5-573G that fails
to send battery status change notifications timely from
Alexander Mezin.
- Two ACPI core cleanups from Christoph Jaeger and Fabian Frederick.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJToufrAAoJEILEb/54YlRxnjAP/2z8wxZ7ORXnRy+fWsSgEWzG
5EJCy0P+G8ui70WlvAd6I1OsKP27wWR/xR1oaxX+5rsOcJyGEjuPXddHn80pkVat
LL/HkdRaIyftOQRolRjEtMNu7go0riJHYB7S1agl7rIihtc+3t5qva/XAPUzBYCN
xlGy8kQ91oG1SW2fWT2jfI4RgZCMduDgFtXe2yCbuDFVmoR06/5l1fW2bn525Vfb
P/PeKshK8jnMLPiAmyr6vm5aV+YrCYm2h76QBxCPse1hP2B2WPwk1v0OGMb5fgp0
yUAKsChEpaFwK86gDUeKPbeHrAhxQd7RyqwLtMGO7yfCuM/hPxgNyp1NwvPc/+Jw
XbKQig4vNSKpDMrjWKNkANQaolqoe/sROZKIx8vvKxpSB0+n1NVMyEp0enb+S9mD
DEFHe2V/iJMBE4jUh68CcygZfTlNBgssfF/jL8aE90qW33cGXb82oB6XrMCzeANl
+TWG3sF9GRbf0YBjXwJCPXIokW9KQk0kW1mSZ+Ixgl9MbSmMiBYW7zXG0/6aOcAk
Ei217UGNgk290FaTwhFou5cK+M99n98qyZO4DQ5Xx2s1zHOQGSftvDp8EvL4fYxy
Tv0IGaGOpwPlAPx4WGGGU5ujmfUXxFTrWQRccRUHaCjcc53gghUr2cxSo8pMeg/R
YK4eE4ui2DlXG8/Vuygy
=TE5Y
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management fixes from Rafael Wysocki:
"These are fixes mostly (ia64 regression related to the ACPI
enumeration of devices, cpufreq regressions, fix for I2C controllers
included in Intel SoCs, mvebu cpuidle driver fix related to sysfs)
plus additional kernel command line arguments from Kees to make it
possible to build kernel images with hibernation and the kernel
address space randomization included simultaneously, a new ACPI
battery driver quirk for a system with a broken BIOS and a couple of
ACPI core cleanups.
Specifics:
- Fix for an ia64 regression introduced during the 3.11 cycle by a
commit that modified the hardware initialization ordering and made
device discovery fail on some systems.
- Fix for a build problem on systems where the cpufreq-cpu0 driver is
built-in and the cpu-thermal driver is modular from Arnd Bergmann.
- Fix for a recently introduced computational mistake in the
intel_pstate driver that leads to excessive rounding errors from
Doug Smythies.
- Fix for a failure code path in cpufreq_update_policy() that fails
to unlock the locks acquired previously from Aaron Plattner.
- Fix for the cpuidle mvebu driver to use shorter state names which
will prevent the sysfs interface from returning mangled strings.
From Gregory Clement.
- ACPI LPSS driver fix to make sure that the I2C controllers included
in BayTrail SoCs are not held in the reset state while they are
being probed from Mika Westerberg.
- New kernel command line arguments making it possible to build
kernel images with hibernation and kASLR included at the same time
and to select which of them will be used via the command line (they
are still functionally mutually exclusive, though). From Kees
Cook.
- ACPI battery driver quirk for Acer Aspire V5-573G that fails to
send battery status change notifications timely from Alexander
Mezin.
- Two ACPI core cleanups from Christoph Jaeger and Fabian Frederick"
* tag 'pm+acpi-3.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: mvebu: Fix the name of the states
cpufreq: unlock when failing cpufreq_update_policy()
intel_pstate: Correct rounding in busy calculation
ACPI: use kstrto*() instead of simple_strto*()
ACPI / processor replace __attribute__((packed)) by __packed
ACPI / battery: add quirk for Acer Aspire V5-573G
ACPI / battery: use callback for setting up quirks
ACPI / LPSS: Take I2C host controllers out of reset
x86, kaslr: boot-time selectable with hibernation
PM / hibernate: introduce "nohibernate" boot parameter
cpufreq: cpufreq-cpu0: fix CPU_THERMAL dependency
ACPI / ia64 / sba_iommu: Restore the working initialization ordering
After running:
# mount -t cgroup cpu xxx /cgroup && mkdir /cgroup/sub && \
rmdir /cgroup/sub && umount /cgroup
I found the cgroup root still existed:
# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 1 1 1
...
It turned out css_has_online_children() is broken.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Sigend-off-by: Tejun Heo <tj@kernel.org>
Changes kASLR from being compile-time selectable (blocked by
CONFIG_HIBERNATION), to being boot-time selectable (with hibernation
available by default) via the "kaslr" kernel command line.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To support using kernel features that are not compatible with hibernation,
this creates the "nohibernate" kernel boot parameter to disable both
hibernation and resume. This allows hibernation support to be a boot-time
choice instead of only a compile-time choice.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When the rtmutex fast path is enabled the slow unlock function can
create the following situation:
spin_lock(foo->m->wait_lock);
foo->m->owner = NULL;
rt_mutex_lock(foo->m); <-- fast path
free = atomic_dec_and_test(foo->refcnt);
rt_mutex_unlock(foo->m); <-- fast path
if (free)
kfree(foo);
spin_unlock(foo->m->wait_lock); <--- Use after free.
Plug the race by changing the slow unlock to the following scheme:
while (!rt_mutex_has_waiters(m)) {
/* Clear the waiters bit in m->owner */
clear_rt_mutex_waiters(m);
owner = rt_mutex_owner(m);
spin_unlock(m->wait_lock);
if (cmpxchg(m->owner, owner, 0) == owner)
return;
spin_lock(m->wait_lock);
}
So in case of a new waiter incoming while the owner tries the slow
path unlock we have two situations:
unlock(wait_lock);
lock(wait_lock);
cmpxchg(p, owner, 0) == owner
mark_rt_mutex_waiters(lock);
acquire(lock);
Or:
unlock(wait_lock);
lock(wait_lock);
mark_rt_mutex_waiters(lock);
cmpxchg(p, owner, 0) != owner
enqueue_waiter();
unlock(wait_lock);
lock(wait_lock);
wakeup_next waiter();
unlock(wait_lock);
lock(wait_lock);
acquire(lock);
If the fast path is disabled, then the simple
m->owner = NULL;
unlock(m->wait_lock);
is sufficient as all access to m->owner is serialized via
m->wait_lock;
Also document and clarify the wakeup_next_waiter function as suggested
by Oleg Nesterov.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This essentially reverts commit:
ecd50f714c ("kprobes, x86: Call exception_enter after kprobes handled")
since it causes build errors with CONFIG_CONTEXT_TRACKING and
that has been made from misunderstandings;
context_track_user_*() don't involve much in interrupt context,
it just returns if in_interrupt() is true.
Instead of changing the do_debug/int3(), this just adds
context_track_user_*() to kprobes blacklist, since those are
still can be called right before kprobes handles int3 and debug
exceptions, and probing those will cause an infinite loop.
Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20140614064711.7865.45957.stgit@kbuild-fedora.novalocal
Signed-off-by: Ingo Molnar <mingo@kernel.org>
if "possible cpus" is greater than actual CPUs (including offline CPUs).
Namhyung Kim did some reviews of the patches I sent this merge window and
found a memory leak and had a few clean ups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTmcYfAAoJEKQekfcNnQGusVIH/2wvfh1D4Mu+qCUsG7HqMLWM
wHhTHcJsiE5rBpfcvc+XoLLBMGn9IKCeClGG59KYJGzznbJHHLmk1dy4qdSqNAel
POTEtGh+AUX0wFBZtVLl2AesFZRsbtaxoNqD0ZBLhkHCV9FBEKbsm8uDCtJIf8wR
vDDz27nPXkiFWX+wM9Z+tFSL7rohxvwREKlQjfk5Z9plyUcURE8GDbrWl870ZSJX
uYzSYzioCyYdy/cpasDuTX22/I+nNUxA4dWTeyciigYC+6IqLJRIwqEXJ4RjYmfm
v9+hf6KkqF8CY6mDjObLkKmy0gubPBQ8Op1G/3ChluUrjkwdffgb6oiLb352yHY=
=XD1A
-----END PGP SIGNATURE-----
Merge tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing cleanups and bugfixes from Steven Rostedt:
"One bug fix that goes back to 3.10. Accessing a non existent buffer
if "possible cpus" is greater than actual CPUs (including offline
CPUs).
Namhyung Kim did some reviews of the patches I sent this merge window
and found a memory leak and had a few clean ups"
* tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix check of ftrace_trace_arrays list_empty() check
tracing: Fix leak of per cpu max data in instances
tracing: Cleanup saved_cmdlines_size changes
ring-buffer: Check if buffer exists before polling
Pull more scheduler updates from Ingo Molnar:
"Second round of scheduler changes:
- try-to-wakeup and IPI reduction speedups, from Andy Lutomirski
- continued power scheduling cleanups and refactorings, from Nicolas
Pitre
- misc fixes and enhancements"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Delete extraneous extern for to_ratio()
sched/idle: Optimize try-to-wake-up IPI
sched/idle: Simplify wake_up_idle_cpu()
sched/idle: Clear polling before descheduling the idle thread
sched, trace: Add a tracepoint for IPI-less remote wakeups
cpuidle: Set polling in poll_idle
sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)
sched: Rename capacity related flags
sched: Final power vs. capacity cleanups
sched: Remove remaining dubious usage of "power"
sched: Let 'struct sched_group_power' care about CPU capacity
sched/fair: Disambiguate existing/remaining "capacity" usage
sched/fair: Change "has_capacity" to "has_free_capacity"
sched/fair: Remove "power" from 'struct numa_stats'
sched: Fix signedness bug in yield_to()
sched/fair: Use time_after() in record_wakee()
sched/balancing: Reduce the rate of needless idle load balancing
sched/fair: Fix unlocked reads of some cfs_b->quota/period