Add basic support for more than one hstate in hugetlbfs. This is the key
to supporting multiple hugetlbfs page sizes at once.
- Rather than a single hstate, we now have an array, with an iterator
- default_hstate continues to be the struct hstate which we use by default
- Add functions for architectures to register new hstates
[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
a similar manner to the reservations taken for MAP_SHARED mappings. The
reserve count is accounted both globally and on a per-VMA basis for
private mappings. This guarantees that a process that successfully calls
mmap() will successfully fault all pages in the future unless fork() is
called.
The characteristics of private mappings of hugetlbfs files behaviour after
this patch are;
1. The process calling mmap() is guaranteed to succeed all future faults until
it forks().
2. On fork(), the parent may die due to SIGKILL on writes to the private
mapping if enough pages are not available for the COW. For reasonably
reliable behaviour in the face of a small huge page pool, children of
hugepage-aware processes should not reference the mappings; such as
might occur when fork()ing to exec().
3. On fork(), the child VMAs inherit no reserves. Reads on pages already
faulted by the parent will succeed. Successful writes will depend on enough
huge pages being free in the pool.
4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
and at fault time otherwise.
Before this patch, all reads or writes in the child potentially needs page
allocations that can later lead to the death of the parent. This applies
to reads and writes of uninstantiated pages as well as COW. After the
patch it is only a write to an instantiated page that causes problems.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds proper extern declarations for five variables in
include/linux/vmstat.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove the :vim=ft=help tag from trace files.
I used them years ago to syntax-highlight traces and forgot about this hack.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: hrtick_enabled() should use cpu_active()
sched, x86: clean up hrtick implementation
sched: fix build error, provide partition_sched_domains() unconditionally
sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
cpu hotplug: Make cpu_active_map synchronization dependency clear
cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
sched: rework of "prioritize non-migratable tasks over migratable ones"
sched: reduce stack size in isolated_cpu_setup()
Revert parts of "ftrace: do not trace scheduler functions"
Fixed up conflicts in include/asm-x86/thread_info.h (due to the
TIF_SINGLESTEP unification vs TIF_HRTICK_RESCHED removal) and
kernel/sched_fair.c (due to cpu_active_map vs for_each_cpu_mask_nr()
introduction).
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
cpumask: Provide a generic set of CPUMASK_ALLOC macros
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
Revert "cpumask: introduce new APIs"
cpumask: make for_each_cpu_mask a bit smaller
net: Pass reference to cpumask variable in net/sunrpc/svc.c
...
Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
* 'core/softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
softlockup: fix invalid proc_handler for softlockup_panic
softlockup: fix watchdog task wakeup frequency
softlockup: fix watchdog task wakeup frequency
softlockup: show irqtrace
softlockup: print a module list on being stuck
softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
softlockup: fix false positives on nohz if CPU is 100% idle for more than 60 seconds
softlockup: fix softlockup_thresh fix
softlockup: fix softlockup_thresh unaligned access and disable detection at runtime
softlockup: allow panic on lockup
This adds a fast path for 64-bit syscall entry and exit when
TIF_SYSCALL_AUDIT is set, but no other kind of syscall tracing.
This path does not need to save and restore all registers as
the general case of tracing does. Avoiding the iret return path
when syscall audit is enabled helps performance a lot.
Signed-off-by: Roland McGrath <roland@redhat.com>
The bug was reported and analysed by Mark McLoughlin <markmc@redhat.com>,
the patch is based on his and Roland's suggestions.
posix_timer_event() always rewrites the pre-allocated siginfo before sending
the signal. Most of the written info is the same all the time, but memset(0)
is very wrong. If ->sigq is queued we can race with collect_signal() which
can fail to find this siginfo looking at .si_signo, or copy_siginfo() can
copy the wrong .si_code/si_tid/etc.
In short, sys_timer_settime() can in fact stop the active timer, or the user
can receive the siginfo with the wrong .si_xxx values.
Move "memset(->info, 0)" from posix_timer_event() to alloc_posix_timer(),
change send_sigqueue() to set .si_overrun = 0 when ->sigq is not queued.
It would be nice to move the whole sigq->info initialization from send to
create path, but this is not easy to do without uglifying timer_create()
further.
As Roland rightly pointed out, we need more cleanups/fixes here, see the
"FIXME" comment in the patch. Hopefully this patch makes sense anyway, and
it can mask the most bad implications.
Reported-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Mark McLoughlin <markmc@redhat.com>
Cc: Oliver Pinter <oliver.pntr@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: stable@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
kernel/posix-timers.c | 17 +++++++++++++----
kernel/signal.c | 1 +
2 files changed, 14 insertions(+), 4 deletions(-)
Since 15a647eba9 set_irq_wake returned -ENXIO
if another device had it already enabled. Zero is the right value to
return in this case. Moreover the change to desc->status was not reverted
if desc->chip->set_wake returned an error.
Signed-off-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>
Acked-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton reported this s390 allmodconfig build failure:
kernel/built-in.o: In function `hrtick_start_fair':
sched.c:(.text+0x69c6): undefined reference to `__smp_call_function_single'
the reason is that s390 is not a generic-ipi SMP platform yet, while
the hrtick code relies on it. Fix the dependency.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
remove CONFIG_KMOD from core kernel code
remove CONFIG_KMOD from lib
remove CONFIG_KMOD from sparc64
rework try_then_request_module to do less in non-modular kernels
remove mention of CONFIG_KMOD from documentation
make CONFIG_KMOD invisible
modules: Take a shortcut for checking if an address is in a module
module: turn longs into ints for module sizes
Shrink struct module: CONFIG_UNUSED_SYMBOLS ifdefs
module: reorder struct module to save space on 64 bit builds
module: generic each_symbol iterator function
module: don't use stop_machine for waiting rmmod
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (79 commits)
arm: bus_id -> dev_name() and dev_set_name() conversions
sparc64: fix up bus_id changes in sparc core code
3c59x: handle pci_name() being const
MTD: handle pci_name() being const
HP iLO driver
sysdev: Convert the x86 mce tolerant sysdev attribute to generic attribute
sysdev: Add utility functions for simple int/ulong variable sysdev attributes
sysdev: Pass the attribute to the low level sysdev show/store function
driver core: Suppress sysfs warnings for device_rename().
kobject: Transmit return value of call_usermodehelper() to caller
sysfs-rules.txt: reword API stability statement
debugfs: Implement debugfs_remove_recursive()
HOWTO: change email addresses of James in HOWTO
always enable FW_LOADER unless EMBEDDED=y
uio-howto.tmpl: use unique output names
uio-howto.tmpl: use standard copyright/legal markings
sysfs: don't call notify_change
sysdev: fix debugging statements in registration code.
kobject: should use kobject_put() in kset-example
kobject: reorder kobject to save space on 64 bit builds
...
Always compile request_module when the kernel allows modules.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch keeps track of the boundaries of module allocation, in
order to speed up module_text_address().
Inspired by Arjan's version, which required arch-specific defines:
Various pieces of the kernel (lockdep, latencytop, etc) tend
to store backtraces, sometimes at a relatively high
frequency. In itself this isn't a big performance deal (after
all you're using diagnostics features), but there have been
some complaints from people who have over 100 modules loaded
that this is a tad too slow.
This is due to the new backtracer code which looks at every
slot on the stack to see if it's a kernel/module text address,
so that's 1024 slots. 1024 times 100 modules... that's a lot
of list walking.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This shrinks module.o and each *.ko file.
And finally, structure members which hold length of module
code (four such members there) and count of symbols
are converted from longs to ints.
We cannot possibly have a module where 32 bits won't
be enough to hold such counts.
For one, module loading checks module size for sanity
before loading, so such insanely big module will fail
that test first.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
module.c and module.h conatains code for finding
exported symbols which are declared with EXPORT_UNUSED_SYMBOL,
and this code is compiled in even if CONFIG_UNUSED_SYMBOLS is not set
and thus there can be no EXPORT_UNUSED_SYMBOLs in modules anyway
(because EXPORT_UNUSED_SYMBOL(x) are compiled out to nothing then).
This patch adds required #ifdefs.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Introduce an each_symbol() iterator to avoid duplicating the knowledge
about the 5 different sections containing symbols. Currently only
used by find_symbol(), but will be used by symbol_put_addr() too.
(Includes NULL ptr deref fix by Jiri Kosina <jkosina@suse.cz>)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jiri Kosina <jkosina@suse.cz>
rmmod has a little-used "-w" option, meaning that instead of failing if the
module is in use, it should block until the module becomes unused.
In this case, we don't need to use stop_machine: Max Krasnyansky
indicated that would be useful for SystemTap which loads/unloads new
modules frequently.
Cc: Max Krasnyansky <maxk@qualcomm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This allow to dynamically generate attributes and share show/store
functions between attributes. Right now most attributes are generated
by special macros and lots of duplicated code. With the attribute
passed it's instead possible to attach some data to the attribute
and then use that in shared low level functions to do different things.
I need this for the dynamically generated bank attributes in the x86
machine check code, but it'll allow some further cleanups.
I converted all users in tree to the new show/store prototype. It's a single
huge patch to avoid unbisectable sections.
Runtime tested: x86-32, x86-64
Compiled only: ia64, powerpc
Not compile tested/only grep converted: sh, arm, avr32
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
random uvesafb failures were reported against Gentoo:
http://bugs.gentoo.org/show_bug.cgi?id=222799
and Mihai Moldovan bisected it back to:
> 8f4d37ec07 is first bad commit
> commit 8f4d37ec07
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Fri Jan 25 21:08:29 2008 +0100
>
> sched: high-res preemption tick
Linus suspected it to be hrtick + vm86 interaction and observed:
> Btw, Peter, Ingo: I think that commit is doing bad things. They aren't
> _incorrect_ per se, but they are definitely bad.
>
> Why?
>
> Using random _TIF_WORK_MASK flags is really impolite for doing
> "scheduling" work. There's a reason that arch/x86/kernel/entry_32.S
> special-cases the _TIF_NEED_RESCHED flag: we don't want to exit out of
> vm86 mode unnecessarily.
>
> See the "work_notifysig_v86" label, and how it does that
> "save_v86_state()" thing etc etc.
Right, I never liked having to fiddle with those TIF flags. Initially I
needed it because the hrtimer base lock could not nest in the rq lock.
That however is fixed these days.
Currently the only reason left to fiddle with the TIF flags is remote
wakeups. We cannot program a remote cpu's hrtimer. I've been thinking
about using the new and improved IPI function call stuff to implement
hrtimer_start_on().
However that does require that smp_call_function_single(.wait=0) works
from interrupt context - /me looks at the latest series from Jens - Yes
that does seem to be supported, good.
Here's a stab at cleaning this stuff up ...
Mihai reported test success as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Mihai Moldovan <ionic@ionic.de>
Cc: Michal Januszewski <spock@gentoo.org>
Cc: Antonino Daplas <adaplas@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Optimize various places where a pointer to the cpumask_of_cpu value
will result in reducing stack pressure.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* This patch replaces the dangerous lvalue version of cpumask_of_cpu
with new cpumask_of_cpu_ptr macros. These are patterned after the
node_to_cpumask_ptr macros.
In general terms, if there is a cpumask_of_cpu_map[] then a pointer to
the cpumask_of_cpu_map[cpu] entry is used. The cpumask_of_cpu_map
is provided when there is a large NR_CPUS count, reducing
greatly the amount of code generated and stack space used for
cpumask_of_cpu(). The pointer to the cpumask_t value is needed for
calling set_cpus_allowed_ptr() to reduce the amount of stack space
needed to pass the cpumask_t value.
If there isn't a cpumask_of_cpu_map[], then a temporary variable is
declared and filled in with value from cpumask_of_cpu(cpu) as well as
a pointer variable pointing to this temporary variable. Afterwards,
the pointer is used to reference the cpumask value. The compiler
will optimize out the extra dereference through the pointer as well
as the stack space used for the pointer, resulting in identical code.
A good example of the orthogonal usages is in net/sunrpc/svc.c:
case SVC_POOL_PERCPU:
{
unsigned int cpu = m->pool_to[pidx];
cpumask_of_cpu_ptr(cpumask, cpu);
*oldmask = current->cpus_allowed;
set_cpus_allowed_ptr(current, cpumask);
return 1;
}
case SVC_POOL_PERNODE:
{
unsigned int node = m->pool_to[pidx];
node_to_cpumask_ptr(nodecpumask, node);
*oldmask = current->cpus_allowed;
set_cpus_allowed_ptr(current, nodecpumask);
return 1;
}
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we disable a screaming irq we never see it again. If the irq
line is shared or if the driver half works this is a real pain. So
periodically poll the handlers for screaming interrupts.
I use a timer instead of the classic irq poll technique of working off
the timer interrupt because when we use the local apic timers
note_interrupt is never called (bug?). Further on a system with
dynamic ticks the timer interrupt might not even fire unless there is
a timer telling it it needs to.
I forced this case on my test system with an e1000 nic and my ssh
session remained responsive despite the interrupt handler only being
called every 10th of a second.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When PREEMPT_TRACER and IRQSOFF_TRACER are both configured and irqsoff
tracer is running, the preempt_off sections might also be traced.
Thanks to Andrew Morton for pointing out my mistake of spin_lock disabling
interrupts while he was reviewing ftrace.txt. Seems that my example I used
actually hit this bug.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The type of softlockup_panic is int, but the proc_handler is
proc_doulongvec_minmax(). This handler is for unsigned long.
This handler should be proc_dointvec_minmax().
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jack Ren and Eric Miao tracked down the following long standing
problem in the NOHZ code:
scheduler switch to idle task
enable interrupts
Window starts here
----> interrupt happens (does not set NEED_RESCHED)
irq_exit() stops the tick
----> interrupt happens (does set NEED_RESCHED)
return from schedule()
cpu_idle(): preempt_disable();
Window ends here
The interrupts can happen at any point inside the race window. The
first interrupt stops the tick, the second one causes the scheduler to
rerun and switch away from idle again and we end up with the tick
disabled.
The fact that it needs two interrupts where the first one does not set
NEED_RESCHED and the second one does made the bug obscure and extremly
hard to reproduce and analyse. Kudos to Jack and Eric.
Solution: Limit the NOHZ functionality to the idle loop to make sure
that we can not run into such a situation ever again.
cpu_idle()
{
preempt_disable();
while(1) {
tick_nohz_stop_sched_tick(1); <- tell NOHZ code that we
are in the idle loop
while (!need_resched())
halt();
tick_nohz_restart_sched_tick(); <- disables NOHZ mode
preempt_enable_no_resched();
schedule();
preempt_disable();
}
}
In hindsight we should have done this forever, but ...
/me grabs a large brown paperbag.
Debugged-by: Jack Ren <jack.ren@marvell.com>,
Debugged-by: eric miao <eric.y.miao@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is v2, it's a little deference from v1 that I
had send to lkml.
use ACCESS_ONCE
use rcu_batch_after/rcu_batch_before for batch # comparison.
rcutorture test result:
(hotplugs: do cpu-online/offline once per second)
No CONFIG_NO_HZ: OK, 12hours
No CONFIG_NO_HZ, hotplugs: OK, 12hours
CONFIG_NO_HZ=y: OK, 24hours
CONFIG_NO_HZ=y, hotplugs: Failed.
(Failed also without my patch applied, exactly the same bug occurred,
http://lkml.org/lkml/2008/7/3/24)
v1's email thread:
http://lkml.org/lkml/2008/6/2/539
v1's description:
The code/algorithm of the implement of current callbacks-processing
is very efficient and technical. But when I studied it and I found
a disadvantage:
In multi-CPU systems, when a new RCU callback is being
queued(call_rcu[_bh]), this callback will be invoked after the grace
period for the batch with batch number = rcp->cur+2 has completed
very very likely in current implement. Actually, this callback can be
invoked after the grace period for the batch with
batch number = rcp->cur+1 has completed. The delay of invocation means
that latency of synchronize_rcu() is extended. But more important thing
is that the callbacks usually free memory, and these works are delayed
too! it's necessary for reclaimer to free memory as soon as
possible when left memory is few.
A very simple way can solve this problem:
a field(struct rcu_head::batch) is added to record the batch number for
the RCU callback. And when a new RCU callback is being queued, we
determine the batch number for this callback(head->batch = rcp->cur+1)
and we move this callback to rdp->donelist if we find
that head->batch <= rcp->completed when we process callbacks.
This simple way reduces the wait time for invocation a lot. (about
2.5Grace Period -> 1.5Grace Period in average in multi-CPU systems)
This is my algorithm. But I do not add any field for struct rcu_head
in my implement. We just need to memorize the last 2 batches and
their batch number, because these 2 batches include all entries that
for whom the grace period hasn't completed. So we use a special
linked-list rather than add a field.
Please see the comment of struct rcu_data.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use a batch number(rcp->pending) instead of a flag(rcp->next_pending)
rcu_start_batch() need to change this flag, so mb()s is needed
for memory-access safe.
but(after this patch applied) rcu_start_batch() do not change
this batch number(rcp->pending), rcp->pending is managed by
__rcu_process_callbacks only, and troublesome mb()s are eliminated.
And codes look simpler and clearer.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix inc_rt_tasks() to not declare variable 'rq' if it's not needed. It is
declared if CONFIG_SMP or CONFIG_RT_GROUP_SCHED, but only used if CONFIG_SMP.
This is a consequence of patch 1f11eb6a8b plus
patch 1100ac91b6.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This goes on top of the cpu_active_map (take 2) patch.
Currently we depend on the stop_machine to provide nescessesary
synchronization for the cpu_active_map updates.
As Dmitry Adamushko pointed this is fragile and is not much clearer
than the previous scheme. In other words we do not want to depend on
the internal stop machine operation here.
So make the synchronization rules clear by doing synchronize_sched()
after clearing out cpu active bit.
Tested on quad-Core2 with:
while true; do
for i in 1 2 3; do
echo 0 > /sys/devices/system/cpu/cpu$i/online
done
for i in 1 2 3; do
echo 1 > /sys/devices/system/cpu/cpu$i/online
done
done
and
stress -c 200
No lockdep, preempt or other complaints.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is based on Linus' idea of creating cpu_active_map that prevents
scheduler load balancer from migrating tasks to the cpu that is going
down.
It allows us to simplify domain management code and avoid unecessary
domain rebuilds during cpu hotplug event handling.
Please ignore the cpusets part for now. It needs some more work in order
to avoid crazy lock nesting. Although I did simplfy and unify domain
reinitialization logic. We now simply call partition_sched_domains() in
all the cases. This means that we're using exact same code paths as in
cpusets case and hence the test below cover cpusets too.
Cpuset changes to make rebuild_sched_domains() callable from various
contexts are in the separate patch (right next after this one).
This not only boots but also easily handles
while true; do make clean; make -j 8; done
and
while true; do on-off-cpu 1; done
at the same time.
(on-off-cpu 1 simple does echo 0/1 > /sys/.../cpu1/online thing).
Suprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
this on right now in gnome-terminal and things are moving just fine.
Also this is running with most of the debug features enabled (lockdep,
mutex, etc) no BUG_ONs or lockdep complaints so far.
I believe I addressed all of the Dmitry's comments for original Linus'
version. I changed both fair and rt balancer to mask out non-active cpus.
And replaced cpu_is_offline() with !cpu_active() in the main scheduler
code where it made sense (to me).
Signed-off-by: Max Krasnyanskiy <maxk@qualcomm.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Cc: dmitry.adamushko@gmail.com
Cc: pj@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
(1) handle in a generic way all cases when a newly woken-up task is
not migratable (not just a corner case when "rt_se->nr_cpus_allowed ==
1")
(2) if current is to be preempted, then make sure "p" will be picked
up by pick_next_task_rt().
i.e. move task's group at the head of its list as well.
currently, it's not a case for the group-scheduling case as described
here: http://www.ussg.iu.edu/hypermail/linux/kernel/0807.0/0134.html
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ftrace: do not trace library functions
ftrace: do not trace scheduler functions
ftrace: fix lockup with MAXSMP
ftrace: fix merge buglet
Xen save/restore needs bits of code enabled by PM_SLEEP, and PM_SLEEP
depends on PM. So make XEN_SAVE_RESTORE depend on PM and PM_SLEEP
depend on XEN_SAVE_RESTORE.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This fixes an arcane bug that we think was a regression introduced
by commit b2b2cbc4b2. When a parent
ignores SIGCHLD (or uses SA_NOCLDWAIT), its children would self-reap
but they don't if it's using ptrace on them. When the parent thread
later exits and ceases to ptrace a child but leaves other live
threads in the parent's thread group, any zombie children are left
dangling. The fix makes them self-reap then, as they would have
done earlier if ptrace had not been in use.
Signed-off-by: Roland McGrath <roland@redhat.com>
This reverts the effect of commit f2cc3eb133
"do_wait: fix security checks". That change reverted the effect of commit
7324328446. The rationale for the original
commit still stands. The inconsistent treatment of children hidden by
ptrace was an unintended omission in the original change and in no way
invalidates its purpose.
This makes do_wait return the error returned by security_task_wait()
(usually -EACCES) in place of -ECHILD when there are some children the
caller would be able to wait for if not for the permission failure. A
permission error will give the user a clue to look for security policy
problems, rather than for mysterious wait bugs.
Signed-off-by: Roland McGrath <roland@redhat.com>
ptrace no longer fiddles with the children/sibling links, and the
old ptrace_children list is gone. Now ptrace, whether of one's own
children or another's via PTRACE_ATTACH, just uses the new ptraced
list instead.
There should be no user-visible difference that matters. The only
change is the order in which do_wait() sees multiple stopped
children and stopped ptrace attachees. Since wait_task_stopped()
was changed earlier so it no longer reorders the children list, we
already know this won't cause any new problems.
Signed-off-by: Roland McGrath <roland@redhat.com>
This breaks out the guts of do_wait into three subfunctions.
The control flow is less nonobvious without so much goto.
do_wait_thread and ptrace_do_wait contain the main work of the outer loop.
wait_consider_task contains the main work of the inner loop.
Signed-off-by: Roland McGrath <roland@redhat.com>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (72 commits)
Revert "x86/PCI: ACPI based PCI gap calculation"
PCI: remove unnecessary volatile in PCIe hotplug struct controller
x86/PCI: ACPI based PCI gap calculation
PCI: include linux/pm_wakeup.h for device_set_wakeup_capable
PCI PM: Fix pci_prepare_to_sleep
x86/PCI: Fix PCI config space for domains > 0
Fix acpi_pm_device_sleep_wake() by providing a stub for CONFIG_PM_SLEEP=n
PCI: Simplify PCI device PM code
PCI PM: Introduce pci_prepare_to_sleep and pci_back_from_sleep
PCI ACPI: Rework PCI handling of wake-up
ACPI: Introduce new device wakeup flag 'prepared'
ACPI: Introduce acpi_device_sleep_wake function
PCI: rework pci_set_power_state function to call platform first
PCI: Introduce platform_pci_power_manageable function
ACPI: Introduce acpi_bus_power_manageable function
PCI: make pci_name use dev_name
PCI: handle pci_name() being const
PCI: add stub for pci_set_consistent_dma_mask()
PCI: remove unused arch pcibios_update_resource() functions
PCI: fix pci_setup_device()'s sprinting into a const buffer
...
Fixed up conflicts in various files (arch/x86/kernel/setup_64.c,
arch/x86/pci/irq.c, arch/x86/pci/pci.h, drivers/acpi/sleep/main.c,
drivers/pci/pci.c, drivers/pci/pci.h, include/acpi/acpi_bus.h) from x86
and ACPI updates manually.
We can avoid taking the BKL in snapshot_ioctl() if pm_mutex is used to prevent
the ioctls from being executed concurrently.
In addition, although it is only possible to open /dev/snapshot once, the task
which has done that may spawn a child that will inherit the open descriptor,
so in theory they can call snapshot_write(), snapshot_read() and
snapshot_release() concurrently. pm_mutex can also be used for mutual
exclusion in such cases.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Push BKL down into ioctl handlers - snapshot device.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
The freezer currently attempts to distinguish kernel threads from
user space tasks by checking if their mm pointer is unset and it
does not send fake signals to kernel threads. However, there are
kernel threads, mostly related to networking, that behave like
user space tasks and may want to be sent a fake signal to be frozen.
Introduce the new process flag PF_FREEZER_NOSIG that will be set
by default for all kernel threads and make the freezer only send
fake signals to the tasks having PF_FREEZER_NOSIG unset. Provide
the set_freezable_with_signal() function to be called by the kernel
threads that want to be sent a fake signal for freezing.
This patch should not change the freezer's observable behavior.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
The ACPI device node for the cpu has already been unregistered
when acpi_processor_handle_eject is called.
Thus we should offline the cpu and continue, rather than a failure here.
http://bugzilla.kernel.org/show_bug.cgi?id=9772
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
This reverts commit 6fbbec428c8e7bb617da2e8a589af2e97bcf3bc4.
Rafael doesnt like it - it breaks various assumptions.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Xen save/restore requires PM_SLEEP to be set without requiring
SUSPEND or HIBERNATION.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a GFP_ATOMIC allocation fails, it falls back to allocating the
data on the stack and converting it to a waiting call.
Make sure we actually wait in this case.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'generic-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (22 commits)
generic-ipi: more merge fallout
generic-ipi: merge fix
x86, visws: use mach-default/entry_arch.h
x86, visws: fix generic-ipi build
generic-ipi: fixlet
generic-ipi: fix s390 build bug
generic-ipi: fix linux-next tree build failure
fix: "smp_call_function: get rid of the unused nonatomic/retry argument"
fix: "smp_call_function: get rid of the unused nonatomic/retry argument"
fix "smp_call_function: get rid of the unused nonatomic/retry argument"
on_each_cpu(): kill unused 'retry' parameter
smp_call_function: get rid of the unused nonatomic/retry argument
sh: convert to generic helpers for IPI function calls
parisc: convert to generic helpers for IPI function calls
mips: convert to generic helpers for IPI function calls
m32r: convert to generic helpers for IPI function calls
arm: convert to generic helpers for IPI function calls
alpha: convert to generic helpers for IPI function calls
ia64: convert to generic helpers for IPI function calls
powerpc: convert to generic helpers for IPI function calls
...
Fix trivial conflicts due to rcu updates in kernel/rcupdate.c manually
* 'timers/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: add PCI ID for 6300ESB force hpet
x86: add another PCI ID for ICH6 force-hpet
kernel-paramaters: document pmtmr= command line option
acpi_pm clccksource: fix printk format warning
nohz: don't stop idle tick if softirqs are pending.
pmtmr: allow command line override of ioport
nohz: reduce jiffies polling overhead
hrtimer: Remove unused variables in ktime_divns()
hrtimer: remove warning in hres_timers_resume
posix-timers: print RT watchdog message
* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (241 commits)
[ARM] 5171/1: ep93xx: fix compilation of modules using clocks
[ARM] 5133/2: at91sam9g20 defconfig file
[ARM] 5130/4: Support for the at91sam9g20
[ARM] 5160/1: IOP3XX: gpio/gpiolib support
[ARM] at91: Fix NAND FLASH timings for at91sam9x evaluation kits.
[ARM] 5084/1: zylonite: Register AC97 device
[ARM] 5085/2: PXA: Move AC97 over to the new central device declaration model
[ARM] 5120/1: pxa: correct platform driver names for PXA25x and PXA27x UDC drivers
[ARM] 5147/1: pxaficp_ir: drop pxa_gpio_mode calls, as pin setting
[ARM] 5145/1: PXA2xx: provide api to control IrDA pins state
[ARM] 5144/1: pxaficp_ir: cleanup includes
[ARM] pxa: remove pxa_set_cken()
[ARM] pxa: allow clk aliases
[ARM] Feroceon: don't disable BPU on boot
[ARM] Orion: LED support for HP mv2120
[ARM] Orion: add RD88F5181L-FXO support
[ARM] Orion: add RD88F5181L-GE support
[ARM] Orion: add Netgear WNR854T support
[ARM] s3c2410_defconfig: update for current build
[ARM] Acer n30: Minor style and indentation fixes.
...
* 'core/softirq' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
softirq: remove irqs_disabled warning from local_bh_enable
softirq: remove initialization of static per-cpu variable
Remove argument from open_softirq which is always NULL
* 'core/printk' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, generic: mark early_printk as asmlinkage
printk: export console_drivers
printk: remember the message level for multi-line output
printk: refactor processing of line severity tokens
printk: don't prefer unsuited consoles on registration
printk: clean up recursion check related static variables
namespacecheck: more kernel/printk.c fixes
namespacecheck: fix kernel printk.c
* 'core/locking' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: fix kernel/fork.c warning
lockdep: fix ftrace irq tracing false positive
lockdep: remove duplicate definition of STATIC_LOCKDEP_MAP_INIT
lockdep: add lock_class information to lock_chain and output it
lockdep: add lock_class information to lock_chain and output it
lockdep: output lock_class key instead of address for forward dependency output
__mutex_lock_common: use signal_pending_state()
mutex-debug: check mutex magic before owner
Fixed up conflict in kernel/fork.c manually
* 'sched/new-API-sched_setscheduler' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: add new API sched_setscheduler_nocheck: add a flag to control access checks
* 'tracing/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (228 commits)
ftrace: build fix for ftraced_suspend
ftrace: separate out the function enabled variable
ftrace: add ftrace_kill_atomic
ftrace: use current CPU for function startup
ftrace: start wakeup tracing after setting function tracer
ftrace: check proper config for preempt type
ftrace: trace schedule
ftrace: define function trace nop
ftrace: move sched_switch enable after markers
ftrace: prevent ftrace modifications while being kprobe'd, v2
fix "ftrace: store mcount address in rec->ip"
mmiotrace broken in linux-next (8-bit writes only)
ftrace: avoid modifying kprobe'd records
ftrace: freeze kprobe'd records
kprobes: enable clean usage of get_kprobe
ftrace: store mcount address in rec->ip
ftrace: build fix with gcc 4.3
namespacecheck: fixes
ftrace: fix "notrace" filtering priority
ftrace: fix printout
...
* 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (76 commits)
sched_clock: and multiplier for TSC to gtod drift
sched_clock: record TSC after gtod
sched_clock: only update deltas with local reads.
sched_clock: fix calculation of other CPU
sched_clock: stop maximum check on NO HZ
sched_clock: widen the max and min time
sched_clock: record from last tick
sched: fix accounting in task delay accounting & migration
sched: add avg-overlap support to RT tasks
sched: terminate newidle balancing once at least one task has moved over
sched: fix warning
sched: build fix
sched: sched_clock_cpu() based cpu_clock(), lockdep fix
sched: export cpu_clock
sched: make sched_{rt,fair}.c ifdefs more readable
sched: bias effective_load() error towards failing wake_affine().
sched: incremental effective_load()
sched: correct wakeup weight calculations
sched: fix mult overflow
sched: update shares on wakeup
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (25 commits)
security: remove register_security hook
security: remove dummy module fix
security: remove dummy module
security: remove unused sb_get_mnt_opts hook
LSM/SELinux: show LSM mount options in /proc/mounts
SELinux: allow fstype unknown to policy to use xattrs if present
security: fix return of void-valued expressions
SELinux: use do_each_thread as a proper do/while block
SELinux: remove unused and shadowed addrlen variable
SELinux: more user friendly unknown handling printk
selinux: change handling of invalid classes (Was: Re: 2.6.26-rc5-mm1 selinux whine)
SELinux: drop load_mutex in security_load_policy
SELinux: fix off by 1 reference of class_to_string in context_struct_compute_av
SELinux: open code sidtab lock
SELinux: open code load_mutex
SELinux: open code policy_rwlock
selinux: fix endianness bug in network node address handling
selinux: simplify ioctl checking
SELinux: enable processes with mac_admin to get the raw inode contexts
Security: split proc ptrace checking into read vs. attach
...
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits)
splice: fix generic_file_splice_read() race with page invalidation
ramfs: enable splice write
drivers/block/pktcdvd.c: avoid useless memset
cdrom: revert commit 22a9189 (cdrom: use kmalloced buffers instead of buffers on stack)
scsi: sr avoids useless buffer allocation
block: blk_rq_map_kern uses the bounce buffers for stack buffers
block: add blk_queue_update_dma_pad
DAC960: push down BKL
pktcdvd: push BKL down into driver
paride: push ioctl down into driver
block: use get_unaligned_* helpers
block: extend queue_flag bitops
block: request_module(): use format string
Add bvec_merge_data to handle stacked devices and ->merge_bvec()
block: integrity flags can't use bit ops on unsigned short
cmdfilter: extend default read filter
sg: fix odd style (extra parenthesis) introduced by cmd filter patch
block: add bounce support to blk_rq_map_user_iov
cfq-iosched: get rid of enable_idle being unused warning
allow userspace to modify scsi command filter on per device basis
...
Enable security modules to distinguish reading of process state via
proc from full ptrace access by renaming ptrace_may_attach to
ptrace_may_access and adding a mode argument indicating whether only
read access or full attach access is requested. This allows security
modules to permit access to reading process state without granting
full ptrace access. The base DAC/capability checking remains unchanged.
Read access to /proc/pid/mem continues to apply a full ptrace attach
check since check_mem_permission() already requires the current task
to already be ptracing the target. The other ptrace checks within
proc for elements like environ, maps, and fds are changed to pass the
read mode instead of attach.
In the SELinux case, we model such reading of process state as a
reading of a proc file labeled with the target process' label. This
enables SELinux policy to permit such reading of process state without
permitting control or manipulation of the target process, as there are
a number of cases where programs probe for such information via proc
but do not need to be able to control the target (e.g. procps,
lsof, PolicyKit, ConsoleKit). At present we have to choose between
allowing full ptrace in policy (more permissive than required/desired)
or breaking functionality (or in some cases just silencing the denials
via dontaudit rules but this can hide genuine attacks).
This version of the patch incorporates comments from Casey Schaufler
(change/replace existing ptrace_may_attach interface, pass access
mode), and Chris Wright (provide greater consistency in the checking).
Note that like their predecessors __ptrace_may_attach and
ptrace_may_attach, the __ptrace_may_access and ptrace_may_access
interfaces use different return value conventions from each other (0
or -errno vs. 1 or 0). I retained this difference to avoid any
changes to the caller logic but made the difference clearer by
changing the latter interface to return a bool rather than an int and
by adding a comment about it to ptrace.h for any future callers.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: James Morris <jmorris@namei.org>
When callbacks are moved from offline cpu to this cpu,
the qlen field of this rdp should be updated.
[ Paul E. McKenney: ]
The effect of this bug would be for force_quiescent_state() to be invoked
when it should not and vice versa -- wasting cycles in the first case
and letting RCU callbacks remain piled up in the second case. The bug
is thus "benign" in that it does not result in premature grace-period
termination, but should of course be fixed nonetheless.
Preemption is disabled by the caller's get_cpu_var(), so we are guaranteed
to remain on the same CPU, as required. The local_irq_disable() is indeed
needed, otherwise, an interrupt might invoke call_rcu() or call_rcu_bh(),
which could cause that interrupt's increment of ->qlen to be lost.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit f18f982ab ("sched: CPU hotplug events must not destroy scheduler
domains created by the cpusets") introduced a hotplug-related problem as
described below:
Upon CPU_DOWN_PREPARE,
update_sched_domains() -> detach_destroy_domains(&cpu_online_map)
does the following:
/*
* Force a reinitialization of the sched domains hierarchy. The domains
* and groups cannot be updated in place without racing with the balancing
* code, so we temporarily attach all running cpus to the NULL domain
* which will prevent rebalancing while the sched domains are recalculated.
*/
The sched-domains should be rebuilt when a CPU_DOWN ops. has been
completed, effectively either upon CPU_DEAD{_FROZEN} (upon success) or
CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
initial state). That's what update_sched_domains() also does but only
for !CPUSETS case.
With f18f982ab, sched-domains' reinitialization is delegated to
CPUSETS code:
cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
rebuild_sched_domains()
Being called for CPU_UP_PREPARE and if its callback is called after
update_sched_domains()), it just negates all the work done by
update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
the sched-domains and that makes it visible for the load-balancer
while the CPU_DOWN ops. is in progress.
__migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
"offline" when this function is called).
try_to_wake_up() is called for one of these tasks from another CPU ->
the load-balancer (wake_idle()) picks up a "dead" CPU and places the
task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
-> oops.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: miaox@cn.fujitsu.com
Cc: rostedt@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
kernel/trace/ftrace.c:1615: error: 'ftraced_suspend' undeclared (first use in this function)
kernel/trace/ftrace.c:1615: error: (Each undeclared identifier is reported only once
kernel/trace/ftrace.c:1615: error: for each function it appears in.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The sched_clock code currently tries to keep all CPU clocks of all CPUS
somewhat in sync. At every clock tick it records the gtod clock and
uses that and jiffies and the TSC to calculate a CPU clock that tries to
stay in sync with all the other CPUs.
ftrace depends heavily on this timer and it detects when this timer
"jumps". One problem is that the TSC and the gtod also drift.
When the TSC is 0.1% faster or slower than the gtod it is very noticeable
in ftrace. To help compensate for this, I've added a multiplier that
tries to keep the CPU clock updating at the same rate as the gtod.
I've tried various ways to get it to be in sync and this ended up being
the most reliable. At every scheduler tick we calculate the new multiplier:
multi = delta_gtod / delta_TSC
This means we perform a 64 bit divide at the tick (once a HZ). A shift
is used to handle the accuracy.
Other methods that failed due to dynamic HZ are:
(not used) multi += (gtod - tsc) / delta_gtod
(not used) multi += (gtod - (last_tsc + delta_tsc)) / delta_gtod
as well as other variants.
This code still allows for a slight drift between TSC and gtod, but
it keeps the damage down to a minimum.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To read the gtod we need to grab the xtime lock for read. Reading the gtod
before the TSC can cause a bigger gab if the xtime lock is contended.
This patch simply reverses the order to read the TSC after the gtod.
The locking in the reading of the gtod handles any barriers one might
think is needed.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reading the CPU clock should try to stay accurate within the CPU.
By reading the CPU clock from another CPU and updating the deltas can
cause unneeded jumps when reading from the local CPU.
This patch changes the code to update the last read TSC only when read
from the local CPU.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The algorithm to calculate the 'now' of another CPU is not correct.
At each scheduler tick, each CPU records the last sched_clock and
gtod (tick_raw and tick_gtod respectively). If the TSC is somewhat the
same in speed between two clocks the algorithm would be:
tick_gtod1 + (now1 - tick_raw1) = tick_gtod2 + (now2 - tick_raw2)
To calculate now2 we would have:
now2 = (tick_gtod1 - tick_gtod2) + (tick_raw2 - tick_raw1) + now1
Currently the algorithm is:
now2 = (tick_gtod1 - tick_gtod2) + (tick_raw1 - tick_raw2) + now1
This solves most of the rest of the issues I've had with timestamps in
ftace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Working with ftrace I would get large jumps of 11 millisecs or more with
the clock tracer. This killed the latencing timings of ftrace and also
caused the irqoff self tests to fail.
What was happening is with NO_HZ the idle would stop the jiffy counter and
before the jiffy counter was updated the sched_clock would have a bad
delta jiffies to compare with the gtod with the maximum.
The jiffies would stop and the last sched_tick would record the last gtod.
On wakeup, the sched clock update would compare the gtod + delta jiffies
(which would be zero) and compare it to the TSC. The TSC would have
correctly (with a stable TSC) moved forward several jiffies. But because the
jiffies has not been updated yet the clock would be prevented from moving
forward because it would appear that the TSC jumped too far ahead.
The clock would then virtually stop, until the jiffies are updated. Then
the next sched clock update would see that the clock was very much behind
since the delta jiffies is now correct. This would then jump the clock
forward by several jiffies.
This caused ftrace to report several milliseconds of interrupts off
latency at every resume from NO_HZ idle.
This patch adds hooks into the nohz code to disable the checking of the
maximum clock update when nohz is in effect. It resumes the max check
when nohz has updated the jiffies again.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With keeping the max and min sched time within one jiffy of the gtod clock
was too tight. Just before a schedule tick the max could easily be hit, as
well as just after a schedule_tick the min could be hit. This caused the
clock to jump around by a jiffy.
This patch widens the minimum to
last gtod + (delta_jiffies ? delta_jiffies - 1 : 0) * TICK_NSECS
and the maximum to
last gtod + (2 + delta_jiffies) * TICK_NSECS
This keeps the minum to gtod or if one jiffy less than delta jiffies
and the maxim 2 jiffies ahead of gtod. This may cause unstable TSCs to be
a bit more sporadic, but it helps keep a clock with a stable TSC working well.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The sched_clock code tries to keep within the gtod time by one tick (jiffy).
The current code mistakenly keeps track of the delta jiffies between
updates of the clock, where the the delta is used to compare with the
number of jiffies that have past since an update of the gtod. The gtod is
updated at each schedule tick not each sched_clock update. After one
jiffy passes the clock is updated fine. But the delta is taken from the
last update so if the next update happens before the next tick the delta
jiffies used will be incorrect.
This patch changes the code to check the delta of jiffies between ticks
and not updates to match the comparison of the updates with the gtod.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the function tracer uses the global tracer_enabled variable that
is used to keep track if the tracer is enabled or not. The function tracing
startup needs to be separated out, otherwise the internal happenings of
the tracer startup is also recorded.
This patch creates a ftrace_function_enabled variable to all the starting
of the function traces to happen after everything has been started.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It has been suggested that I add a way to disable the function tracer
on an oops. This code adds a ftrace_kill_atomic. It is not meant to be
used in normal situations. It will disable the ftrace tracer, but will
not perform the nice shutdown that requires scheduling.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is more of a clean up. Currently the function tracer initializes the
tracer with which ever CPU was last used for tracing. This value isn't
realy useful for function tracing, but at least it should be something other
than a random number.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Enabling the wakeup tracer before enabling the function tracing causes
some strange results due to the dynamic enabling of the functions.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no CONFIG_PREEMPT_DESKTOP. Use the proper entry CONFIG_PREEMPT.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After the sched_clock code has been removed from sched.c we can now trace
the scheduler. The scheduler has a lot of functions that would be worth
tracing.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When CONFIG_FTRACE is not enabled, the tracing_start_functon_trace
and tracing_stop_function_trace should be nops.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We have two markers now that are enabled on sched_switch. One that records
the context switching and the other that records task wake ups. Currently
we enable the tracing first and then set the markers. This causes some
confusing traces:
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
trace-cmd-3973 [00] 115.834817: 3973:120:R + 3: 0:S
trace-cmd-3973 [01] 115.834910: 3973:120:R + 6: 0:S
trace-cmd-3973 [02] 115.834910: 3973:120:R + 9: 0:S
trace-cmd-3973 [03] 115.834910: 3973:120:R + 12: 0:S
trace-cmd-3973 [02] 115.834910: 3973:120:R + 9: 0:S
<idle>-0 [02] 115.834910: 0:140:R ==> 3973:120:R
Here we see that trace-cmd with PID 3973 wakes up task 9 but the next line
shows the idle task doing a context switch to task 3973.
Enabling the tracing to _after_ the markers are set creates a much saner
output:
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [02] 7922.634225: 0:140:R ==> 4790:120:R
trace-cmd-4789 [03] 7922.634225: 0:140:R + 4790:120:R
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In case a cpu goes idle but softirqs are pending only an error message is
printed to the console. It may take a very long time until the pending
softirqs will finally be executed. Worst case would be a hanging system.
With this patch the timer tick just continues and the softirqs will be
executed after the next interrupt. Still a delay but better than a
hanging system.
Currently we have at least two device drivers on s390 which under certain
circumstances schedule a tasklet from process context. This is a reason
why we can end up with pending softirqs when going idle. Fixing these
drivers seems to be non-trivial.
However there is no question that the drivers should be fixed.
This patch shouldn't be considered as a bug fix. It just is intended to
keep a system running even if device drivers are buggy.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Glauber <jan.glauber@de.ibm.com>
Cc: Stefan Weinhuber <wein@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up __migrate_task(): to just have separate "done" and "fail"
cases, instead of that "out" case with random error behavior.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
PREEMPT_RCU without HOTPLUG_CPU is broken. The rcu_online_cpu is called
to initially populate rcu_cpu_online_map with all online CPUs when the
hotplug event handler is installed, and also to populate the map with
CPUs as they come online. The former case is meant to happen with and
without HOTPLUG_CPU, but without HOTPLUG_CPU, the rcu_offline_cpu
function is no-oped -- while it still gets called, it does not set the
rcu CPU map.
With a blank RCU CPU map, grace periods get to tick by completely
oblivious to active RCU read side critical sections. This results in
free-before-grace bugs.
Fix is obvious once the problem is known. (Also, change __devinit to
__cpuinit so the function gets thrown away on !HOTPLUG_CPU kernels).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reported-and-tested-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ Nick is my personal hero of the day - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I think we may have a race between try_to_wake_up() and
migrate_live_tasks() -> move_task_off_dead_cpu() when the later one
may end up looping endlessly.
Interrupts are enabled on other CPUs when migration_call(CPU_DEAD, ...) is
called so we may get a race between try_to_wake_up() and
migrate_live_tasks() -> move_task_off_dead_cpu(). The former one may push
a task out of a dead CPU causing the later one to loop endlessly.
Heiko Carstens observed:
| That's exactly what explains a dump I got yesterday. Thanks for fixing! :)
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: miaox@cn.fujitsu.com
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In http://bugzilla.kernel.org/show_bug.cgi?id=9580 it was pointed out
that the desc->chip checks are extraneous. In fact these are left
overs from early development and can be removed safely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Replace usages of MAX_NUMNODES with nr_node_ids in kernel/sched.c,
where appropriate. This saves some allocated space as well as many
wasted cycles going through node entries that are non-existent.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
C1E on AMD machines is like C3 but without control from the OS. Up to
now we disabled the local apic timer for those machines as it stops
when the CPU goes into C1E. This excludes those machines from high
resolution timers / dynamic ticks, which hurts especially X2 based
laptops.
The current boot time C1E detection has another, more serious flaw
as well: some BIOSes do not enable C1E until the ACPI processor module
is loaded. This causes systems to stop working after that point.
To work nicely with C1E enabled machines we use a separate idle
function, which checks on idle entry whether C1E was enabled in the
Interrupt Pending Message MSR. This allows us to do timer broadcasting
for C1E and covers the late enablement of C1E as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Most places in the kernel that go BUG: print a module list
(which is very useful for doing statistics and finding patterns),
however the softlockup detector does not do this yet.
This patch adds the one line change to fix this gap.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This commit includes a bugfix for the fragile setuid fixup code in the
case that filesystem capabilities are supported (in access()). The effect
of this fix is gated on filesystem capability support because changing
securebits is only supported when filesystem capabilities support is
configured.)
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove all clameter@sgi.com addresses from the kernel tree since they will
become invalid on June 27th. Change my maintainer email address for the
slab allocators to cl@linux-foundation.org (which will be the new email
address for the future).
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On Thu, Jun 19, 2008 at 12:27:14PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-05 at 10:50 +0530, Ankita Garg wrote:
>
> > Thanks Peter for the explanation...
> >
> > I agree with the above and that is the reason why I did not see weird
> > values with cpu_time. But, run_delay still would suffer skews as the end
> > points for delta could be taken on different cpus due to migration (more
> > so on RT kernel due to the push-pull operations). With the below patch,
> > I could not reproduce the issue I had seen earlier. After every dequeue,
> > we take the delta and start wait measurements from zero when moved to a
> > different rq.
>
> OK, so task delay delay accounting is broken because it doesn't take
> migration into account.
>
> What you've done is make it symmetric wrt enqueue, and account it like
>
> cpu0 cpu1
>
> enqueue
> <wait-d1>
> dequeue
> enqueue
> <wait-d2>
> run
>
> Where you add both d1 and d2 to the run_delay,.. right?
>
Thanks for reviewing the patch. The above is exactly what I have done.
> This seems like a good fix, however it looks like the patch will break
> compilation in !CONFIG_SCHEDSTATS && !CONFIG_TASK_DELAY_ACCT, of it
> failing to provide a stub for sched_info_dequeue() in that case.
Fixed. Pl. find the new patch below.
Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: rostedt@goodmis.org
Cc: suresh.b.siddha@intel.com
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dhaval@linux.vnet.ibm.com
Cc: vatsa@linux.vnet.ibm.com
Cc: David Bahi <DBahi@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We have the notion of tracking process-coupling (a.k.a. buddy-wake) via
the p->se.last_wake / p->se.avg_overlap facilities, but it is only used
for cfs to cfs interactions. There is no reason why an rt to cfs
interaction cannot share in establishing a relationhip in a similar
manner.
Because PREEMPT_RT runs many kernel threads as FIFO priority, we often
times have heavy interaction between RT threads waking CFS applications.
This patch offers a substantial boost (50-60%+) in perfomance under those
circumstances.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: npiggin@suse.de
Cc: rostedt@goodmis.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Due to a possible deadlock, the waking of the softirq was pushed outside
of the hrtimer base locks. See commit 0c96c5979a
Unfortunately this allows the task to migrate after setting up the softirq
and raising it. Since softirqs run a queue that is per-cpu we may raise the
softirq on the wrong CPU and this will keep the queued softirq task from
running.
To solve this issue, this patch disables preemption around the releasing
of the hrtimer lock and raising of the softirq.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: fix divide error when trying to configure rt_period to zero
The moment mmiotrace is enabled, I hit a NULL deref in:
IP: [<ffffffff80256e71>] __trace_special+0x17c/0x23a
Call Trace:
[<ffffffff802573cc>] ftrace_special+0x6f/0x9a
[<ffffffff8023e3e4>] down+0x19/0x4a
[<ffffffff80228adc>] acquire_console_sem+0x42/0x58
[<ffffffff8035d273>] con_flush_chars+0x28/0x43
[<ffffffff80354a70>] write_chan+0x22e/0x334
[<ffffffff802244e9>] ? default_wake_function+0x0/0xf
[<ffffffff8035236d>] tty_write+0x195/0x228
[<ffffffff80354842>] ? write_chan+0x0/0x334
[<ffffffff8027c23a>] vfs_write+0xae/0x137
[<ffffffff8027c6e3>] sys_write+0x47/0x70
[<ffffffff8020b1db>] system_call_after_swapgs+0x7b/0x80
which means 'entry' in __trace_special() is NULL.
[ mingo@elte.hu: that ftrace_special() was a leftover. ]
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: proski@gnu.org
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dhaval Giani reported this warning during cpu hotplug stress-tests:
| On running kernel compiles in parallel with cpu hotplug:
|
| WARNING: at arch/x86/kernel/smp.c:118
| native_smp_send_reschedule+0x21/0x36()
| Modules linked in:
| Pid: 27483, comm: cc1 Not tainted 2.6.26-rc7 #1
| [...]
| [<c0110355>] native_smp_send_reschedule+0x21/0x36
| [<c014fe8f>] force_quiescent_state+0x47/0x57
| [<c014fef0>] call_rcu+0x51/0x6d
| [<c01713b3>] __fput+0x130/0x158
| [<c0171231>] fput+0x17/0x19
| [<c016fd99>] filp_close+0x4d/0x57
| [<c016fdff>] sys_close+0x5c/0x97
IMHO the warning is a spurious one.
cpu_online_map is updated by the _cpu_down() using stop_machine_run().
Since force_quiescent_state is invoked from irqs disabled section,
stop_machine_run() won't be executing while a cpu is executing
force_quiescent_state(). Hence the cpu_online_map is stable while we're
in the irq disabled section.
However, a cpu might have been offlined _just_ before we disabled irqs
while entering force_quiescent_state(). And rcu subsystem might not yet
have handled the CPU_DEAD notification, leading to the offlined cpu's
bit being set in the rcp->cpumask.
Hence cpumask = (rcp->cpumask & cpu_online_map) to prevent sending
smp_reschedule() to an offlined CPU.
Here's the timeline:
CPU_A CPU_B
--------------------------------------------------------------
cpu_down(): .
. .
. .
stop_machine(): /* disables preemption, .
* and irqs */ .
. .
. .
take_cpu_down(); .
. .
. .
. .
cpu_disable(); /*this removes cpu .
*from cpu_online_map .
*/ .
. .
. .
restart_machine(); /* enables irqs */ .
------WINDOW DURING WHICH rcp->cpumask is stale ---------------
. call_rcu();
. /* disables irqs here */
. .force_quiescent_state();
.CPU_DEAD: .for_each_cpu(rcp->cpumask)
. . smp_send_reschedule();
. .
. . WARN_ON() for offlined CPU!
.
.
.
rcu_cpu_notify:
.
-------- WINDOW ENDS ------------------------------------------
rcu_offline_cpu() /* Which calls cpu_quiet()
* which removes
* cpu from rcp->cpumask.
*/
If a new batch was started just before calling stop_machine_run(), the
"tobe-offlined" cpu is still present in rcp-cpumask.
During a cpu-offline, from take_cpu_down(), we queue an rt-prio idle
task as the next task to be picked by the scheduler. We also call
cpu_disable() which will disable any further interrupts and remove the
cpu's bit from the cpu_online_map.
Once the stop_machine_run() successfully calls take_cpu_down(), it calls
schedule(). That's the last time a schedule is called on the offlined
cpu, and hence the last time when rdp->passed_quiesc will be set to 1
through rcu_qsctr_inc().
But the cpu_quiet() will be on this cpu will be called only when the
next RCU_SOFTIRQ occurs on this CPU. So at this time, the offlined CPU
is still set in rcp->cpumask.
Now coming back to the idle_task which truely offlines the CPU, it does
check for a pending RCU and raises the softirq, since it will find
rdp->passed_quiesc to be 0 in this case. However, since the cpu is
offline I am not sure if the softirq will trigger on the CPU.
Even if it doesn't the rcu_offline_cpu() will find that rcp->completed
is not the same as rcp->cur, which means that our cpu could be holding
up the grace period progression. Hence we call cpu_quiet() and move
ahead.
But because of the window explained in the timeline, we could still have
a call_rcu() before the RCU subsystem executes it's CPU_DEAD
notification, and we send smp_send_reschedule() to offlined cpu while
trying to force the quiescent states. The appended patch adds comments
and prevents checking for offlined cpu everytime.
cpu_online_map is updated by the _cpu_down() using stop_machine_run().
Since force_quiescent_state is invoked from irqs disabled section,
stop_machine_run() won't be executing while a cpu is executing
force_quiescent_state(). Hence the cpu_online_map is stable while we're
in the irq disabled section.
Reported-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rusty Russel <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The print_timestamp can never be bigger than the touch_timestamp, at
maximum it can be equal. And if it is, the second check for
touch_timestamp + 1 bigger print_timestamp is always true, too.
The check for equality is sufficient as we proceed in one-second-steps
and are at least one second away from the last print-out if we have
another timestamp.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Updating the timestamp more often is pointless as we print the warnings
only if we exceed the threshold. And the check for hung tasks relies on
the last timestamp, so it will keep working correctly, too.
Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently x86_32, sh and cris-v32 provide per-device coherent dma
memory allocator.
However their implementation is nearly identical. Refactor out
common code to be reused by them.
Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes the following warning:
kernel/sched.c:1667: warning: 'cfs_rq_set_shares' defined but not used
This seems the correct way to fix this; cfs_rq_set_shares() is only used
in a single place, which is also inside #ifdef CONFIG_FAIR_GROUP_SCHED.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
kernel/sched.c: In function ‘sched_group_set_shares':
kernel/sched.c:8635: error: implicit declaration of function ‘cfs_rq_set_shares'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'audit.b52' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
[PATCH] remove useless argument type in audit_filter_user()
[PATCH] audit: fix kernel-doc parameter notation
[PATCH] kernel/audit.c: nlh->nlmsg_type is gotten more than once
Vegard Nossum reported:
> WARNING: at kernel/lockdep.c:2738 check_flags+0x142/0x160()
which happens due to:
unsigned long long cpu_clock(int cpu)
{
unsigned long long clock;
unsigned long flags;
raw_local_irq_save(flags);
as lower level functions can take locks, we must not do that, use
proper lockdep-annotated irq save/restore.
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
the CPU hotplug problems (crashes under high-volume unplug+replug
tests) seem to be related to migrate_dead_tasks().
Firstly I added traces to see all tasks being migrated with
migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
pops up (the one with "se == NULL" in the loop of
pick_next_task_fair()) shortly after the traces indicate that some has
been migrated with migrate_dead_tasks()). btw., I can reproduce it
much faster now with just a plain cpu down/up loop.
[disclaimer] Well, unless I'm really missing something important in
this late hour [/desclaimer] pick_next_task() is not something
appropriate for migrate_dead_tasks() :-)
the following change seems to eliminate the problem on my setup
(although, I kept it running only for a few minutes to get a few
messages indicating migrate_dead_tasks() does move tasks and the
system is still ok)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On qemu, the backtrace would show up _after_ the "end of backtrace
testing" message.
This patch changes it to use completions instead, which will guarantee
that no such race exists.
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds saved stack-traces to the backtrace suite of self-tests.
Note that we don't depend on or unconditionally enable CONFIG_STACKTRACE
because not all architectures may have it (and we still want to enable the
other tests for those architectures).
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Measurement shows that the difference between cgroup:/ and cgroup:/foo
wake_affine() results is that the latter succeeds significantly more.
Therefore bias the calculations towards failing the test.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Increase the accuracy of the effective_load values.
Not only consider the current increment (as per the attempted wakeup), but
also consider the delta between when we last adjusted the shares and the
current situation.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It was observed these mults can overflow.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We found that the affine wakeup code needs rather accurate load figures
to be effective. The trouble is that updating the load figures is fairly
expensive with group scheduling. Therefore ratelimit the updating.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In case the domain is empty, pretend there is a single task on each cpu, so
that together with the boost logic we end up giving 1/n shares to each
cpu.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The bias given by source/target_load functions can be very large, disable
it by default to get faster convergence.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Priority looses much of its meaning in a hierarchical context. So don't
use it in balance decisions.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently task_h_load() computes the load of a task and uses that to either
subtract it from the total, or add to it.
However, removing or adding a task need not have any effect on the total load
at all. Imagine adding a task to a group that is local to one cpu - in that
case the total load of that cpu is unaffected.
So properly compute addition/removal:
s_i = S * rw_i / \Sum_j rw_j
s'_i = S * (rw_i + wl) / (\Sum_j rw_j + wg)
then s'_i - s_i gives the change in load.
Where s_i is the shares for cpu i, S the group weight, rw_i the runqueue weight
for that cpu, wl the weight we add (subtract) and wg the weight contribution to
the runqueue.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
doing the load balance will change cfs_rq->load.weight (that's the whole point)
but since that's part of the scale factor, we'll scale back with a different
amount.
Weight getting smaller would result in an inflated moved_load which causes
it to stop balancing too soon.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
find_busiest_group() has some assumptions about task weight being in the
NICE_0_LOAD range. Hierarchical task groups break this assumption - fix this
by replacing it with the average task weight, which will adapt the situation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With hierarchical grouping we can't just compare task weight to rq weight - we
need to scale the weight appropriately.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the fall-back to SCHED_LOAD_SCALE by remembering the previous value of
cpu_avg_load_per_task() - this is useful because of the hierarchical group
model in which task weight can be much smaller.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Finding the least idle cpu is more accurate when done with updated shares.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Re-compute the shares on newidle - so we can make a decision based on
recent data.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While thinking about the previous patch - I realized that using per domain
aggregate load values in load_balance_fair() is wrong. We should use the
load value for that CPU.
By not needing per domain hierarchical load values we don't need to store
per domain aggregate shares, which greatly simplifies all the math.
It basically falls apart in two separate computations:
- per domain update of the shares
- per CPU update of the hierarchical load
Also get rid of the move_group_shares() stuff - just re-compute the shares
again after a successful load balance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We only need to know the task_weight of the busiest rq - nothing to do
if there are no tasks there.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We used to try and contain the loss of 'shares' by playing arithmetic
games. Replace that by noticing that at the top sched_domain we'll
always have the full weight in shares to distribute.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The idea was to balance groups until we've reached the global goal, however
Vatsa rightly pointed out that we might never reach that goal this way -
hence take out this logic.
[ the initial rationale for this 'feature' was to promote max concurrency
within a group - it does not however affect fairness ]
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It was observed that in __update_group_shares_cpu()
rq_weight > aggregate()->rq_weight
This is caused by forks/wakeups in between the initial aggregate pass and
locking of the RQs for load balance. To avoid this situation partially re-do
the aggregation once we have the RQs locked (which avoids new tasks from
appearing).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Keeping the aggregate on the first cpu of the sched domain has two problems:
- it could collide between different sched domains on different cpus
- it could slow things down because of the remote accesses
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
show all the schedstats in /debug/sched_debug as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Uncouple buddy selection from wakeup granularity.
The initial idea was that buddies could run ahead as far as a normal task
can - do this by measuring a pair 'slice' just as we do for a normal task.
This means we can drop the wakeup_granularity back to 5ms.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
with sched_clock_cpu() being reasonably in sync between cpus (max 1 jiffy
difference) use this to provide cpu_clock().
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ok, so why are we in this mess, it was:
1/w
but now we mixed that rw in the mix like:
rw/w
rw being \Sum w suggests: fiddling w, we should also fiddle rw, humm?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
calc_delta_asym() is supposed to do the same as calc_delta_fair() except
linearly shrink the result for negative nice processes - this causes them
to have a smaller preemption threshold so that they are more easily preempted.
The problem is that for task groups se->load.weight is the per cpu share of
the actual task group weight; take that into account.
Also provide a debug switch to disable the asymmetry (which I still don't
like - but it does greatly benefit some workloads)
This would explain the interactivity issues reported against group scheduling.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In file included from /mnt/build/linux-2.6/kernel/sched.c:1496:
/mnt/build/linux-2.6/kernel/sched_rt.c: In function '__enable_runtime':
/mnt/build/linux-2.6/kernel/sched_rt.c:339: warning: unused variable 'rd'
/mnt/build/linux-2.6/kernel/sched_rt.c: In function 'requeue_rt_entity':
/mnt/build/linux-2.6/kernel/sched_rt.c:692: warning: unused variable 'queue'
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
kernel/smp.c: In function 'smp_call_function_mask':
kernel/smp.c:303: error: too many arguments to function 'smp_call_function_single'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's never used and the comments refer to nonatomic and retry
interchangably. So get rid of it.
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This adds kernel/smp.c which contains helpers for IPI function calls. In
addition to supporting the existing smp_call_function() in a more efficient
manner, it also adds a more scalable variant called smp_call_function_single()
for calling a given function on a single CPU only.
The core of this is based on the x86-64 patch from Nick Piggin, lots of
changes since then. "Alan D. Brunelle" <Alan.Brunelle@hp.com> has
contributed lots of fixes and suggestions as well. Also thanks to
Paul E. McKenney <paulmck@linux.vnet.ibm.com> for reviewing RCU usage
and getting rid of the data allocation fallback deadlock.
Acked-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This patch adds some information about when interrupts were last
enabled and disabled to the output of the softlockup detector.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Johannes Weiner <hannes@saeurebad.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The second argument "type" is not used in audit_filter_user(), so I think that type can be removed. If I'm wrong, please tell me.
Signed-off-by: Peng Haitao <penght@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix auditfilter kernel-doc misssing parameter description:
Warning(lin2626-rc3//kernel/auditfilter.c:1551): No description found for parameter 'sessionid'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The first argument "nlh->nlmsg_type" of audit_receive_filter() should be modified to "msg_type" in audit_receive_msg().
Signed-off-by: Peng Haitao <penght@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Fix warning reported by sparse
kernel/kgdb.c:1502:6: warning: symbol 'kgdb_console_write' was not declared.
Should it be static?
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
It is based on x86/master branch of git-x86 tree, and has been tested
on x86_64 platform.
ChangeLog:
v2:
- Enclosing proc file system related code into CONFIG_PROVE_LOCKING.
- Fix nr_chain_hlocks update code.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hidehiro Kawai noticed that sched_setscheduler() can fail in
stop_machine: it calls sched_setscheduler() from insmod, which can
have CAP_SYS_MODULE without CAP_SYS_NICE.
Two cases could have failed, so are changed to sched_setscheduler_nocheck:
kernel/softirq.c:cpu_callback()
- CPU hotplug callback
kernel/stop_machine.c:__stop_machine_run()
- Called from various places, including modprobe()
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi OSHIMA <satoshi.oshima.fk@hitachi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Avoid modifying the mcount call-site if there is a kprobe installed on it.
These records are not marked as failed however. This allowed the filter
rules on them to remain up-to-date. Whenever the kprobe on the corresponding
record is removed, the record gets updated as normal.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Let records identified as being kprobe'd be marked as "frozen". The trouble
with records which have a kprobe installed on their mcount call-site is
that they don't get updated. So if such a function which is currently being
traced gets its tracing disabled due to a new filter rule (or because it
was added to the notrace list) then it won't be updated and continue being
traced. This patch allows scanning of all frozen records during tracing to
check if they should be traced.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Record the address of the mcount call-site. Currently all archs except sparc64
record the address of the instruction following the mcount call-site. Some
general cleanups are entailed. Storing mcount addresses in rec->ip enables
looking them up in the kprobe hash table later on to check if they're kprobe'd.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: davem@davemloft.net
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch addresses a very sporadic pi-futex related failure in
highly threaded java apps on large SMP systems.
David Holmes reported that the pi_state consistency check in
lookup_pi_state triggered with his test application. This means that
the kernel internal pi_state and the user space futex variable are out
of sync. First we assumed that this is a user space data corruption,
but deeper investigation revieled that the problem happend because the
pi-futex code is not handling a fault in the futex_lock_pi path when
the user space variable needs to be fixed up.
The fault happens when a fork mapped the anon memory which contains
the futex readonly for COW or the page got swapped out exactly between
the unlock of the futex and the return of either the new futex owner
or the task which was the expected owner but failed to acquire the
kernel internal rtmutex. The current futex_lock_pi() code drops out
with an inconsistent in case it faults and returns -EFAULT to user
space. User space has no way to fixup that state.
When we wrote this code we thought that we could not drop the hash
bucket lock at this point to handle the fault.
After analysing the code again it turned out to be wrong because there
are only two tasks involved which might modify the pi_state and the
user space variable:
- the task which acquired the rtmutex
- the pending owner of the pi_state which did not get the rtmutex
Both tasks drop into the fixup_pi_state() function before returning to
user space. The first task which acquired the hash bucket lock faults
in the fixup of the user space variable, drops the spinlock and calls
futex_handle_fault() to fault in the page. Now the second task could
acquire the hash bucket lock and tries to fixup the user space
variable as well. It either faults as well or it succeeds because the
first task already faulted the page in.
One caveat is to avoid a double fixup. After returning from the fault
handling we reacquire the hash bucket lock and check whether the
pi_state owner has been modified already.
Reported-by: David Holmes <david.holmes@sun.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Holmes <david.holmes@sun.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 73 insertions(+), 20 deletions(-)
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
rcupreempt: remove export of rcu_batches_completed_bh
cpuset: limit the input of cpuset.sched_relax_domain_level
Simplify the code and fix the boundary condition of
wait_for_completion_timeout(,0).
We can kill the first __remove_wait_queue() as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
There's no need to use local_irq_save() over local_irq_disable() in the
local_bh_enable code since it is a bug to call it with irqs disabled and
do_softirq will enable irqs if there is any pending work.
Consolidate the code from local_bh_enable and ..._ip to avoid having a
disconnect between them in the warnings they trigger that is currently
there.
Also always trigger the warning on in_irq(), not just in the
trace-irqflags case.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Michael Buesch <mb@bu3sch.de>
Cc: David Ellingsworth <david@identd.dyndns.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It seems that the current implementaton of wait_for_completion_timeout()
has a small problem under very high load for the common pattern:
if (!wait_for_completion_timeout(&done, timeout))
/* handle failure */
because the implementation very roughly does (lots of code deleted to
show the basic flow):
static inline long __sched
do_wait_for_common(struct completion *x, long timeout, int state)
{
if (x->done)
return timeout;
do {
timeout = schedule_timeout(timeout);
if (!timeout)
return timeout;
} while (!x->done);
return timeout;
}
so if the system is very busy and x->done is not set when
do_wait_for_common() is entered, it is possible that the first call to
schedule_timeout() returns 0 because the task doing wait_for_completion
doesn't get rescheduled for a long time, even if it is woken up early
enough.
In this case, wait_for_completion_timeout() returns 0 without even
checking x->done again, and the code above falls into its failure case
purely for scheduler reasons, even if the hardware event or whatever was
being waited for happened early enough.
It would make sense to add an extra test to do_wait_for() in the timeout
case and return 1 if x->done is actually set.
A quick audit (not exhaustive) of wait_for_completion_timeout() callers
seems to indicate that no one actually cares about the return value in
the success case -- they just test for 0 (timed out) versus non-zero
(wait succeeded).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch records array of lock_class into lock_chain, and export
lock_chain information via /proc/lockdep_chains.
It is based on x86/master branch of git-x86 tree, and has been tested
on x86_64 platform.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
So if the group ever gets throttled, it will never wake up again.
Reported-by: "Daniel K." <dk@uw.no>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
So if the group ever gets throttled, it will never wake up again.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Daniel K." <dk@uw.no>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: "Daniel K." <dk@uw.no>
Now we exceed the runtime and get throttled - the period rollover tick
will subtract the cpu quota from the runtime and check if we're below
quota. However with this cpu having a very small portion of the runtime
it will not refresh as fast as it should.
Therefore, also rebalance the runtime when we're throttled.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Daniel K." <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch corrects the incorrect value of per process run-queue wait
time reported by delay statistics. The anomaly was due to the following
reason. When a process leaves the CPU and immediately starts waiting for
CPU on the runqueue (which means it remains in the TASK_RUNNABLE state),
the time of re-entry into the run-queue is never recorded. Due to this,
the waiting time on the runqueue from this point of re-entry upto the
next time it hits the CPU is not accounted for. This is solved by
recording the time of re-entry of a process leaving the CPU in the
sched_info_depart() function IF the process will go back to waiting on
the run-queue. This IF condition is verified by checking whether the
process is still in the TASK_RUNNABLE state.
The patch was tested on 2.6.26-rc6 using two simple CPU hog programs.
The values noted prior to the fix did not account for the time spent on
the runqueue waiting. After the fix, the correct values were reported
back to user space.
Signed-off-by: Bharath Ravi <bharathravi1@gmail.com>
Signed-off-by: Madhava K R <madhavakr@gmail.com>
Cc: dhaval@linux.vnet.ibm.com
Cc: vatsa@in.ibm.com
Cc: balbir@in.ibm.com
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch re-institutes the ability to build rcutorture directly into
the Linux kernel. The reason that this capability was removed was that
this could result in your kernel being pretty much useless, as rcutorture
would be running starting from early boot. This problem has been avoided
by (1) making rcutorture run only three seconds of every six by default,
(2) adding a CONFIG_RCU_TORTURE_TEST_RUNNABLE that permits rcutorture
to be quiesced at boot time, and (3) adding a sysctl in /proc named
/proc/sys/kernel/rcutorture_runnable that permits rcutorture to be
quiesced and unquiesced when built into the kernel.
Please note that this /proc file is -not- available when rcutorture
is built as a module. Please also note that to get the earlier
take-no-prisoners behavior, you must use the boot command line to set
rcutorture's "stutter" parameter to zero.
The rcutorture quiescing mechanism is currently quite crude: loops
in each rcutorture process that poll a global variable once per tick.
Suggestions for improvement are welcome. The default action will
be to reduce the polling rate to a few times per second.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The touch_nmi_watchdog() routine on x86 ultimately calls
touch_softlockup_watchdog(). The problem is that to touch the
softlockup watchdog, the cpu_clock code has to be called which could
involve multiple cpu locks and can lead to a hard hang if one of the
locks is held by a processor that is not going to return anytime soon
(such as could be the case with kgdb or perhaps even with some other
kind of exception).
This patch causes the public version of the
touch_softlockup_watchdog() to defer the cpu clock access to a later
point.
The test case for this problem is to use the following kernel config
options:
CONFIG_KGDB_TESTS=y
CONFIG_KGDB_TESTS_ON_BOOT=y
CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000"
It should be noted that kgdb test suite and these options were not
available until 2.6.26-rc2, so it was necessary to patch the kgdb
test suite during the bisection.
I would consider this patch a regression fix because the problem first
appeared in commit 27ec440779 when some
logic was added to try to periodically sync the clocks. It was
possible to work around this particular problem by simply not
performing the sync anytime the system was in a critical context.
This was ok until commit 3e51f33fcc,
which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some
multi-cpu locks to sync the clocks. It became clear that accessing
this code from an nmi was the source of the lockups. Avoiding the
access to the low level clock code from an code inside the NMI
processing also fixed the problem with the 27ec44... commit.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In rcupreempt, rcu_batches_completed_bh is defined as a static inline in
the header file. This does not need to be exported, and not only that,
this breaks my PPC build.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: paulus@samba.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
for inputs outside this range.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.
Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In tick_task_rt() we first call update_curr_rt() which can dequeue a runqueue
due to it running out of runtime, and then we try to requeue it, of it also
having exhausted its RR quota. Obviously requeueing something that is no longer
on the runqueue will not have the expected result.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The bandwidth throttle code dequeues a group when it runs out of quota, and
re-queues it once the period rolls over and the quota gets refreshed.
Sadly it failed to take the hierarchy into consideration. Share more of the
enqueue/dequeue code with regular task opterations.
Also, some operations like sched_setscheduler() can dequeue/enqueue tasks that
are in throttled runqueues, we should not inadvertly re-enqueue empty runqueues
so check for that.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Don't re-set the entity's runqueue to the wrong rq after we've set it
to the right one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Daniel K. <dk@uw.no>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Most places in the kernel that go BUG: print a module list
(which is very useful for doing statistics and finding patterns),
however the softlockup detector does not do this yet.
This patch adds the one line change to fix this gap.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch takes a step towards making rcutorture more brutal by allowing
the test to be automatically periodically paused, with the default being
to run the test for five seconds then pause for five seconds and repeat.
This behavior can be controlled using a new "stutter" module parameter, so
that "stutter=0" gives the old default behavior of running continuously.
Starting and stopping rcutorture more heavily stresses RCU's interaction
with the scheduler, as well as exercising more paths through the
grace-period detection code.
Note that the default to "shuffle_interval" has also been adjusted from
5 seconds to 3 seconds to provide varying overlap with the "stutter"
interval.
I am still unable to provoke the failures that Alexey has been seeing,
even with this patch, but will be doing a few additional things to beef
up rcutorture.
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's not explicitly marked as asmlinkage, but invoked from x86_32
startup code with parameters on stack.
No other architectures define early_printk and none of them are affected
by this change, since defines asmlinkage as empty token.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
regarding this commit: 45c01e8249
I think we can do it simpler. Please take a look at the patch below.
Instead of having 2 separate arrays (which is + ~800 bytes on x86_32 and
twice so on x86_64), let's add "exclusive" (the ones that are bound to
this CPU) tasks to the head of the queue and "shared" ones -- to the
end.
In case of a few newly woken up "exclusive" tasks, they are 'stacked'
(not queued as now), meaning that a task {i+1} is being placed in front
of the previously woken up task {i}. But I don't think that this
behavior may cause any realistic problems.
There are a couple of changes on top of this one.
(1) in check_preempt_curr_rt()
I don't think there is a need for the "pick_next_rt_entity(rq, &rq->rt)
!= &rq->curr->rt" check.
enqueue_task_rt(p) and check_preempt_curr_rt() are always called one
after another with rq->lock being held so the following check
"p->rt.nr_cpus_allowed == 1 && rq->curr->rt.nr_cpus_allowed != 1" should
be enough (well, just its left part) to guarantee that 'p' has been
queued in front of the 'curr'.
(2) in set_cpus_allowed_rt()
I don't thinks there is a need for requeue_task_rt() here.
Perhaps, the only case when 'requeue' (+ reschedule) might be useful is
as follows:
i) weight == 1 && cpu_isset(task_cpu(p), *new_mask)
i.e. a task is being bound to this CPU);
ii) 'p' != rq->curr
but here, 'p' has already been on this CPU for a while and was not
migrated. i.e. it's possible that 'rq->curr' would not have high chances
to be migrated right at this particular moment (although, has chance in
a bit longer term), should we allow it to be preempted.
Anyway, I think we should not perhaps make it more complex trying to
address some rare corner cases. For instance, that's why a single queue
approach would be preferable. Unless I'm missing something obvious, this
approach gives us similar functionality at lower cost.
Verified only compilation-wise.
(Almost)-Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix this warning, which appears with !CONFIG_SMP:
kernel/sched.c:1216: warning: `init_hrtick' defined but not used
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since softirq works for rcu reclaimer, rcu_tasklet is unused now.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a fix to give notrace filter rules priority over "set_ftrace_filter"
rules.
This fix ensures that functions which are set to be filtered and are
concurrently marked as "notrace" don't get recorded. As of now, if
a record is marked as FTRACE_FL_FILTER and is enabled, then the notrace
flag is not checked. Tested on x86-32.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- shorter code and better atomicity with regards to printk().
(It's been tested with the backtrace self-test code on i386 and x86_64.)
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The key instead of address of lock_class should be output in
/proc/lockdep when forward dependency is output, because key is
output for lock_class itself as identifier too.
This patch is based on x86/auto-latest branch of git-x86 tree, and has
been tested on x86_64 platform.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix error checking routine to catch an error which occurs in first
__register_*probe().
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: David Miller <davem@davemloft.net>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ACPI PM: Add possibility to change suspend sequence
There are some systems out there that don't work correctly with
our current suspend/hibernation code ordering. Provide a workaround
for these systems allowing them to pass 'acpi_sleep=old_ordering' in
the kernel command line so that it will use the pre-ACPI 2.0 ("old")
suspend code ordering.
Unfortunately, this requires us to add a platform hook to the
resuming of devices for recovering the platform in case one of the
device drivers' .suspend() routines returns error code. Namely,
ACPI 1.0 specifies that _PTS should be called before suspending
devices, but _WAK still should be called before resuming them in
order to undo the changes made by _PTS. However, if there is an
error during suspending devices, they are automatically resumed
without returning control to the PM core, so the _WAK has to be
called from within device_resume() in that cases.
The patch also reorders and refactors the ACPI suspend/hibernation
code to avoid duplication as far as reasonably possible.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
(overflow means weight >= 2^32 here, because inv_weigh = 2^32/weight)
A weight of a cfs_rq is the sum of weights of which entities
are queued on this cfs_rq, so it will overflow when there are
too many entities.
Although, overflow occurs very rarely, but it break fairness when
it occurs. 64-bits systems have more memory than 32-bit systems
and 64-bit systems can create more process usually, so overflow may
occur more frequently.
This patch guarantees fairness when overflow happens on 64-bit systems.
Thanks to the optimization of compiler, it changes nothing on 32-bit.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I found a bug which can be reproduced by this way:(linux-2.6.26-rc5, x86-64)
(use 2^32, 2^33, ...., 2^63 as shares value)
# mkdir /dev/cpuctl
# mount -t cgroup -o cpu cpuctl /dev/cpuctl
# cd /dev/cpuctl
# mkdir sub
# echo 0x8000000000000000 > sub/cpu.shares
# echo $$ > sub/tasks
oops here! divide by zero.
This is because do_div() expects the 2th parameter to be 32 bits,
but unsigned long is 64 bits in x86_64.
Peter Zijstra pointed it out that the sane thing to do is limit the
shares value to something smaller instead of using an even more
expensive divide.
Also, I found another bug about "the shares value is too large":
pid1 and pid2 are set affinity to cpu#0
pid1 is attached to cg1 and pid2 is attached to cg2
if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000
then pid2 got 100% usage of cpu, and pid1 0%
if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000
then pid2 got 0% usage of cpu, and pid1 100%
And a weight of a cfs_rq is the sum of weights of which entities
are queued on this cfs_rq, so the shares value should be limited
to a smaller value.
I think that (1UL << 18) is a good limited value:
1) it's not too large, we can create a lot of group before overflow
2) it's several times the weight value for nice=-19 (not too small)
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do not print loglevel before "entries of %ld bytes". Move it to the previous
pr_info.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce 'struct pm_ops' and 'struct pm_ext_ops' ('ext' meaning
'extended') representing suspend and hibernation operations for bus
types, device classes, device types and device drivers.
Modify the PM core to use 'struct pm_ops' and 'struct pm_ext_ops'
objects, if defined, instead of the ->suspend(), ->resume(),
->suspend_late(), and ->resume_early() callbacks (the old callbacks
will be considered as legacy and gradually phased out).
The main purpose of doing this is to separate suspend (aka S2RAM and
standby) callbacks from hibernation callbacks in such a way that the
new callbacks won't take arguments and the semantics of each of them
will be clearly specified. This has been requested for multiple
times by many people, including Linus himself, and the reason is that
within the current scheme if ->resume() is called, for example, it's
difficult to say why it's been called (ie. is it a resume from RAM or
from hibernation or a suspend/hibernation failure etc.?).
The second purpose is to make the suspend/hibernation callbacks more
flexible so that device drivers can handle more than they can within
the current scheme. For example, some drivers may need to prevent
new children of the device from being registered before their
->suspend() callbacks are executed or they may want to carry out some
operations requiring the availability of some other devices, not
directly bound via the parent-child relationship, in order to prepare
for the execution of ->suspend(), etc.
Ultimately, we'd like to stop using the freezing of tasks for suspend
and therefore the drivers' suspend/hibernation code will have to take
care of the handling of the user space during suspend/hibernation.
That, in turn, would be difficult within the current scheme, without
the new ->prepare() and ->complete() callbacks.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Found that inspite of setting the current_tracer to "none", trace from
the previous trace type continued to be collected. The patch below fixes
this and causes the trace to be disabled when the "none" type is
selected.
Compile and boot tested the patch for functionality.
Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Sitsofe Wheeler bisected the following commit to cause a lockdep to
warn about itself and turn itself off:
> commit c6531cce6e
> Author: Ingo Molnar <mingo@elte.hu>
> Date: Mon May 12 21:21:14 2008 +0200
>
> sched: do not trace sched_clock
do not use raw irq flags in cpu_clock() as it causes lockdep to lose
track of the true state of the IRQ flag.
Reported-and-bisected-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Building with CONFIG_FAIR_GROUP_SCHED=y on UP results in an unused
cfs_rq_set_shares() reference. As nothing is using this dummy function
in the first place, just kill it off.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kthreads that have called kthread_bind() are bound to specific cpus, so
other tasks should not be able to change their cpus_allowed from under
them. Otherwise, it is possible to move kthreads, such as the migration
or software watchdog threads, so they are not allowed access to the cpu
they work on.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cliff Wickman wrote:
> I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1)
> and get a very predictable hotplug cpu problem.
> billberry1:/tmp/cpw # ./dis
> disabled cpu 17
> enabled cpu 17
> billberry1:/tmp/cpw # ./dis
> disabled cpu 17
> enabled cpu 17
> billberry1:/tmp/cpw # ./dis
>
> The script that disables the cpu always hangs (unkillable)
> on the 3rd attempt.
>
> And a bit further:
> The kstopmachine thread always sits on the run queue (real time) for about
> 30 minutes before running.
this fix solves some (but not all) issues between CPU hotplug and
RT bandwidth throttling.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt wrote:
> If we unload a module and reload it, will it ever get converted again?
The intent was always to filter core kernel functions to prevent their freeing.
Here's a fix which should allow re-recording of module call-sites.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Identify functions which had their mcount call-site updates failed. This can
help us track functions which ftrace shouldn't fiddle with, and are thus not
being traced. If there is no race with any external agent which is modifying
the mcount call-site, then this file displays no entries (normal case).
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent freeing of records which cause problems and correspond to function from
core kernel text. A new flag, FTRACE_FL_CONVERTED is used to mark a record
as "converted". All other records are patched lazily to NOPs. Failed records
now also remain on frace_hash table. Each invocation of ftrace_record_ip now
checks whether the traced function has ever been recorded (including past
failures) and doesn't re-record it again.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change __mutex_lock_common() to use signal_pending_state() for the sake of
the code re-use.
This adds 7 bytes to kernel/mutex.o, but afaics only because gcc isn't smart
enough.
(btw, uninlining of __mutex_lock_common() shrinks .text from 2722 to 1542,
perhaps it is worth doing).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case,
this allows us to do
current->state = TASK_INTERRUPTIBLE;
schedule();
without fear to sleep with pending signal.
However, the code like
current->state = TASK_KILLABLE;
schedule();
is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
no effect).
Introduce the new helper, signal_pending_state(), and change schedule() to
use it. Hopefully it will have more users, that is why the task's state is
passed separately.
Note this "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
This is needed to preserve the current behaviour (ptrace_notify). I hope
this check will be removed soon, but this (afaics good) change needs the
separate discussion.
The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
basically the same that schedule() does now. However, this patch of course
bloats schedule().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6:
capabilities: remain source compatible with 32-bit raw legacy capability support.
LSM: remove stale web site from MAINTAINERS
Adding a nonexistent cpu to a cpuset will be omitted quietly. It should
return -EINVAL.
Example: (real_nr_cpus <= 4 < NR_CPUS or cpu#4 was just offline)
# cat cpus
0-1
# /bin/echo 4 > cpus
# /bin/echo $?
0
# cat cpus
#
The same occurs when add a nonexistent mem.
This patch will fix this bug.
And when *buf == "", the check is unneeded.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/cpu.c seems a more logical place for those maps since they do not really
have much to do with the scheduler these days.
kernel/cpu.c is now built for the UP kernel too, but it does not affect the size
the kernel sections.
$ size vmlinux
before
text data bss dec hex filename
3313797 307060 310352 3931209 3bfc49 vmlinux
after
text data bss dec hex filename
3313797 307060 310352 3931209 3bfc49 vmlinux
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Cc: mingo@elte.hu
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.
Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: menage@google.com
Cc: rostedt@goodmis.org
Cc: mingo@elte.hu
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
this patch was not built on !SMP:
kernel/sched_rt.c: In function 'inc_rt_tasks':
kernel/sched_rt.c:404: error: 'struct rq' has no member named 'online'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter pointed out that the last version of the "fix" was still one off
under certain circumstances. Use BITS_TO_LONG instead to get an
accurate result.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A rounding error was pointed out by Peter Zijlstra which would result
in the structure holding priorities to be off by one.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The RT folks over at RedHat found an issue w.r.t. hotplug support which
was traced to problems with the cpupri infrastructure in the scheduler:
https://bugzilla.redhat.com/show_bug.cgi?id=449676
This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel. This patch
applies to 25.4-rt4, though it should trivially apply to most cpupri enabled
kernels mentioned above.
It turned out that the issue was that offline cpus could get inadvertently
registered with cpupri so that they were erroneously selected during
migration decisions. The end result would be an OOPS as the offline cpu
had tasks routed to it.
This patch generalizes the old join/leave domain interface into an
online/offline interface, and adjusts the root-domain/hotplug code to
utilize it.
I was able to easily reproduce the issue prior to this patch, and am no
longer able to reproduce it after this patch. I can offline cpus
indefinately and everything seems to be in working order.
Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point
me in the right direction. Also thank you to Peter for reviewing the
early iterations of this patch.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
While printing out the visual representation of the sched-domains, print
the level (MC, SMT, CPU, NODE, ... ) of each of the sched_domains.
Credit: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For the normal WARN_ON() etc we added a print-the-modules-list already,
which is very useful to figure out candidates for certain types of bugs.
This patch adds the same print to the "scheduling while atomic" BUG warning,
for the same reason: when we get here it's very useful to see which modules
are loaded, to narrow down the candidate code list.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix this warning, which appears with !CONFIG_SMP:
kernel/sched.c:1216: warning: `init_hrtick' defined but not used
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nothing really serious here, mainly just a matter of nit-picking :-/
From: Dmitry Adamushko <dmitry.adamushko@gmail.com>
For CONFIG_SCHED_DEBUG && CONFIG_SYSCT configs, sd->flags can be altered
while being manipulated in rebalance_domains(). Let's do an atomic check.
We rely here on the atomicity of read/write accesses for aligned words.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The currently logic inadvertently skips the last task on the run-queue,
resulting in missed balance opportunities.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: David Bahi <dbahi@novell.com>
CC: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The current code use a linear algorithm which causes scaling issues
on larger SMP machines. This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
it is safe to ignore timers and flags when the feature is disabled.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Dmitry Adamushko pointed out a known flaw in the rt-balancing algorithm
that could allow suboptimal balancing if a non-migratable task gets
queued behind a running migratable one. It is discussed in this thread:
http://lkml.org/lkml/2008/4/22/296
This issue has been further exacerbated by a recent checkin to
sched-devel (git-id 5eee63a5ebc19a870ac40055c0be49457f3a89a3).
>From a pure priority standpoint, the run-queue is doing the "right"
thing. Using Dmitry's nomenclature, if T0 is on cpu1 first, and T1
wakes up at equal or lower priority (affined only to cpu1) later, it
*should* wait for T0 to finish. However, in reality that is likely
suboptimal from a system perspective if there are other cores that
could allow T0 and T1 to run concurrently. Since T1 can not migrate,
the only choice for higher concurrency is to try to move T0. This is
not something we addessed in the recent rt-balancing re-work.
This patch tries to enhance the balancing algorithm by accomodating this
scenario. It accomplishes this by incorporating the migratability of a
task into its priority calculation. Within a numerical tsk->prio, a
non-migratable task is logically higher than a migratable one. We
maintain this by introducing a new per-priority queue (xqueue, or
exclusive-queue) for holding non-migratable tasks. The scheduler will
draw from the xqueue over the standard shared-queue (squeue) when
available.
There are several details for utilizing this properly.
1) During task-wake-up, we not only need to check if the priority
preempts the current task, but we also need to check for this
non-migratable condition. Therefore, if a non-migratable task wakes
up and sees an equal priority migratable task already running, it
will attempt to preempt it *if* there is a likelyhood that the
current task will find an immediate home.
2) Tasks only get this non-migratable "priority boost" on wake-up. Any
requeuing will result in the non-migratable task being queued to the
end of the shared queue. This is an attempt to prevent the system
from being completely unfair to migratable tasks during things like
SCHED_RR timeslicing.
I am sure this patch introduces potentially "odd" behavior if you
concoct a scenario where a bunch of non-migratable threads could starve
migratable ones given the right pattern. I am not yet convinced that
this is a problem since we are talking about tasks of equal RT priority
anyway, and there never is much in the way of guarantees against
starvation under that scenario anyway. (e.g. you could come up with a
similar scenario with a specific timing environment verses an affinity
environment). I can be convinced otherwise, but for now I think this is
"ok".
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Current IRQ affinity interface does not provide a way to set affinity
for the IRQs that will be allocated/activated in the future.
This patch creates /proc/irq/default_smp_affinity that lets users set
default affinity mask for the newly allocated IRQs. Changing the default
does not affect affinity masks for the currently active IRQs, they
have to be changed explicitly.
Updated based on Paul J's comments and added some more documentation.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Cc: pj@sgi.com
Cc: a.p.zijlstra@chello.nl
Cc: tglx@linutronix.de
Cc: rdunlap@xenotime.net
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kgdbts: Use HW breakpoints with CONFIG_DEBUG_RODATA
kgdb: use common ascii helpers and put_unaligned_be32 helper
In dynamic ftrace, the mcount function starts off pointing to a stub
function that just returns.
On start up, the call to the stub is modified to point to a "record_ip"
function. The job of the record_ip function is to add the function to
a pre-allocated hash list. If the function is already there, it simply is
ignored, otherwise it is added to the list.
Later, a ftraced daemon wakes up and calls kstop_machine if any functions
have been recorded, and changes the calls to the recorded functions to
a simple nop. If no functions were recorded, the daemon goes back to sleep.
The daemon wakes up once a second to see if it needs to update any newly
recorded functions into nops. Usually it does not, but if a lot of code
has been executed for the first time in the kernel, the ftraced daemon
will call kstop_machine to update those into nops.
The problem currently is that there's no way to stop the daemon from doing
this, and it can cause unneeded latencies (800us which for some is bothersome).
This patch adds a new file /debugfs/tracing/ftraced_enabled. If the daemon
is active, reading this will return "enabled\n" and "disabled\n" when the
daemon is not running. To disable the daemon, the user can echo "0" or
"disable" into this file, and "1" or "enable" to re-enable the daemon.
Since the daemon is used to convert the functions into nops to increase
the performance of the system, I also added that anytime something is
written into the ftraced_enabled file, kstop_machine will run if there
are new functions that have been detected that need to be converted.
This way the user can disable the daemon but still be able to control the
conversion of the mcount calls to nops by simply,
"echo 0 > /debugfs/tracing/ftraced_enabled"
when they need to do more conversions.
To see the number of converted functions:
"cat /debugfs/tracing/dyn_ftrace_total_info"
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tracing functions via ftrace which have a kretprobe installed on them, can produce misleading output in their trace logs. E.g, consider the correct trace of the following sequence:
do_IRQ()
{
~
irq_enter();
~
}
Trace log (sample):
<idle>-0 [00] 4154504455.781616: irq_enter <- do_IRQ
But if irq_enter() has a kretprobe installed on it, the return value stored on the stack at each invocation is modified to divert the return to a kprobe trampoline function called kretprobe_trampoline(). So with this the trace would (currently) look like:
<idle>-0 [00] 4154504455.781616: irq_enter <- kretprobe_trampoline
Now this is quite misleading to the end user, as it suggests something that didn't actually happen. So just to avoid such misinterpretations, the inlined patch aims to output such a log as:
<idle>-0 [00] 4154504455.781616: irq_enter <- [unknown/kretprobe'd]
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The touch_nmi_watchdog() routine on x86 ultimately calls
touch_softlockup_watchdog(). The problem is that to touch the
softlockup watchdog, the cpu_clock code has to be called which could
involve multiple cpu locks and can lead to a hard hang if one of the
locks is held by a processor that is not going to return anytime soon
(such as could be the case with kgdb or perhaps even with some other
kind of exception).
This patch causes the public version of the
touch_softlockup_watchdog() to defer the cpu clock access to a later
point.
The test case for this problem is to use the following kernel config
options:
CONFIG_KGDB_TESTS=y
CONFIG_KGDB_TESTS_ON_BOOT=y
CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000"
It should be noted that kgdb test suite and these options were not
available until 2.6.26-rc2, so it was necessary to patch the kgdb
test suite during the bisection.
I would consider this patch a regression fix because the problem first
appeared in commit 27ec440779 when some
logic was added to try to periodically sync the clocks. It was
possible to work around this particular problem by simply not
performing the sync anytime the system was in a critical context.
This was ok until commit 3e51f33fcc,
which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some
multi-cpu locks to sync the clocks. It became clear that accessing
this code from an nmi was the source of the lockups. Avoiding the
access to the low level clock code from an code inside the NMI
processing also fixed the problem with the 27ec44... commit.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Source code out there hard-codes a notion of what the
_LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
raw capability system calls capget() and capset(). Its unfortunate, but
true.
Since the confusing header file has been in a released kernel, there is
software that is erroneously using 64-bit capabilities with the semantics
of 32-bit compatibilities. These recently compiled programs may suffer
corruption of their memory when sys_getcap() overwrites more memory than
they are coded to expect, and the raising of added capabilities when using
sys_capset().
As such, this patch does a number of things to clean up the situation
for all. It
1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
legacy value.
2. adopts a new #define strategy for the kernel's internal
implementation of the preferred magic.
3. deprecates v2 capability magic in favor of a new (v3) magic
number. The functionality of v3 is entirely equivalent to v2,
the only difference being that the v2 magic causes the kernel
to log a "deprecated" warning so the admin can find applications
that may be using v2 inappropriately.
[User space code continues to be encouraged to use the libcap API which
protects the application from details like this. libcap-2.10 is the first
to support v3 capabilities.]
Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
Thanks to Bojan Smojver for the report.
[akpm@linux-foundation.org: s/depreciate/deprecate/g]
[akpm@linux-foundation.org: be robust about put_user size]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Bojan Smojver <bojan@rexursive.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Fix (probably theoretical only) rq->clock update bug:
in tick_nohz_update_jiffies() [which is called on all irq
entry on all cpus where the irq entry hits an idle cpu] we
call touch_softlockup_watchdog() before we update jiffies.
That works fine most of the time when idle timeouts are within
60 seconds. But when an idle timeout is beyond 60 seconds,
jiffies is updated with a jump of more than 60 seconds,
which causes a jump in cpu-clock of more than 60 seconds,
triggering a false positive.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make sched_clock_cpu() return 0 before it has been initialized and avoid
corrupting its state due to doing so.
This fixes the weird printk timestamp jump reported.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Yanmin Zhang reported:
Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1.
It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito.
With bisect, I located the following patch:
| 18d95a2832 is first bad commit
| commit 18d95a2832
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date: Sat Apr 19 19:45:00 2008 +0200
|
| sched: fair-group: SMP-nice for group scheduling
Revert it so that we get v2.6.25 behavior.
Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Coverity checker spotted a memleak introduced by commit
39106dcf85 (cpumask: use new cpus_scnprintf
function).
It seems the kfree() got lost between v2 and v3 of this patch...
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Splice isn't always incrementing the ppos correctly, which broke
relay splice.
Signed-off-by: Tom Zanussi <zanussi@comcast.net>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Without console= arguments on the kernel command line, the first
console to register becomes enabled and the preferred console (the one
behind /dev/console). This is normally tty (assuming
CONFIG_VT_CONSOLE is enabled, which it commonly is).
This is okay as long tty is a useful console. But unless we have the
PV framebuffer, and it is enabled for this domain, tty0 in domU is
merely a dummy. In that case, we want the preferred console to be the
Xen console hvc0, and we want it without having to fiddle with the
kernel command line. Commit b8c2d3dfbc
did that for us.
Since we now have the PV framebuffer, we want to enable and prefer tty
again, but only when PVFB is enabled. But even then we still want to
enable the Xen console as well.
Problem: when tty registers, we can't yet know whether the PVFB is
enabled. By the time we can know (xenstore is up), the console setup
game is over.
Solution: enable console tty by default, but keep hvc as the preferred
console. Change the preferred console to tty when PVFB probes
successfully, unless we've been given console kernel parameters.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The variables dns and inc are not used, remove them.
Signed-off-by: Carlos R. Mafra <crmafra@gmail.com>
Cc: tglx@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
hres_timers_resume() warns if there appears to be more than one cpu
online. This warning makes sense when the suspend/resume mechanism
offlines all cpus but one during the suspend/resume process.
However, Xen suspend does not need to offline the other cpus; it
merely keeps them tied up in stop_machine() while the virtual machine
is suspended. The warning hres_timers_resume issues is therefore
spurious.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Hi Ingo/Steven,
Ftrace currently maintains an update count which includes false updates,
i.e, updates which failed. If anything, such failures should be tracked
by some separate variable, but this patch provides a minimal fix.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: rostedt@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Hi Steven,
I noticed that concurrent instances of ftrace_record_ip()
have a race between ftrace_hash list traversal during
ftrace_ip_in_hash() (before acquiring ftrace_shutdown_lock)
and ftrace_add_hash(). If it's so then this should fix it.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: rostedt@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The new work with converting the trace hooks over to markers broke the
command line recording of ftrace. This patch fixes it again.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
While debugging latencies in the RT kernel, I found that it would be nice
to be able to filter away functions from the trace than just to filter
on functions.
I added a new interface to the debugfs tracing directory called
set_ftrace_notrace
When dynamic frace is enabled, this lets you filter away functions that will
not be recorded in the trace. It is similar to adding 'notrace' to those
functions but by doing it without recompiling the kernel.
Here's how set_ftrace_filter and set_ftrace_notrace interact. Remember, if
set_ftrace_filter is set, it removes all functions from the trace execpt for
those listed in the set_ftrace_filter. set_ftrace_notrace will prevent those
functions from being traced.
If you were to set one function in both set_ftrace_filter and
set_ftrace_notrace and that function was the same, then you would end up
with an empty trace.
the set of functions to trace is:
set_ftrace_filter == empty then
all functions not in set_ftrace_notrace
else
set of the set_ftrace_filter and not in set of set_ftrace_notrace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Printing out new max latencies was fine for the old RT tracer. But for
mainline it is a bit messy. We also need to test if the run queue
is locked before we can do the print. This means that we may not be
printing out latencies if the run queue is locked on another CPU.
This produces inconsistencies in the output.
This patch simply removes the print altogether.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: pq@iki.fi
Cc: proski@gnu.org
Cc: sandmann@redhat.com
Cc: a.p.zijlstra@chello.nl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds function tracing to the functions that are called
on the CPU of the task being traced.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: pq@iki.fi
Cc: proski@gnu.org
Cc: sandmann@redhat.com
Cc: a.p.zijlstra@chello.nl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The check_pages function is called often enough that it can cause problems
with trace outputs or even bringing the system to a halt.
This patch limits the check_pages to the places that are most likely to
have problems. The check is made at the flip between the global array and
the max save array, as well as when the size of the buffers changes and
the self tests.
This patch also removes the BUG_ON from check_pages and replaces it with
a WARN_ON and disabling of the tracer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: pq@iki.fi
Cc: proski@gnu.org
Cc: sandmann@redhat.com
Cc: a.p.zijlstra@chello.nl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Based on Roland's patch. This approach was suggested by Austin Clements
from the very beginning, and then by Linus.
As Austin pointed out, the execing task can be killed by SI_TIMER signal
because exec flushes the signal handlers, but doesn't discard the pending
signals generated by posix timers. Perhaps not a bug, but people find this
surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently sigqueue_free() removes sigqueue from list, but doesn't cancel the
pending signal. This is not consistent, the task should either receive the
"full" signal along with siginfo_t, or it shouldn't receive the signal at all.
Change sigqueue_free() to clear SIGQUEUE_PREALLOC but leave sigqueue on list
if it is queued.
This is a user-visible change. If the signal is blocked, it stays queued
after sys_timer_delete() until unblocked with the "stale" si_code/si_value,
and of course it is still counted wrt RLIMIT_SIGPENDING which also limits
the number of posix timers.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As git-grep shows, open_softirq() is always called with the last argument
being NULL
block/blk-core.c: open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
kernel/hrtimer.c: open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
kernel/rcuclassic.c: open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/rcupreempt.c: open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/sched.c: open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
kernel/softirq.c: open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
kernel/softirq.c: open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
kernel/timer.c: open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
net/core/dev.c: open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
net/core/dev.c: open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
This observation has already been made by Matthew Wilcox in June 2002
(http://www.cs.helsinki.fi/linux/linux-kernel/2002-25/0687.html)
"I notice that none of the current softirq routines use the data element
passed to them."
and the situation hasn't changed since them. So it appears we can safely
remove that extra argument to save 128 (54) bytes of kernel data (text).
Signed-off-by: Carlos R. Mafra <crmafra@ift.unesp.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix unaligned access errors when setting softlockup_thresh on
64 bit platforms.
Allow softlockup detection to be disabled by setting
softlockup_thresh <= 0.
Detect that boot time softlockup detection has been disabled
earlier in softlockup_tick.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
allow users to configure the softlockup detector to generate a panic
instead of a warning message.
high-availability systems might opt for this strict method (combined
with panic_timeout= boot option/sysctl), instead of generating
softlockup warnings ad infinitum.
also, automated tests work better if the system reboots reliably (into
a safe kernel) in case of a lockup.
The full spectrum of configurability is supported: boot option, sysctl
option and Kconfig option.
it's default-disabled.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
printk(KERN_ALERT "Danger Will Robinson!\nAlien Approaching!\n");
At present this will result in one message at ALERT level and one
at the current default message loglevel (e.g. WARNING). This is
non-intuitive.
Modify vprintk() to remember the message loglevel each time it
is specified and use it for subsequent lines of output which do
not specify one, within the same call to printk.
Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Restructure the logic of vprintk() so the processing of the leading
3 characters of each input line is in one place, regardless whether
printk_time is enabled. This makes the code smaller and easier to
understand.
size reduction in kernel/printk.o:
text data bss dec hex filename
6157 397 1049804 1056358 101e66 printk.o.before
6117 397 1049804 1056318 101e3e printk.o.after
and some style uncleanlinesses removed as well as a side-effect:
Before:
total: 19 errors, 22 warnings, 1340 lines checked
After:
total: 17 errors, 22 warnings, 1333 lines checked
Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
console election: If some console happens to be registered first which does
not provide a tty binding (!console->device), it prevents that more suited
consoles which are registered later on can enter the candidate pool for
console_device(). This is observable with KGDB's console which may already
be registered (and exploited!) during early debugger connections, that is
before any regular console registration.
This patch fixes the issue by postponing the final, automated
preferred_console selection until someone with a non-NULL device handler
comes around.
Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make printk_recursion_bug_msg static and drop printk prefix from recursion
variables.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is a slight change in the namespace cgroup subsystem api.
The change is that previously when cgroup_clone() was called (currently
only from the unshare path in ns_proxy cgroup, you'd get a new group named
"node_$pid" whereas now you'll get a group named after just your pid.)
The only users who would notice it are those who are using the ns_proxy
cgroup subsystem to auto-create cgroups when namespaces are unshared -
something of an experimental feature, which I think really needs more
complete container/namespace support in order to be useful. I suspect the
only users are Cedric and Serge, or maybe a few others on
containers@lists.linux-foundation.org. And in fact it would only be
noticed by the users who make the assumption about how the name is
generated, rather than getting it from the /proc/<pid>/cgroups file for
the process in question.
Whether the change is actually needed or not I'm fairly agnostic on, but I
guess it is more elegant to just use the pid as the new group name rather
than adding a fairly arbitrary "node_" prefix on the front.
[menage@google.com: provided changelog]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Paul Menage" <menage@google.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If none of the switch cases match, the PR_SET_PDEATHSIG and
PR_SET_DUMPABLE cases of the switch statement will never write to local
variable `error'.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Cc: Andrew G. Morgan <morgan@kernel.org>
Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock.
This can race with another thread doing sigqueue_free(), we can free the
same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list.
Note that even sys_exit_group() can trigger this race, not only
sys_timer_delete().
Move the callsite of flush_sigqueue(tsk->pending) under ->siglock.
This patch doesn't touch flush_sigqueue(->shared_pending) below, it is
called when there are no other threads which can play with signals, and
sigqueue_free() can't be used outside of our thread group.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's useful to detect which process is killed by RT watchdog.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Non-zero pid indicates the MMIO access originated in user space.
We do not catch that kind of accesses yet, so always print zero for now.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
another weekend, another patch. This should apply on top of my previous patch
from March 23rd.
Summary of changes:
- Print PCI device list in output header
- work around recursive probe hits on SMP
- refactor dis/arm_kmmio_fault_page() and add check for page levels
- remove un/reference_kmmio(), the die notifier hook is registered
permanently into the list
- explicitly check for single stepping in die notifier callback
I have tested this version on my UP Athlon64 desktop with Nouveau, and
SMP Core 2 Duo laptop with the proprietary nvidia driver. Both systems
are 64-bit. One previously unknown bug crept into daylight: the ftrace
framework's output routines print the first entry last after buffer has
wrapped around.
The most important regressions compared to non-ftrace mmiotrace at this
time are:
- failure of trace_pipe file
- illegal lines in output file
- unaware of losing data due to buffer full
Personally I'd like to see these three solved before submitting to
mainline. Other issues may come up once we know when we lose events.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
here is a patch that makes mmiotrace work almost well within the tracing
framework. The patch applies on top of my previous patch. I have my own
output formatting in place now.
Summary of changes:
- fix the NULL dereference that was due to not calling tracing_reset()
- add print_line() callback into struct tracer
- implement print_line() for mmiotrace, producing up-to-spec text
- add my output header, but that is not really called in the right place
- rewrote the main structs in mmiotrace
- added two new trace entry types: TRACE_MMIO_RW and TRACE_MMIO_MAP
- made some functions in trace.c non-static
- check current==NULL in tracing_generic_entry_update()
- fix(?) comparison in trace_seq_printf()
Things seem to work fine except a few issues. Markers (text lines injected
into mmiotrace log) are missing, I did not feel hacking them in before we
have variable length entries. My output header is printed only for 'trace'
file, but not 'trace_pipe'. For some reason, despite my quick fix,
iter->trace is NULL in print_trace_line() when called from 'trace_pipe'
file, which means I don't get proper output formatting.
I only tried by loading nouveau.ko, which just detects the card, and that
is traced fine. I didn't try further. Map, two reads and unmap. Works
perfectly.
I am missing the information about overflows, I'd prefer to have a
counter for lost events. I didn't try, but I guess currently there is no
way of knowning when it overflows?
So, not too far from being fully operational, it seems :-)
And looking at the diffstat, there also is some 700-900 lines of user space
code that just became obsolete.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On Sat, 22 Mar 2008 13:07:47 +0100
Ingo Molnar <mingo@elte.hu> wrote:
> > > i'd suggest the following: pull x86.git and sched-devel.git into a
> > > single tree [the two will combine without rejects]. Then try to add a
> > > kernel/tracing/trace_mmiotrace.c ftrace plugin. The trace_sysprof.c
> > > plugin might be a good example.
> >
> > I did this and now I have mmiotrace enabled/disabled via the tracing
> > framework (what do we call this, since ftrace is one of the tracers?).
>
> cool! could you send the patches for that? (even if they are not fully
> functional yet)
Patch attached in the end. Nice to see how much code disappeared. I tried
to mark all the features I had to break with XXX-comments.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch makes the stacktrace printout code \warn when the entries
pointer is unset rather than crashing when trying to access it in an
attempt to make it a bit more robust.
I was saving a stacktrace into an skb and forgot to copy it across skb
copies... I have since fixed the code, but it would have been easier
had the kernel not crashed in an interrupt.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Check that current->mm is non-NULL before attempting to trace the user
stack.
Also take depth of the kernel stack into account when comparing
against sample_max_depth.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
add kernel backtracing to the sysprof tracer.
change the format of the data, so that type=0 means
beginning of stack trace, 1 means kernel address, 2 means user
address, and 3 means end of trace.
EIP addresses are no longer distinguished from return addresses,
mostly because sysprof userspace doesn't make use of it. It may be
worthwhile adding this back in though, just in case it becomes
interesting.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
that's the only tested platform for now. If there's interest we
can make it generic easily.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now that ftrace is being ported to other architectures, it has become
apparent that DYNAMIC_FTRACE is dependent on whether or not that
architecture implements dynamic ftrace. FTRACE itself may be ported to
an architecture without porting dynamic ftrace.
This patch adds HAVE_DYNAMIC_FTRACE to allow architectures to port ftrace
without having to also port the dynamic aspect as well.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch removes the Makefile turd and uses the nice CFLAGS_REMOVE macro
in the kernel directory.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds full support for ftrace for PowerPC (both 64 and 32 bit).
This includes dynamic tracing and function filtering.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
David S. Miller noticed the following bug: the -pg instrumentation
function callback is named differently on each platform. On x86 it
is mcount, on sparc it is _mcount. So the export does not make sense
in kernel/trace/ftrace.c - move it to x86.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It causes unaligned access traps on platforms like sparc
(ftrace_page may be marked packed, but once we return
a dyn_ftrace sub-object from this array to another piece
of code, the "packed" part of the typing information doesn't
propagate).
But also, it didn't serve any purpose either. Even if packed,
on 64-bit or 32-bit, it didn't give us any more dyn_ftrace
entries per-page.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Porting ftrace to the marker infrastructure.
Don't need to chain to the wakeup tracer from the sched tracer, because markers
support multiple probes connected.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Denys Vlasenko <vda.linux@googlemail.com> :
> Not in this patch, but I noticed:
>
> #define __trace_mark(name, call_private, format, args...) \
> do { \
> static const char __mstrtab_##name[] \
> __attribute__((section("__markers_strings"))) \
> = #name "\0" format; \
> static struct marker __mark_##name \
> __attribute__((section("__markers"), aligned(8))) = \
> { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)], \
> 0, 0, marker_probe_cb, \
> { __mark_empty_function, NULL}, NULL }; \
> __mark_check_format(format, ## args); \
> if (unlikely(__mark_##name.state)) { \
> (*__mark_##name.call) \
> (&__mark_##name, call_private, \
> format, ## args); \
> } \
> } while (0)
>
> In this call:
>
> (*__mark_##name.call) \
> (&__mark_##name, call_private, \
> format, ## args); \
>
> you make gcc allocate duplicate format string. You can use
> &__mstrtab_##name[sizeof(#name)] instead since it holds the same string,
> or drop ", format," above and "const char *fmt" from here:
>
> void (*call)(const struct marker *mdata, /* Probe wrapper */
> void *call_private, const char *fmt, ...);
>
> since mdata->format is the same and all callees which need it can take it there.
Very good point. I actually thought about dropping it, since it would
remove an unnecessary argument from the stack. And actually, since I now
have the marker_probe_cb sitting between the marker site and the
callbacks, there is no API change required. Thanks :)
Mathieu
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently there is no protection from the root user to use up all of
memory for trace buffers. If the root user allocates too many entries,
the OOM killer might start kill off all tasks.
This patch adds an algorith to check the following condition:
pages_requested > (freeable_memory + current_trace_buffer_pages) / 4
If the above is met then the allocation fails. The above prevents more
than 1/4th of freeable memory from being used by trace buffers.
To determine the freeable_memory, I made determine_dirtyable_memory in
mm/page-writeback.c global.
Special thanks goes to Peter Zijlstra for suggesting the above calculation.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Refactor code from tracing_read_pipe() and create trace_seq_to_user().
Moved trace_seq_reset() call before iter->trace->read() call so that
when all leftover data is returned, trace_seq is reset automatically.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Due to debug hooks in the kernel that can change the way smp_processor_id
works, use raw_smp_processor_id in mcount called functions (namely
ftrace_record_ip). Currently we annotate most debug functions from calling
mcount, but we should not rely on that to prevent kernel lockups.
This patch uses the raw_smp_processor_id to prevent a recusive crash
that can happen if a debug hook in smp_processor_id calls mcount.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In resetting the iterator in read_pipe, the reset of pos was
postitioned in the wrong location with respect to the memset
operation. The current code sets pos, incorrectly, to zero.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds a method for open_pipe and open_read to the pluggins
so that they can add a header to the trace pipe call.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch sets up the infrastructure to record overruns of the tracing
buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In cleaning up of the sched_switch code, the function trace recording
of task comms was removed. This patch adds back the recording of comms
for function trace. The output of ftrace now has the task comm instead
of <...>.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is first installment of adding documentation to the ftrace.
Expect many more patches of this kind in the near future.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently ftrace allocates a trace buffer for every possible CPU.
Work is being done to change it to only online CPUs and add hooks
to hotplug CPUS.
This patch lays out the infrastructure for such a change.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andrew Morton suggested using strict_strtoul over simple_strtoul.
This patch replaces them in ftrace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andrew Morton mentioned some clean ups that should be done to ftrace.
This patch does some of the simple clean ups.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the introduction of ftrace, it is possible to recurse into
the lockdep functions via the mcount call. To prevent possible
lockups, updating the lockdep_recursion counter on grabbing the internal
lockdep_lock should prevent deadlocks.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds /debug/tracing/trace_entries that allows users to
see as well as modify the number of trace entries the buffers hold.
The number of entries only increments in ENTRIES_PER_PAGE which is
calculated by the size of an entry with the number of entries that
can fit in a page. The user does not need to use an exact size, but
the entries will be rounded to one of the increments.
Trying to set the entries to 0 will return with -EINVAL.
To avoid race conditions, the modification of the buffer size can only
be done when tracing is completely disabled (current_tracer == none).
A info message will be printed if a user tries to modify the buffer size
when not set to none.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The wakeup selftest used an internal API for setting the test task priority.
This patch fixes it to use the proper API for performing such a task.
Thanks goes to Randy Dunlap for pointing out this build failure.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch implements "NONBLOCK" for trace_pipe. If the trace_pipe is opened
with O_NONBLOCK, then the trace_pipe read will not block when buffer is empty.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Break out of while loop with EOF when the current_trace changes.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The trace iterator is reset in the read. We still need to restore the tracer
that the trace_pipe was opened with.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We expect things like "cat" to block on reads to trace_pipe. That's what
trace_pipe is for.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The conversion of task states to a character in the sched_switch tracer (part
of latency tracer infrastructure), seems to be incorrect. We currently do it
by indexing into the state_to_char array using the state value. The state
values do not map directly into the array index and are thus incorrect. The
following patch addresses this issue. This is also what is being done even
in the show_task() routine in kernel/sched.c
The patch has been compile and run tested.
Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is a slight race condition in the selftest where the max update
of the wakeup and irqs/preemption off tests can be doing a max update as
the buffers are being tested. If this happens the system can crash with
a GPF.
This patch adds the max update spinlock around the checking of the
buffers to prevent such a race.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the trace output changes on reading the trace files, there is a chance
that the start function will return NULL. If the start function of a sequence
returns NULL the stop equivalent is not called. In this case, all locks
that are taken must be released even if they are released in the stop function.
This patch fixes a case that a mutex was not released on return of NULL
in the start sequence function.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When dynamic ftrace fails and sets itself disabled, the ftraced daemon
will go back to sleep everytime it wakes up. The setting of the
ftraced state to UNINTERRUPTIBLE is skipped in this process, and the
daemon takes up 100% of the CPU. This patch makes sure the ftraced daemon
sets itself to UNINTERRUPTIBLE in that loop.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch removes the "notrace" annotation from lockdep and adds the debugging
files in the kernel director to those that should not be compiled with
"-pg" mcount tracing.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Lock debugging enabled cause huge performance problems for tracing. Having
the lock verification happening for every function that is called
because mcount calls spin_lock can cripple the system.
This patch converts the spin_locks used by ftrace into raw_spin_locks.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch changes the use of __get_cpu_var to explicitly calling
raw_smp_processor_id and using the per_cpu() macro. On some debug
configurations, the use of __get_cpu_var may cause ftrace to trigger
and this can cause problems with the irqsoff tracing.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With the adding of the configuration changes in the Makefile to prevent
tracing of functions in the ftrace code, all tracing of all the ftrace
code has been removed. Unfortunately, one of the selftests, relied on
a function to be traced. With the new change, the function was no longer
traced and the test failed.
This patch separates out the test function into its own file so that
we can add the "-pg" flag to the compilation of that function and the
adding of the mcount call to that function.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The selftest validation code checks for valid entries in the trace buffer.
TRACE_STACK and TRACE_SPECIAL have been added to the code but not to
the validator. This patch adds the two to prevent them from flagging a
failure in the selftest.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
printk called from wakeup critical timings and irqs off can
cause deadlocks since printk might do a wakeup itself. If the
call to printk happens with the runqueue lock held, it can
deadlock.
This patch protects the printk from being called in trace irqs off
with a test to see if the runqueue for the current CPU is locked.
If it is locked, the printk is skipped.
The wakeup always holds the runqueue lock, so the printk is
simply removed.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
trace_function is called by mcount and calling wake_up from that
can have unpredictable results. This patch removes the wakeup from
trace_function.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
now that we have a kbuild method for notrace, no need to pollute the
C code with the annotations.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
PowerPC is very fragile when it comes to use of function names
and function addresses. ftrace needs to either use all function
addresses or function names (i.e. my_func as suppose to &my_func).
This patch chooses to use the names and not the addresses, and
makes ftrace consistent.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A new check was added in the ftrace function that wont trace if the CPU
trace buffer is disabled. Unfortunately, other tracers used ftrace() to
write to the buffer after they disabled it. The new disable check makes
these calls into a nop.
This patch changes the __ftrace that is called without the check into a
new api for the other tracers to use, called "trace_function". The other
tracers use this interface instead when the trace CPU buffer is already
disabled.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
no need to backmerge, only affects ftrace-enabled kernels. (which is
not the default)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch is the correct way to have tracing enabled by default.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since ftrace touches practically every function. If we detect any
anomaly, we want to fully disable ftrace. This patch adds code
to try shutdown ftrace as much as possible without doing any more
harm is something is detected not quite correct.
This only kills ftrace, this patch does have checks for other parts of
the tracer (irqsoff, wakeup, etc.).
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The ftrace dynamic function update allocates a record to store the
instruction pointers that are being modified. If the modified
instruction pointer fails to update, then the record is marked as
failed and nothing more is done.
Worse, if the modification fails, but the record ip function is still
called, it will allocate a new record and try again. In just a matter
of time, will this cause a serious memory leak and crash the system.
This patch plugs this memory leak. When a record fails, it is
included back into the pool of records to be used. Now a record may
fail over and over again, but the number of allocated records will
not increase.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some fixes for better output with the trace pipe.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
implement globally synchronized, fast and scalable time source for tracing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
we can hold all cpu trace buffer locks at once - put each into a
separate lock class.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch creates a file called trace_pipe in the tracing
debug directory. This file is a consumer of the trace buffers.
This means that reads of this file consumes the entries from
the trace buffers so that they will not be read a second time,
as contrast to the static buffers latency_trace and trace.
Reading from the trace_pipe will remove the entries from trace
and latency_trace too.
The advantage that trace_pipe has is that it can record live
traces. It will block when there is nothing in the buffer,
and read the entries as they are entered. An EOF happens when
tracing is disabled (tracing_enabled = 0).
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Later patches will need to print the same things as the seq output
does. But those outputs will not use the seq utility. This patch
adds a buffer to the iterator, that can be used by either the
seq utility or other output.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch changes the way the CPU trace buffers are handled.
Instead of always starting from the trace page head, the logic
is changed to a producer consumer logic. This allows for the
buffers to be drained while they are alive.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The tests may leave stuff in the buffers. This resets the buffers
after each test is run. If a test fails, it does not reset the
buffer to avoid touching a buffer that is corrupted.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the trace buffer is detected to be corrupted, then we
disable all tracers.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
fix time offset calculations and ordering, plus make code more consistent.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds a startup self test on dynamic code modification
and filters. The test filters on a specific function, makes sure that
no other function is traced, exectutes the function, then makes sure that
the function is traced.
This patch also fixes a slight bug with the ftrace selftest, where
tracer_enabled was not being set.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch fixes some bugs to the updating of the max trace that
was caused by implementing the new buffering.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Using raw_local_irq_save/restore confuses lockdep.
It's fine to use the normal ones.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add notrace annotations to lockdep to keep ftrace from causing
recursive problems with lock tracing and debugging.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The irqsoff function tracer did a __get_cpu_var to determine
if it should trace the function or not. The problem is that
__get_cpu_var can preempt between getting the CPU and reading
the cpu variable. This means that the cpu variable that is
being read is not from the cpu being run on.
At worst, this can give a false positive, where we trace the
function when we should not. It will never give a false negative
since we only want to trace when interrupts are disabled
and we never preempt when they are.
This fix adds a check after reading the irq flags to only
trace if the interrupts are actually disabled. It also changes
the reading of the cpu variable to use a raw_smp_processor_id
since we now don't care if we preempt. We still catch that fact.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Allocating large buffers for the tracer may fail easily.
This patch converts the buffer from a large ordered allocation
to single pages. It uses the struct page LRU field to link the
pages together.
Later patches may also implement dynamic increasing and decreasing
of the trace buffers.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds two files to the debugfs system:
/debugfs/tracing/available_filter_functions
and
/debugfs/tracing/set_ftrace_filter
The available_filter_functions lists all functions that has been
recorded by the ftraced that has called the ftrace_record_ip function.
This is to allow users to see what functions have been converted
to nops and can be enabled for tracing.
To enable functions, simply echo the names (whitespace delimited)
into set_ftrace_filter. Simple wildcards are also allowed.
echo 'scheduler' > /debugfs/tracing/set_ftrace_filter
Will have only the scheduler be activated when tracing is enabled.
echo 'sched_*' > /debugfs/tracing/set_ftrace_filter
Will have only the functions starting with 'sched_' be activated.
echo '*lock' > /debugfs/tracing/set_ftrace_filter
Will have only functions ending with 'lock' be activated.
echo '*lock*' > /debugfs/tracing/set_ftrace_filter
Will have only functions with 'lock' in its name be activated.
Note: 'sched*lock' will not work. The only wildcards that are
allowed is an asterisk and the beginning and or end of the string
passed in.
Multiple names can be passed in with whitespace delimited:
echo 'scheduler *lock *acpi*' > /debugfs/tracing/set_ftrace_filter
is also the same as:
echo 'scheduler' > /debugfs/tracing/set_ftrace_filter
echo '*lock' >> /debugfs/tracing/set_ftrace_filter
echo '*acpi*' >> /debugfs/tracing/set_ftrace_filter
Appending does just that. It appends to the list.
To disable all filters simply echo an empty line in:
echo > /debugfs/tracing/set_ftrace_filter
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch replaces the indirect call to the mcount function
pointer with a direct call that will be patched by the
dynamic ftrace routines.
On boot up, the mcount function calls the ftace_stub function.
When the dynamic ftrace code is initialized, the ftrace_stub
is replaced with a call to the ftrace_record_ip, which records
the instruction pointers of the locations that call it.
Later, the ftraced daemon will call kstop_machine and patch all
the locations to nops.
When a ftrace is enabled, the original calls to mcount will now
be set top call ftrace_caller, which will do a direct call
to the registered ftrace function. This direct call is also patched
when the function that should be called is updated.
All patching is performed by a kstop_machine routine to prevent any
type of race conditions that is associated with modifying code
on the fly.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch moves the memory management of the ftrace
records out of the arch code and into the generic code
making the arch code simpler.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds back the sysctl ftrace_enabled. This time it is
defaulted to on, if DYNAMIC_FTRACE is configured. When ftrace_enabled
is disabled, the ftrace function is set to the stub return.
If DYNAMIC_FTRACE is also configured, on ftrace_enabled = 0,
the registered ftrace functions will all be set to jmps, but no more
new calls to ftrace recording (used to find the ftrace calling sites)
will be called.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds a feature to dynamically replace the ftrace code
with the jmps to allow a kernel with ftrace configured to run
as fast as it can without it configured.
The way this works, is on bootup (if ftrace is enabled), a ftrace
function is registered to record the instruction pointer of all
places that call the function.
Later, if there's still any code to patch, a kthread is awoken
(rate limited to at most once a second) that performs a stop_machine,
and replaces all the code that was called with a jmp over the call
to ftrace. It only replaces what was found the previous time. Typically
the system reaches equilibrium quickly after bootup and there's no code
patching needed at all.
e.g.
call ftrace /* 5 bytes */
is replaced with
jmp 3f /* jmp is 2 bytes and we jump 3 forward */
3:
When we want to enable ftrace for function tracing, the IP recording
is removed, and stop_machine is called again to replace all the locations
of that were recorded back to the call of ftrace. When it is disabled,
we replace the code back to the jmp.
Allocation is done by the kthread. If the ftrace recording function is
called, and we don't have any record slots available, then we simply
skip that call. Once a second a new page (if needed) is allocated for
recording new ftrace function calls. A large batch is allocated at
boot up to get most of the calls there.
Because we do this via stop_machine, we don't have to worry about another
CPU executing a ftrace call as we modify it. But we do need to worry
about NMI's so all functions that might be called via nmi must be
annotated with notrace_nmi. When this code is configured in, the NMI code
will not call notrace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add preempt off timings. A lot of kernel core code is taken from the RT patch
latency trace that was written by Ingo Molnar.
This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers
Now instead of just tracing irqs off, preemption off can be selected
to be recorded.
When this is selected, it shares the same files as irqs off timings.
One can either trace preemption off, irqs off, or one or the other off.
By echoing "preemptoff" into /debugfs/tracing/current_tracer, recording
of preempt off only is performed. "irqsoff" will only record the time
irqs are disabled, but "preemptirqsoff" will take the total time irqs
or preemption are disabled. Runtime switching of these options is now
supported by simpling echoing in the appropriate trace name into
/debugfs/tracing/current_tracer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds the latency tracer infrastructure. This patch
does not add anything that will select and turn it on, but will
be used by later patches.
If it were to be compiled, it would add the following files
to the debugfs:
The root tracing directory:
/debugfs/tracing/
This patch also adds the following files:
available_tracers
list of available tracers. Currently no tracers are
available. Looking into this file only shows
"none" which is used to unregister all tracers.
current_tracer
The trace that is currently active. Empty on start up.
To switch to a tracer simply echo one of the tracers that
are listed in available_tracers:
example: (used with later patches)
echo function > /debugfs/tracing/current_tracer
To disable the tracer:
echo disable > /debugfs/tracing/current_tracer
tracing_enabled
echoing "1" into this file starts the ftrace function tracing
(if sysctl kernel.ftrace_enabled=1)
echoing "0" turns it off.
latency_trace
This file is readonly and holds the result of the trace.
trace
This file outputs a easier to read version of the trace.
iter_ctrl
Controls the way the output of traces look.
So far there's two controls:
echoing in "symonly" will only show the kallsyms variables
without the addresses (if kallsyms was configured)
echoing in "verbose" will change the output to show
a lot more data, but not very easy to understand by
humans.
echoing in "nosymonly" turns off symonly.
echoing in "noverbose" turns off verbose.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If CONFIG_FTRACE is selected and /proc/sys/kernel/ftrace_enabled is
set to a non-zero value the ftrace routine will be called everytime
we enter a kernel function that is not marked with the "notrace"
attribute.
The ftrace routine will then call a registered function if a function
happens to be registered.
[ This code has been highly hacked by Steven Rostedt and Ingo Molnar,
so don't blame Arnaldo for all of this ;-) ]
Update:
It is now possible to register more than one ftrace function.
If only one ftrace function is registered, that will be the
function that ftrace calls directly. If more than one function
is registered, then ftrace will call a function that will loop
through the functions to call.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The tracer wants to be able to convert the state number
into a user visible character. This patch pulls that conversion
string out the scheduler into the header. This way if it were to
ever change, other parts of the kernel will know.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
add 3 lightweight callbacks to the tracer backend.
zero impact if tracing is turned off.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change references from for_each_cpu_mask to for_each_cpu_mask_nr
where appropriate
Reviewed-by: Paul Jackson <pj@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Change references from for_each_cpu_mask to for_each_cpu_mask_nr
where appropriate
Reviewed-by: Paul Jackson <pj@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Replace usages of MAX_NUMNODES with nr_node_ids in kernel/sched.c,
where appropriate. This saves some allocated space as well as many
wasted cycles going through node entries that are non-existent.
For inclusion into sched-devel/latest tree.
Based on:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
+ sched-devel/latest .../mingo/linux-2.6-sched-devel.git
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On kvm I have seen some rare hangs in stop_machine when I used more guest
cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the
hang quite often. I could also reproduce the problem on a 4 way z/VM host with
a 64 way guest.
It turned out that the guest was consuming all available cpus mostly for
spinning on scheduler locks like rq->lock. This is expected as the threads are
calling yield all the time.
The problem is now, that the host scheduling decisings together with the guest
scheduling decisions and spinlocks not being fair managed to create an
interesting scenario similar to a live lock. (Sometimes the hang resolved
itself after some minutes)
Changing stop_machine to yield the cpu to the hypervisor when yielding inside
the guest fixed the problem for me. While I am not completely happy with this
patch, I think it causes no harm and it really improves the situation for me.
I used cpu_relax for yielding to the hypervisor, does that work on all
architectures?
p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use
stop_machine_run and both triggered the problem after some retries.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The comment was correct -- need to make the code match the comment.
Without this patch, if a CPU goes dynticks idle (and stays there forever)
in just the right phase of preemptible-RCU grace-period processing,
grace periods stall. The offending sequence of events (courtesy
of Promela/spin, at least after I got the liveness criterion coded
correctly...) is as follows:
o CPU 0 is in dynticks-idle mode. Its dynticks_progress_counter
is (say) 10.
o CPU 0 takes an interrupt, so rcu_irq_enter() increments CPU 0's
dynticks_progress_counter to 11.
o CPU 1 is doing RCU grace-period processing in rcu_try_flip_idle(),
sees rcu_pending(), so invokes dyntick_save_progress_counter(),
which in turn takes a snapshot of CPU 0's dynticks_progress_counter
into CPU 0's rcu_dyntick_snapshot -- now set to 11. CPU 1 then
updates the RCU grace-period state to rcu_try_flip_waitack().
o CPU 0 returns from its interrupt, so rcu_irq_exit() increments
CPU 0's dynticks_progress_counter to 12.
o CPU 1 later invokes rcu_try_flip_waitack(), which notices that
CPU 0 has not yet responded, and hence in turn invokes
rcu_try_flip_waitack_needed(). This function examines the
state of CPU 0's dynticks_progress_counter and rcu_dyntick_snapshot
variables, which it copies to curr (== 12) and snap (== 11),
respectively.
Because curr!=snap, the first condition fails.
Because curr-snap is only 1 and snap is odd, the second
condition fails.
rcu_try_flip_waitack_needed() therefore incorrectly concludes
that it must wait for CPU 0 to explicitly acknowledge the
counter flip.
o CPU 0 remains forever in dynticks-idle mode, never taking
any more hardware interrupts or any NMIs, and never running
any more tasks. (Of course, -something- will usually eventually
happen, which might be why we haven't seen this one in the
wild. Still should be fixed!)
Therefore the grace period never ends. Fix is to make the code match
the comment, as shown below. With this fix, the above scenario
would be satisfied with curr being even, and allow the grace period
to proceed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Josh Triplett <josh@kernel.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move rcu-protected lists from list.h into a new header file rculist.h.
This is done because list are a very used primitive structure all over the
kernel and it's currently impossible to include other header files in this
list.h without creating some circular dependencies.
For example, list.h implements rcu-protected list and uses rcu_dereference()
without including rcupdate.h. It actually compiles because users of
rcu_dereference() are macros. Others RCU functions could be used too but
aren't probably because of this.
Therefore this patch creates rculist.h which includes rcupdates without to
many changes/troubles.
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add entry to rcu_torture_ops allowing the correct barrier function to
be used upon exit from rcutorture. Also add torture options for the
new call_rcu_sched() API.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add rcu_barrier_sched() and rcu_barrier_bh(). With these in place,
rcutorture no longer gives the occasional oops when repeatedly starting
and stopping torturing rcu_bh. Also adds the API needed to flush out
pre-existing call_rcu_sched() callbacks.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add comments to the logic that infers quiescent states when interrupting
from either user mode or the idle loop. Also add a memory barrier: it
appears that James Huang was in fact onto something, as the scheduler
is much less synchronization happy than it once was, so we can no longer
rely on its memory barriers in all cases.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: James Huang <jamesclhuang@yahoo.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fourth cut of patch to provide the call_rcu_sched(). This is again to
synchronize_sched() as call_rcu() is to synchronize_rcu().
Should be fine for experimental and -rt use, but not ready for inclusion.
With some luck, I will be able to tell Andrew to come out of hiding on
the next round.
Passes multi-day rcutorture sessions with concurrent CPU hotplugging.
Fixes since the first version include a bug that could result in
indefinite blocking (spotted by Gautham Shenoy), better resiliency
against CPU-hotplug operations, and other minor fixes.
Fixes since the second version include reworking grace-period detection
to avoid deadlocks that could happen when running concurrently with
CPU hotplug, adding Mathieu's fix to avoid the softlockup messages,
as well as Mathieu's fix to allow use earlier in boot.
Fixes since the third version include a wrong-CPU bug spotted by
Andrew, getting rid of the obsolete synchronize_kernel API that somehow
snuck back in, merging spin_unlock() and local_irq_restore() in a
few places, commenting the code that checks for quiescent states based
on interrupting from user-mode execution or the idle loop, removing
some inline attributes, and some code-style changes.
Known/suspected shortcomings:
o I still do not entirely trust the sleep/wakeup logic. Next step
will be to use a private snapshot of the CPU online mask in
rcu_sched_grace_period() -- if the CPU wasn't there at the start
of the grace period, we don't need to hear from it. And the
bit about accounting for changes in online CPUs inside of
rcu_sched_grace_period() is ugly anyway.
o It might be good for rcu_sched_grace_period() to invoke
resched_cpu() when a given CPU wasn't responding quickly,
but resched_cpu() is declared static...
This patch also fixes a long-standing bug in the earlier preemptable-RCU
implementation of synchronize_rcu() that could result in loss of
concurrent external changes to a task's CPU affinity mask. I still cannot
remember who reported this...
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
rcu_batches_completed and rcu_patches_completed_bh are both declared
in rcuclassic.h and rcupreempt.h. This patch removes the extra
prototypes for them from rcupdate.h.
rcu_batches_completed_bh is defined as a static inline in the rcupreempt.h
header file. Trying to export this as EXPORT_SYMBOL_GPL causes linking problems
with the powerpc linker. There's no need to export a static inlined function.
Modules must be compiled with the same type of RCU implementation as the
kernel they are for.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
All uses of list_for_each_rcu() can be profitably replaced by the
easier-to-use list_for_each_entry_rcu(). This patch makes this change
for the Audit system, in preparation for removing the list_for_each_rcu()
API entirely. This time with well-formed SOB.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663
Reporter: Daniel Marjamki <danielm77@spray.se>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and
we don't want size calculation for ->fd[] to overflow.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, the mutex debug code checks the lock->owner before lock->magic, so
a corrupt mutex will most likely result in failing the owner check, rather
than the magic check.
This change to debug_mutex_unlock does the magic check first, so
we have a better idea of what breaks.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a common hex array in hexdump.c so everyone can use it.
Add a common hi/lo helper to avoid the shifting masking that is
done to get the upper and lower nibbles of a byte value.
Pull the pack_hex_byte helper from kgdb as it is opencoded many
places in the tree that will be consolidated.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Return type of cpu_rt_runtime_write() should be int instead of ssize_t.
Signed-off-by: Mirco Tischler <mt-ml@gmx.de>
Acked-by: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It acts exactly like a regular 'cond_resched()', but will not get
optimized away when CONFIG_PREEMPT is set.
Normal kernel code is already preemptable in the presense of
CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
02b67cc3ba "sched: do not do
cond_resched() when CONFIG_PREEMPT").
But when wanting to conditionally reschedule while holding a lock, you
need to use "cond_sched_lock(lock)", and the new function is the BKL
equivalent of that.
Also make fs/locks.c use it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generic semaphore rewrite had a huge performance regression on AIM7
(and potentially other BKL-heavy benchmarks) because the generic
semaphores had been rewritten to be simple to understand and fair. The
latter, in particular, turns a semaphore-based BKL implementation into a
mess of scheduling.
The attempt to fix the performance regression failed miserably (see the
previous commit 00b41ec261 'Revert
"semaphore: fix"'), and so for now the simple and sane approach is to
instead just go back to the old spinlock-based BKL implementation that
never had any issues like this.
This patch also has the advantage of being reported to fix the
regression completely according to Yanmin Zhang, unlike the semaphore
hack which still left a couple percentage point regression.
As a spinlock, the BKL obviously has the potential to be a latency
issue, but it's not really any different from any other spinlock in that
respect. We do want to get rid of the BKL asap, but that has been the
plan for several years.
These days, the biggest users are in the tty layer (open/release in
particular) and Alan holds out some hope:
"tty release is probably a few months away from getting cured - I'm
afraid it will almost certainly be the very last user of the BKL in
tty to get fixed as it depends on everything else being sanely locked."
so while we're not there yet, we do have a plan of action.
Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit bf726eab37, as it has
been reported to cause a regression with processes stuck in __down(),
apparently because some missing wakeup.
Quoth Sven Wegener:
"I'm currently investigating a regression that has showed up with my
last git pull yesterday. Bisecting the commits showed bf726e
"semaphore: fix" to be the culprit, reverting it fixed the issue.
Symptoms: During heavy filesystem usage (e.g. a kernel compile) I get
several compiler processes in uninterruptible sleep, blocking all i/o
on the filesystem. System is an Intel Core 2 Quad running a 64bit
kernel and userspace. Filesystem is xfs on top of lvm. See below for
the output of sysrq-w."
See
http://lkml.org/lkml/2008/5/10/45
for full report.
In the meantime, we can just fix the BKL performance regression by
reverting back to the good old BKL spinlock implementation instead,
since any sleeping lock will generally perform badly, especially if it
tries to be fair.
Reported-by: Sven Wegener <sven.wegener@stealer.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus found a logic bug: we ignore the version number in a module's
vermagic string if we have CONFIG_MODVERSIONS set, but modversions
also lets through a module with no __versions section for modprobe
--force (with tainting, but still).
We should only ignore the start of the vermagic string if the module
actually *has* crcs to check. Rather than (say) having an
entertaining hissy fit and creating a config option to work around the
buggy code.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We allow missing __versions sections, because modprobe --force strips
it. It makes less sense to allow sections where there's no version
for a specific symbol the module uses, so disallow that.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
Revert "relay: fix splice problem"
docbook: fix bio missing parameter
block: use unitialized_var() in bio_alloc_bioset()
block: avoid duplicate calls to get_part() in disk stat code
cfq-iosched: make io priorities inherit CPU scheduling class as well as nice
block: optimize generic_unplug_device()
block: get rid of likely/unlikely predictions in merge logic
vfs: splice remove_suid() cleanup
cfq-iosched: fix RCU race in the cfq io_context destructor handling
block: adjust tagging function queue bit locking
block: sysfs store function needs to grab queue_lock and use queue_flag_*()
Due to a merge conflict, the sched_relax_domain_level control file was marked
as being handled by cpuset_read/write_u64, but the code to handle it was
actually in cpuset_common_file_read/write.
Since the value being written/read is in fact a signed integer, it should be
treated as such; this patch adds cpuset_read/write_s64 functions, and uses
them to handle the sched_relax_domain_level file.
With this patch, the sched_relax_domain_level can be read and written, and the
correct contents seen/updated.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The conversion between virtual and real time is as follows:
dvt = rw/w * dt <=> dt = w/rw * dvt
Since we want the fair sleeper granularity to be in real time, we actually
need to do:
dvt = - rw/w * l
This bug could be related to the regression reported by Yanmin Zhang:
| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
| of regressions with 2.6.26-rc1:
|
| 1) 8-core stoakley: 28%;
| 2) 16-core tigerton: 20%;
| 3) Itanium Montvale: 50%.
Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yanmin Zhang reported:
| Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th
| regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton,
| and Itanium Montecito. Bisect located the patch below:
|
| 64ac24e738 is first bad commit
| commit 64ac24e738
| Author: Matthew Wilcox <matthew@wil.cx>
| Date: Fri Mar 7 21:55:58 2008 -0500
|
| Generic semaphore implementation
|
| After I manually reverted the patch against 2.6.26-rc1 while fixing
| lots of conflicts/errors, aim7 regression became less than 2%.
i reproduced the AIM7 workload and can confirm Yanmin's findings that
-.26-rc1 regresses over .25 - by over 67% here.
Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing suboptimal
performance.
The problem comes from the following code. The new semaphore code does
this on down():
spin_lock_irqsave(&sem->lock, flags);
if (likely(sem->count > 0))
sem->count--;
else
__down(sem);
spin_unlock_irqrestore(&sem->lock, flags);
and this on up():
spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);
where __up() does:
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
and where __down() does this in essence:
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = 0;
for (;;) {
[...]
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;
}
the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter from
the wait list.
That is all and fine in principle, but as implemented in
kernel/semaphore.c it also creates a nasty, hidden source of contention!
The contention comes from the following property of the new semaphore
code: the new owner owns the semaphore exclusively, even if it is not
running yet.
So if the old owner, even if just a few instructions later, does a
down() [lock_kernel()] again, it will be blocked and will have to wait
on the new owner to eventually be scheduled (possibly on another CPU)!
Or if another task gets to lock_kernel() sooner than the "new owner"
scheduled, it will be blocked unnecessarily and for a very long time
when there are 2000 tasks running.
I.e. the implementation of the new semaphores code does wake-one and
lock ownership in a very restrictive way - it does not allow
opportunistic re-locking of the lock at all and keeps the scheduler from
picking task order intelligently.
This kind of scheduling, with 2000 AIM7 processes running, creates awful
cross-scheduling between those 2000 tasks, causes reduced parallelism, a
throttled runqueue length and a lot of idle time. With increasing number
of CPUs it causes an exponentially worse behavior in AIM7, as the chance
for a newly woken new-owner task to actually run anytime soon is less
and less likely.
Note that it takes just a tiny bit of contention for the 'new-semaphore
catastrophy' to happen: the wakeup latencies get added to whatever small
contention there is, and quickly snowball out of control!
I believe Yanmin's findings and numbers support this analysis too.
The best fix for this problem is to use the same scheduling logic that
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and
wanted because we do not want to over-schedule), but also allow
opportunistic locking of the lock even if a wakee is already "in
flight".
The patch below implements this new logic. With this patch applied the
AIM7 regression is largely fixed on my quad testbox:
# v2.6.25 vanilla:
..................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 56096.4 91 207.5 789.7 0.4675
2000 55894.4 94 208.2 792.7 0.4658
# v2.6.26-rc1-166-gc0a1811 vanilla:
...................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 33230.6 83 350.3 784.5 0.2769
2000 31778.1 86 366.3 783.6 0.2648
# v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
...............................................
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55707.1 92 209.0 795.6 0.4642
2000 55704.4 96 209.0 796.0 0.4642
i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
performance levels and have zero idle time during the test, as expected.
Btw., interactivity also improved dramatically with the fix - for
example console-switching became almost instantaneous during this
workload (which after all is running 2000 tasks at once!), without the
patch it was stuck for a minute at times.
There's another nice side-effect of this speedup patch, the new generic
semaphore code got even smaller:
text data bss dec hex filename
1241 0 0 1241 4d9 semaphore.o.before
1207 0 0 1207 4b7 semaphore.o.after
(because the waiter.up complication got removed.)
Longer-term we should look into using the mutex code for the generic
semaphore code as well - but i's not easy due to legacies and it's
outside of the scope of v2.6.26 and outside the scope of this patch as
well.
Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this replaces the rq->clock stuff (and possibly cpu_clock()).
- architectures that have an 'imperfect' hardware clock can set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
- the 'jiffie' window might be superfulous when we update tick_gtod
before the __update_sched_clock() call in sched_clock_tick()
- cpu_clock() might be implemented as:
sched_clock_cpu(smp_processor_id())
if the accuracy proves good enough - how far can TSC drift in a
single jiffie when considering the filtering and idle hooks?
[ mingo@elte.hu: various fixes and cleanups ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
David Miller pointed it out that nothing in cpu_clock() sets
prev_cpu_time. This caused __sync_cpu_clock() to be called
all the time - against the intention of this code.
The result was that in practice we hit a global spinlock every
time cpu_clock() is called - which - even though cpu_clock()
is used for tracing and debugging, is suboptimal.
While at it, also:
- move the irq disabling to the outest layer,
this should make cpu_clock() warp-free when called with irqs
enabled.
- use long long instead of cycles_t - for platforms where cycles_t
is 32-bit.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When I echoed 0 into the "cpu.shares" file, a Div0 error occured.
We found it is caused by the following calling.
sched_group_set_shares(tg, shares)
set_se_shares(tg->se[i], shares/nr_cpu_ids)
__set_se_shares(se, shares)
div64_64((1ULL<<32), shares)
When the echoed value was less than the number of processores, the result of the
sentence "shares/nr_cpu_ids" was 0, and then the system called div64() to divide
the result, the Div0 error occured.
It is unnecessary that the shares value is divided by nr_cpu_ids, I think.
Because in the function __update_group_shares_cpu() and init_tg_cfs_entry(),
the shares value isn't divided by nr_cpu_ids when setting shares of the sched
entity.
This patch fixes this bug. And echoing ULONG_MAX value into cpu.shares also
causes Div0 error, so we set a macro MAX_SHARES to limit the max value of
shares.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Concurrent calls to detach_destroy_domains and arch_init_sched_domains
were prevented by the old scheduler subsystem cpu hotplug mutex. When
this got converted to get_online_cpus() the locking got broken.
Unlike before now several processes can concurrently enter the critical
sections that were protected by the old lock.
So use the already present doms_cur_mutex to protect these sections again.
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Revert debugging commit 7ba2e74ab5.
print_cfs_rq_tasks() can induce live-lock if a task is dequeued
during list traversal.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
http://bugzilla.kernel.org/show_bug.cgi?id=10545
sched_stats.h says that __sched_info_switch is "called when prev !=
next" in the comment. sched.c should therefore do that.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Gautham R Shenoy reported:
> While running the usual CPU-Hotplug stress tests on linux-2.6.25,
> I noticed the following in the console logs.
>
> This is a wee bit difficult to reproduce. In the past 10 runs I hit this
> only once.
>
> ------------[ cut here ]------------
>
> WARNING: at kernel/sched.c:962 hrtick+0x2e/0x65()
>
> Just wondering if we are doing a good job at handling the cancellation
> of any per-cpu scheduler timers during CPU-Hotplug.
This looks like its indeed not cancelled at all and migrates the it to
another cpu. Fix it via a proper hotplug notifier mechanism.
Reported-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We currently use an optimization to skip the overhead of wake-idle
processing if more than one task is assigned to a run-queue. The
assumption is that the system must already be load-balanced or we
wouldnt be overloaded to begin with.
The problem is that we are looking at rq->nr_running, which may include
RT tasks in addition to CFS tasks. Since the presence of RT tasks
really has no bearing on the balance status of CFS tasks, this throws
the calculation off.
This patch changes the logic to only consider the number of CFS tasks
when making the decision to optimze the wake-idle.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where we
will always evaluate to "true". You can find the thread here:
http://lkml.org/lkml/2008/4/22/296
In reality, we only want to try to push tasks away when a wake up request is
not going to preempt the current task. So lets fix it.
Note: We introduce test_tsk_need_resched() instead of open-coding the flag
check so that the merge-conflict with -rt should help remind us that we
may need to support NEEDS_RESCHED_DELAYED in the future, too.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Noticed by sparse:
kernel/sched.c:760:20: warning: symbol 'sched_feat_names' was not declared. Should it be static?
kernel/sched.c:767:5: warning: symbol 'sched_feat_open' was not declared. Should it be static?
kernel/sched_fair.c:845:3: warning: returning void-valued expression
kernel/sched.c:4386:3: warning: returning void-valued expression
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The C files are included directly in sched.c, so they are
effectively static.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Joel noticed that the !lw->inv_weight contition isn't unlikely anymore so
remove the unlikely annotation. Also, remove the two div64_u64() inv_weight
calculations, which makes them rely on the calc_delta_mine() path as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Joel Schopp <jschopp@austin.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Normalized sleeper uses calc_delta*() which requires that the rq load is
already updated, so move account_entity_enqueue() before place_entity()
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kgdb: kconfig fix xconfig/menuconfig element
kgdb: fix signedness mixmatches, add statics, add declaration to header
kgdb: 1000 loops for the single step test in kgdbts
kgdb: trivial sparse fixes in kgdb test-suite
kgdb: minor documentation fixes
Since FUTEX_FD was scheduled for removal in June 2007 lets remove it.
Google Code search found no users for it and NGPT was abandoned in 2003
according to IBM. futex.h is left untouched to make sure the id does
not get reassigned. Since queue_me() has no users left it is commented
out to avoid a warning, i didnt remove it completely since it is part of
the internal api (matching unqueue_me())
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed rest)
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Noticed by sparse:
arch/x86/kernel/kgdb.c:556:15: warning: symbol 'kgdb_arch_pc' was not declared. Should it be static?
kernel/kgdb.c:149:8: warning: symbol 'kgdb_do_roundup' was not declared. Should it be static?
kernel/kgdb.c:193:22: warning: symbol 'kgdb_arch_pc' was not declared. Should it be static?
kernel/kgdb.c:712:5: warning: symbol 'remove_all_break' was not declared. Should it be static?
Related to kgdb_hex2long:
arch/x86/kernel/kgdb.c:371:28: warning: incorrect type in argument 2 (different signedness)
arch/x86/kernel/kgdb.c:371:28: expected long *long_val
arch/x86/kernel/kgdb.c:371:28: got unsigned long *<noident>
kernel/kgdb.c:469:27: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:469:27: expected long *long_val
kernel/kgdb.c:469:27: got unsigned long *<noident>
kernel/kgdb.c:470:27: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:470:27: expected long *long_val
kernel/kgdb.c:470:27: got unsigned long *<noident>
kernel/kgdb.c:894:27: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:894:27: expected long *long_val
kernel/kgdb.c:894:27: got unsigned long *<noident>
kernel/kgdb.c:895:27: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:895:27: expected long *long_val
kernel/kgdb.c:895:27: got unsigned long *<noident>
kernel/kgdb.c:1127:28: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:1127:28: expected long *long_val
kernel/kgdb.c:1127:28: got unsigned long *<noident>
kernel/kgdb.c:1132:25: warning: incorrect type in argument 2 (different signedness)
kernel/kgdb.c:1132:25: expected long *long_val
kernel/kgdb.c:1132:25: got unsigned long *<noident>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The kernel module loader used to be much too happy to allow loading of
modules for the wrong kernel version by default. For example, if you
had MODVERSIONS enabled, but tried to load a module with no version
info, it would happily load it and taint the kernel - whether it was
likely to actually work or not!
Generally, such forced module loading should be considered a really
really bad idea, so make it conditional on a new config option
(MODULE_FORCE_LOAD), and make it default to off.
If somebody really wants to force module loads, that's their problem,
but we should not encourage it. Especially as it happened to me by
mistake (ie regular unversioned Fedora modules getting loaded) causing
lots of strange behavior.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no harm, when users can read the info and we ask often enough
during debugging for this kind of information.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
File permissions for
/sys/devices/system/clocksource/clocksource0/available_clocksource
are 600 which allows write access. But this is in fact a read only
file. So change permissions to 400.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The helper function hrtimer_callback_running() is used in
kernel/hrtimer.c as well as in the updated net/can/bcm.c which now
supports hrtimers. Moving the helper function to hrtimer.h removes the
duplicate definition in the C-files.
Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Force constants in kernel/timeconst.h (except shift counts) to be 64 bits,
using U64_C() constructor macros, and eliminate constants that cannot
be represented at all in 64 bits. This avoids warnings with some gcc
versions.
Drop generating 64-bit constants, since we have no real hope of
getting a full set (operation on 64-bit values requires a 128-bit
intermediate result, which gcc only supports on 64-bit platforms, and
only with libgcc support on some.) Note that the use of these
constants does not depend on if we are on a 32- or 64-bit architecture.
This resolves Bugzilla 10153.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Uwe Kleine-Koenig has some strange hardware where one of the shared
interrupts can be asserted during boot before the appropriate driver
loads. Requesting the shared irq line from another driver result in a
spurious interrupt storm which finally disables the interrupt line.
I have seen similar behaviour on resume before (the hardware does not
work anymore so I can not verify).
Change the spurious disable logic to increment the disable depth and
mark the interrupt with an extra flag which allows us to reenable the
interrupt when a new driver arrives which requests the same irq
line. In the worst case this will disable the irq again via the
spurious trap, but there is a decent chance that the new driver is the
one which can handle the already asserted interrupt and makes the box
usable again.
Eric Biederman said further: This case also happens on a regular basis
in kdump kernels where we deliberately don't shutdown the hardware
before starting the new kernel. This patch should reduce the need for
using irqpoll in that situation by a small amount.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-and-Acked-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>
With s390 the last arch switched to the generic sys_ptrace yesterday so
we can now kill the ifdef around it to enforce every new port it using
it instead of introducing new weirdo versions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
timer_stats_timer_set_start_info is invoked twice, additionally, the
invocation of this function can be moved to where it is only called when a
delay is really required.
Signed-off-by: Andrew Liu <shengping.liu@windriver.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The extended crashkernel syntax is a little confusing in the way it handles
ranges. eg:
crashkernel=512M-2G:64M,2G-:128M
Means if the machine has between 512M and 2G of memory the crash region should
be 64M, and if the machine has 2G of memory the region should be 64M. Only if
the machine has more than 2G memory will 128M be allocated.
Although that semantic is correct, it is somewhat baffling. Instead I propose
that the end of the range means the first address past the end of the range,
ie: 512M up to but not including 2G.
[bwalle@suse.de: clarify inclusive/exclusive in crashkernel commandline in documentation]
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Bernhard Walle <bwalle@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the leap second handling from second_overflow(), which doesn't have to
check for it every second anymore. With CONFIG_NO_HZ this also makes sure the
leap second is handled close to the full second. Additionally this makes it
possible to abort a leap second properly by resetting the STA_INS/STA_DEL
status bits.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
current_tick_length used to do a little more, but now it just returns
tick_length, which we can also access directly at the few places, where it's
needed.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As TICK_LENGTH_SHIFT is used for more than just the tick length, the name
isn't quite approriate anymore, so this renames it to NTP_SCALE_SHIFT.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support for setting the TAI value (International Atomic Time). The
value is reported back to userspace via timex (as we don't have a
ntp_gettime() syscall).
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
time_offset is already a 64bit value but its resolution barely used, so this
makes better use of it by replacing SHIFT_UPDATE with TICK_LENGTH_SHIFT.
Side note: the SHIFT_HZ in SHIFT_UPDATE was incorrect for CONFIG_NO_HZ and the
primary reason for changing time_offset to 64bit to avoid the overflow.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes time_freq to a 64bit value and makes it static (the only outside
user had no real need to modify it). Intermediate values were already 64bit,
so the change isn't that big, but it saves a little in shifts by replacing
SHIFT_NSEC with TICK_LENGTH_SHIFT. PPM_SCALE is then used to convert between
user space and kernel space representation.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a few more things from the ntp nanokernel related to user space.
It's now possible to select the resolution used of some values via STA_NANO
and the kernel reports in which mode it works (pll/fll).
If some values for adjtimex() are outside the acceptable range, they are now
simply normalized instead of letting the syscall fail. I removed
MOD_CLKA/MOD_CLKB as the mapping didn't really makes any sense, the kernel
doesn't support setting the clock.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is mostly a style cleanup of ntp.c and extracts part of do_adjtimex as
ntp_update_offset(). Otherwise the functionality is still the same as before.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86 is the only arch right now, which provides an optimized for
div_long_long_rem and it has the downside that one has to be very careful that
the divide doesn't overflow.
The API is a little akward, as the arguments for the unsigned divide are
signed. The signed version also doesn't handle a negative divisor and
produces worse code on 64bit archs.
There is little incentive to keep this API alive, so this converts the few
users to the new API.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename div64_64 to div64_u64 to make it consistent with the other divide
functions, so it clearly includes the type of the divide. Move its definition
to math64.h as currently no architecture overrides the generic implementation.
They can still override it of course, but the duplicated declarations are
avoided.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This converts a few users of do_div to div_[su]64 and this demonstrates nicely
how it can reduce some expressions to one-liners.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
currently cpu hotplug (unplug) seems broken on s390 and likely others. On cpu
unplug the system starts to behave very strange and hangs.
I bisected the problem to the following commit:
commit 48f20a9a94
Author: Olof Johansson <olof@lixom.net>
Date: Tue Mar 4 15:23:25 2008 -0800
tasklets: execute tasklets in the same order they were queued
Reverting this patch seems to fix the problem. I looked into takeover_tasklet
and it seems that there is a way to corrupt the tail pointer of the current
cpu. If the tasklet list of the frozen cpu is empty, the tail pointer of the
current cpu points to the address of the head pointer of the stopped cpu and
not to the next pointer of a tasklet_struct.
This patch avoids the list splice of the list is empty and cpu hotplug seems
to work as the tail pointer is not corrupted. Olof, can you look into that
patch and ACK/NACK it so Andrew can push this to Linus, if appropriate?
Please note that some lines are longer than 80 chars, but line-wrapping looked
worse that this version.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide module unload callback. Required by the gcov profiling
infrastructure to keep track of profiling data structures.
Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Make verify_export_symbols check the modules unused, unused_gpl and
gpl_future syms.
Inspired by Jan Beulich's fix, but table-driven.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Obvious typo, but I don't know of any modules with unused GPL exports,
and then it would take someone noticing that the version shouldn't
have matched in a dependent module.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
__find_symbol() has grown over time: there are now 5 different arrays
of symbols it traverses. It also shouldn't print out a warning on
some calls (ie. verify_symbol which simply checks for name clashes,
and __symbol_put which checks for bugs).
1) Rename to find_symbol: no need for underscores.
2) Use bool and add "warn" parameter to suppress warnings.
3) Make table-driven rather than open coded.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (179 commits)
ACPI: Fix acpi_processor_idle and idle= boot parameters interaction
acpi: fix section mismatch warning in pnpacpi
intel_menlo: fix build warning
ACPI: Cleanup: Remove unneeded, multiple local dummy variables
ACPI: video - fix permissions on some proc entries
ACPI: video - properly handle errors when registering proc elements
ACPI: video - do not store invalid entries in attached_array list
ACPI: re-name acpi_pm_ops to acpi_suspend_ops
ACER_WMI/ASUS_LAPTOP: fix build bug
thinkpad_acpi: fix possible NULL pointer dereference if kstrdup failed
ACPI: check a return value correctly in acpi_power_get_context()
#if 0 acpi/bay.c:eject_removable_drive()
eeepc-laptop: add hwmon fan control
eeepc-laptop: add backlight
eeepc-laptop: add base driver
ACPI: thinkpad-acpi: bump up version to 0.20
ACPI: thinkpad-acpi: fix selects in Kconfig
ACPI: thinkpad-acpi: use a private workqueue
ACPI: thinkpad-acpi: fluff really minor fix
ACPI: thinkpad-acpi: use uppercase for "LED" on user documentation
...
Fixed conflicts in drivers/acpi/video.c and drivers/misc/intel_menlow.c
manually.
hrtimers have now dynamic users in the network code. Put them under
debugobjects surveillance as well.
Add calls to the generic object debugging infrastructure and provide fixup
functions which allow to keep the system alive when recoverable problems have
been detected by the object debugging core code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add calls to the generic object debugging infrastructure and provide fixup
functions which allow to keep the system alive when recoverable problems have
been detected by the object debugging core code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use kmem_cache_zalloc(), remove large amounts of initialisation code and
ifdeffery.
Note: this assumes that memset(*atomic_t, 0) correctly initialises the
atomic_t. This is true for all present archtiectures and if it becomes false
for a future architecture then we'll need to make large changes all over the
place anyway.
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix update_console_cmdline() not to to read beyond the terminating zero of its
name argument.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds a minimalistic braille screen reader support. This is meant to
be used by blind people e.g. on boot failures or when / cannot be mounted
etc and thus the userland screen readers can not work.
[akpm@linux-foundation.org: fix exports]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Jiri Kosina <jikos@jikos.cz>
Cc: Dmitry Torokhov <dtor@mail.ru>
Acked-by: Alan Cox <alan@redhat.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().
Misc cleanups:
- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almst all users will want to have them toghether
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These values represent the nesting level of a namespace and pids living in it,
and it's always non-negative.
Turning this from int to unsigned int saves some space in pid.c (11 bytes on
x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit.
E.g. on ia64 this removes the sign extension calls, which compiler adds to
optimize access to pid->nubers[ns->level].
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the needlessly global marker_debug being static gcc can optimize the
unused code away.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. sys_getpgid() needs rcu_read_lock() to derive the pgrp _nr, even if
the task is current, otherwise we can race with another thread which
does sys_setpgid().
2. Use rcu_read_lock() instead of tasklist_lock when pid != 0, make sure
that we don't use the NULL pid if the task exits right after successful
find_task_by_vpid().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. sys_getsid() needs rcu_read_lock() to derive the session _nr, even if
the task is current, otherwise we can race with another thread which
does sys_setsid().
2. The task can exit between find_task_by_vpid() and task_session_vnr(),
in that unlikely case sys_getsid() returns 0 instead of -ESRCH.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use change_pid() instead of detach_pid() + attach_pid() in
__set_special_pids().
This way task_session() is not NULL in between.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use change_pid() instead of detach_pid() + attach_pid() in sys_setpgid().
This way task_pgrp() is not NULL in between.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on Eric W. Biederman's idea.
Without tasklist_lock held task_session()/task_pgrp() can return NULL if the
caller races with setprgp()/setsid() which does detach_pid() + attach_pid().
This can happen even if task == current.
Intoduce the new helper, change_pid(), which should be used instead. This way
the caller always sees the special pid != NULL, either old or new.
Also change the prototype of attach_pid(), it always returns 0 and nobody
check the returned value.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on Eric W. Biederman's idea.
Unless task == current, without tasklist_lock held task_session()/task_pgrp()
can return NULL if the caller races with de_thread() which switches the group
leader.
Change transfer_pid() to not clear old->pids[type].pid for the old leader.
This means that its .pid can point to "nowhere", but this is already true for
sub-threads, and the old leader is not group_leader() any longer. IOW, with
or without this change we can't trust task's special pids unless it is the
group leader.
With this change the following code
rcu_read_lock();
task = find_task_by_xxx();
do_something(task_pgrp(task), task_session(task));
rcu_read_unlock();
can't race with exec and hit the NULL pid.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are some places that are known to operate on tasks'
global pids only:
* the rest_init() call (called on boot)
* the kgdb's getthread
* the create_kthread() (since the kthread is run in init ns)
So use the find_task_by_pid_ns(..., &init_pid_ns) there
and schedule the find_task_by_pid for removal.
[sukadev@us.ibm.com: Fix warning in kernel/pid.c]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pid to lookup a task by is passed inside taskstats code via genetlink
message.
Since netlink packets are now processed in the context of the sending task,
this is correct to lookup the task with find_task_by_vpid() here.
Besides, I fix the call to fill_pid() from taskstats_exit(), since the
tsk->pid is not required in fill_pid() in this case, and the pid field on
task_struct is going to be deprecated as well.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Jonathan Lim <jlim@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callers of free_pidmap() pass 2 members of "struct upid", we can just
pass "struct upid *" instead. Shaves off 10 bytes from pid.o.
Also, simplify the alloc_pid's "out_free:" error path a little bit. This
way it looks more clear which subset of pid->numbers[] we are freeing.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc :Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Operations are now a shared const function block as with most other Linux
objects
- Introduce wrappers for some optional functions to get consistent behaviour
- Wrap put_char which used to be patched by the tty layer
- Document which functions are needed/optional
- Make put_char report success/fail
- Cache the driver->ops pointer in the tty as tty->ops
- Remove various surplus lock calls we no longer need
- Remove proc_write method as noted by Alexey Dobriyan
- Introduce some missing sanity checks where certain driver/ldisc
combinations would oops as they didn't check needed methods were present
[akpm@linux-foundation.org: fix fs/compat_ioctl.c build]
[akpm@linux-foundation.org: fix isicom]
[akpm@linux-foundation.org: fix arch/ia64/hp/sim/simserial.c build]
[akpm@linux-foundation.org: fix kgdb]
Signed-off-by: Alan Cox <alan@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Afaics, currently there are no kernel problems with ptracing init, it can't
lose SIGNAL_UNKILLABLE flag and be killed/stopped by accident.
The ability to strace/debug init can be very useful if you try to figure out
why it does not work as expected.
However, admin should know what he does, "gdb /sbin/init 1" stops init, it
can't reap orphaned zombies or take care of /etc/inittab until continued. It
is even possible to crash init (and thus the whole system) if you wish,
ptracer has full control.
See also the long discussion: http://marc.info/?t=120628018600001
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody can block/ignore SIGSTOP, no need to use force_sig_specific() in
ptrace_attach. Use the "regular" send_sig_info().
With this patch stracing of /sbin/init doesn't clear its SIGNAL_UNKILLABLE,
but not that this makes ptracing of init safe.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently __ptrace_unlink() checks list_empty(->ptrace_list) to figure out
whether the child was reparented. Change the code to use ptrace_reparented()
to make this check more explicit and consistent.
No functional changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add another trivial helper for the sake of grep. It also auto-documents the
fact that ->parent != real_parent implies ->ptrace.
No functional changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a couple of small comments, it is not easy to see what this code does.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
exit.c has numerous "->exit_signal == -1" comparisons, this check is subtle
and deserves a helper. Imho makes the code more parseable for humans. At
least it's surely more greppable.
Also, a couple of whitespace cleanups. No functional changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the set_restore_sigmask() inline in <linux/thread_info.h> and
replaces every set_thread_flag(TIF_RESTORE_SIGMASK) with a call to it. No
change, but abstracts the details of the flag protocol from all the calls.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the buggy /sbin/init hangs if SIGSEGV/etc happens. The kernel sends
the signal, init dequeues it and ignores, returns from the exception, repeats
the faulting instruction, and so on forever.
Imho, such a behaviour is not good. I think that the explicit loud death of
the buggy /sbin/init is better than the silent hang.
Change force_sig_info() to clear SIGNAL_UNKILLABLE when the task should be
really killed.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global init has a lot of long standing problems with the unhandled fatal
signals.
- The "is_global_init(current)" check in get_signal_to_deliver()
protects only the main thread. Sub-thread can dequee the fatal
signal and shutdown the whole thread group except the main thread.
If it dequeues SIGSTOP /sbin/init will be stopped, this is not
right too. Note that we can't use is_global_init(->group_leader),
this breaks exec and this can't solve other problems we have.
- Even if afterwards ignored, the fatal signals sets SIGNAL_GROUP_EXIT
on delivery. This breaks exec, has other bad implications, and this
is just wrong.
Introduce the new SIGNAL_UNKILLABLE flag to fix these problems. It also helps
to solve some other problems addressed by the subsequent patches.
Currently we use this flag for the global init only, but it could also be used
by kthreads and (perhaps) by the sub-namespace inits.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that task_session() can't return a false NULL, check_kill_permission()
doesn't need tasklist_lock.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This wasn't documented, but as Atsushi Tsuji pointed out
check_kill_permission() needs tasklist_lock for task_session_nr(). I missed
this fact when removed tasklist from the callers.
Change check_kill_permission() to take tasklist_lock for the SIGCONT case.
Re-order security checks so that we take tasklist_lock only if/when it is
actually needed. This is a minimal fix for now, tasklist will be removed
later.
Also change the code to use task_session() instead of task_session_nr().
Also, remove the SIGCONT check from cap_task_kill(), it is bogus (and the
whole function is bogus. Serge, Eric, why it is still alive?).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Atsushi Tsuji <a-tsuji@bk.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
send_signal() shouldn't call signalfd_notify() if it then fails with -EAGAIN.
Harmless, just a paranoid cleanup.
Also remove the comment. It is obsolete, signalfd_notify() was simplified and
does a simple wakeup.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A couple of small comments about how CLD_CONTINUED notification works.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename handle_stop_signal() to prepare_signal(), make it return a boolean, and
move the callsites of sig_ignored() into it.
No functional changes for now. But it would be nice to factor out the "should
we drop this signal" checks as much as possible, before we try to fix the bugs
with the sub-namespace init's signals (actually the global /sbin/init has some
problems with signals too).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the callsite of print_fatal_signal() down, under "if
(sig_kernel_coredump(signr))", so we don't need to check signr != SIGKILL.
We are only interested in the sig_kernel_coredump() signals anyway, and due to
the previous changes we almost never can see other fatal signals here except
SIGKILL.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
handle_stop_signal() clears SIGNAL_STOP_DEQUEUED when sig == SIGKILL. Remove
this nasty special case. It was needed to prevent the race with group stop
and exit caused by thread-specific SIGKILL. Now that we use complete_signal()
for private signals too this is not needed, complete_signal() will notice
SIGKILL and abort the soon-to-begin group stop.
Except: the target thread is dead (has PF_EXITING). But in that case we
should not just clear SIGNAL_STOP_DEQUEUED and nothing more. We should either
kill the whole thread group, or silently ignore the signal.
I suspect we are not right wrt zombie leaders, but this is another issue which
and should be fixed separately. Note that this check can't abort the group
stop if it was already started/finished, this check only adds a subtle side
effect if we race with the thread which has already dequeued sig_kernel_stop()
signal and temporary released ->siglock.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We export send_sigqueue() and send_group_sigqueue() for the only user,
posix_timer_event(). This is a bit silly, because both are just trivial
helpers on top of do_send_sigqueue() and because the we pass the unused
.si_signo parameter.
Kill them both, rename do_send_sigqueue() to send_sigqueue(), and export it.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested by Pavel Emelyanov.
send_sigqueue/send_group_sigqueue are only differ in how they lock ->siglock.
Unify them. send_group_sigqueue() uses spin_lock() because it knows the task
can't exit, but in that case lock_task_sighand() can't fail and doesn't hurt.
Note that the "sig" argument is ignored, it is always equal to ->si_signo.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Factor out complete_signal() callsites. This change completely unifies the
helpers sending the specific/group signals.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on Pavel Emelyanov's suggestion.
Rename __group_complete_signal() to complete_signal() and use it to process
the specific signals too. To do this we simply add the "int group" argument.
This allows us to greatly simply the signal-sending code and adds a useful
behaviour change. We can avoid the unneeded wakeups for the private signals
because wants_signal() is more clever than sigismember(blocked), but more
importantly we now take into account the fatal specific signals too.
The latter allows us to kill some subtle checks in handle_stop_signal() and
makes the specific/group signal's behaviour more consistent. For example,
currently sigtimedwait(FATAL_SIGNAL) behaves differently depending on was the
signal sent by kill() or tkill() if the signal was not blocked.
And. This allows us to tweak/fix the behaviour when the specific signal is
sent to the dying/dead ->group_leader.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
send_signal() is used either with ->pending or with ->signal->shared_pending.
Change it to take "int group" instead, this argument will be re-used later.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the unchanged definition of __group_complete_signal() so that send_signal
can see it. To simplify the reading of the next patches.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested by Roland McGrath.
Initialize signal->curr_target in copy_signal(). This way ->curr_target is
never == NULL, we can kill the check in __group_complete_signal's hot path.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment in send_sig_info() is wrong, tasklist_lock can't help.
The caller must ensure the task can't go away, otherwise ->sighand can be NULL
even before we take the lock.
p->sighand could be changed by exec(), but I can't imagine how it is possible
to prevent exit(), but not exec().
Since the things seem to work, I assume all callers are correct. However,
drm_vbl_send_signals() looks broken. block_all_signals() which is solely used
by drm is definitely broken.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert do_tkill() to use rcu_read_lock() + lock_task_sighand() to avoid
taking tasklist lock.
Note that we don't return an error if lock_task_sighand() fails, we pretend
the task dies after receiving the signal. Otherwise, we should fight with the
nasty races with mt-exec without having any advantage.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move handle_stop_signal() into send_signal(). This factors out a couple of
callsites and allows us to do further unifications.
Also, with this change specific_send_sig_info() does handle_stop_signal().
Not that this is really important, we never send STOP/CONT via send_sig() and
friends, but still this looks more consistent.
The only (afaics) special case is get_signal_to_deliver(). If the traced task
dequeues SIGCONT, it can re-send it to itself after ptrace_stop() if the
signal was blocked by debugger. In that case handle_stop_signal() is
unnecessary, but hopefully not a problem.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
handle_stop_signal() was changed, now send_group_sigqueue() doesn't need
tasklist_lock.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cosmetic, cache p->signal to make the code a bit more readable.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
send_group_sigqueue() calls handle_stop_signal(), send_sigqueue() doesn't.
This is not consistent and in fact I'd say this is (minor) bug.
Move handle_stop_signal() from send_group_sigqueue() to do_send_sigqueue(),
the latter is called by send_sigqueue() too.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
lock_task_sighand() was changed, send_sigqueue() doesn't need rcu_read_lock()
any longer.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cache the values of current->signal/sighand. Shrinks .text a bit and makes
the code more readable. Also, remove "sigset_t *mask", it is pointless
because in fact we save the constant offset.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cache the value of p->signal, and change the code to use while_each_thread()
helper.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that handle_stop_signal() doesn't drop ->siglock, we can't see both
->group_stop_count && SIGNAL_STOP_STOPPED. Merge two "if" branches.
As Roland pointed out, we never actually needed 2 do_notify_parent_cldstop()
calls.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Previously handle_stop_signal(SIGCONT) could drop ->siglock. That is why
kill_pid_info(SIGCONT) takes tasklist_lock to make sure the target task can't
go away after unlock. Not needed now.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on discussion with Jiri and Roland.
In short: currently handle_stop_signal(SIGCONT, p) sends the notification to
p->parent, with this patch p itself notifies its parent when it becomes
running.
handle_stop_signal(SIGCONT) has to drop ->siglock temporary in order to notify
the parent with do_notify_parent_cldstop(). This leads to multiple problems:
- as Jiri Kosina pointed out, the stopped task can resume without
actually seeing SIGCONT which may have a handler.
- we race with another sig_kernel_stop() signal which may come in
that window.
- we race with sig_fatal() signals which may set SIGNAL_GROUP_EXIT
in that window.
- we can't avoid taking tasklist_lock() while sending SIGCONT.
With this patch handle_stop_signal() just sets the new SIGNAL_CLD_CONTINUED
flag in p->signal->flags and returns. The notification is sent by the first
task which returns from finish_stop() (there should be at least one) or any
other signalled thread from get_signal_to_deliver().
This is a user-visible change. Say, currently kill(SIGCONT, stopped_child)
can't return without seeing SIGCHLD, with this patch SIGCHLD can be delayed
unpredictably. Another difference is that if the child is ptraced by another
process, CLD_CONTINUED may be delivered to ->real_parent after ptrace_detach()
while currently it always goes to the tracer which doesn't actually need this
notification. Hopefully not a problem.
The patch asks for the futher obvious cleanups, I'll send them separately.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every implementation of ->task_kill() does nothing when the signal comes from
the kernel. This is correct, but means that check_kill_permission() should
call security_task_kill() only for SI_FROMUSER() case, and we can remove the
same check from ->task_kill() implementations.
(sadly, check_kill_permission() is the last user of signal->session/__session
but we can't s/task_session_nr/task_session/ here).
NOTE: Eric W. Biederman pointed out cap_task_kill() should die, and I think
he is very right.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: David Quigley <dpquigl@tycho.nsa.gov>
Cc: Eric Paris <eparis@redhat.com>
Cc: Harald Welte <laforge@gnumonks.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both functions do the same thing after proper locking, but with
different sigpending structs, so move the common code into a helper.
After this we have 4 places that look very similar: send_sigqueue: calls
do_send_sigqueue and signal_wakeup send_group_sigqueue: calls
do_send_sigqueue and __group_complete_signal __group_send_sig_info:
calls send_signal and __group_complete_signal specific_send_sig_info:
calls send_signal and signal_wakeup
Besides, send_signal performs actions similar to do_send_sigqueue's
and __group_complete_signal - to signal_wakeup.
It looks like they can be consolidated gracefully.
Oleg said:
Personally, I think this change is very good. But send_sigqueue() and
send_group_sigqueue() have a very subtle difference which I was never able
to understand.
Let's suppose that sigqueue is already queued, and the signal is ignored
(the latter means we should re-schedule cpu timer or handle overrruns). In
that case send_sigqueue() returns 0, but send_group_sigqueue() returns 1.
I think this is not the problem (in fact, I think this patch makes the
behaviour more correct), but I hope Thomas can take a look and confirm.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The signr variable may be declared without initialization - it is set ro the
return value from __dequeue_signal() right at the function beginning.
Besides, after recalc_sigpending() two checks for signr to be not 0 may be
merged into one. Both if-s become easier to read.
Thanks to Oleg for pointing out mistakes in the first version of this patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both sig_ignored() and do_sigaction() check for signr to be explicitly or
implicitly ignored. Introduce a helper for them.
This patch is aimed to help handling signals by pid namespace's init, and was
derived from one of Oleg's patches
https://lists.linux-foundation.org/pipermail/containers/2007-December/009308.html
so, if he doesn't mind, he should be considered as an author.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just a trivial example, more to come.
k_getrusage() holds rcu_read_lock() because it was previously required by
lock_task_sighand(). Unneeded now.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most of the callers of lock_task_sighand() doesn't actually need rcu_lock().
lock_task_sighand() needs it only to safely play with tsk->sighand, it can
take the lock itself.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_group_exit() checks SIGNAL_GROUP_EXIT to avoid taking sighand->siglock.
Since ed5d2cac11 exec() doesn't set this
flag, we should use signal_group_exit().
This is not needed for correctness, but can speedup the multithreaded exec
and makes the code more consistent.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_signal_stop() needs signal_group_exit() but checks sig->group_exit_task.
This (optimization) is correct, SIGNAL_STOP_DEQUEUED and SIGNAL_GROUP_EXIT
are mutually exclusive, but looks confusing. Use signal_group_exit(), this
is not fastpath, the code clarity is more important.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two callers for send_signal() - the specific_send_sig_info and the
__group_send_sig_info - both check for sig to be ignored or already queued.
Move these checks into send_signal() and make it return 1 to indicate that the
signal is dropped, but there's no error in this.
Besides, merge comments and spell-check them.
[oleg@tv-sign.ru: simplifications]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes the code more readable, due to less brackets and small letters in
name.
I also move it above the send_signal() as a preparation for the 3rd patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function doesn't change the ret's value and thus always returns 0, with a
single exception of returning -EAGAIN explicitly.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'audit.b50' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
[PATCH] new predicate - AUDIT_FILETYPE
[patch 2/2] Use find_task_by_vpid in audit code
[patch 1/2] audit: let userspace fully control TTY input auditing
[PATCH 2/2] audit: fix sparse shadowed variable warnings
[PATCH 1/2] audit: move extern declarations to audit.h
Audit: MAINTAINERS update
Audit: increase the maximum length of the key field
Audit: standardize string audit interfaces
Audit: stop deadlock from signals under load
Audit: save audit_backlog_limit audit messages in case auditd comes back
Audit: collect sessionid in netlink messages
Audit: end printk with newline
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Skip I/O merges when disabled
block: add large command support
block: replace sizeof(rq->cmd) with BLK_MAX_CDB
ide: use blk_rq_init() to initialize the request
block: use blk_rq_init() to initialize the request
block: rename and export rq_init()
block: no need to initialize rq->cmd with blk_get_request
block: no need to initialize rq->cmd in prepare_flush_fn hook
block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
block/elevator.c:elv_rq_merge_ok() mustn't be inline
block: make queue flags non-atomic
block: add dma alignment and padding support to blk_rq_map_kern
unexport blk_max_pfn
ps3disk: Remove superfluous cast
block: make rq_init() do a full memset()
relay: fix splice problem
The same definitions are used for the bounds logic and the asm-offsets.h
generation by kbuild. Put them into include/linux/kbuild.h file.
Also add a new feature
COMMENT("text")
which can be used to insert lines of ocmments into asm-offsets.h and
bounds.h.
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jay Estabrook <jay.estabrook@hp.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Chris Zankel <chris@zankel.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Miles Bader <miles@gnu.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use vmalloc() and memset() instead of kcalloc() to allocate a page* array when
the array size is bigger than one page. This enables relayfs to support
bigger relay buffers than 64MB on 4k-page system, 512MB on 16k-page system.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: David Wilder <dwilder@us.ibm.com>
Reviewed-by: Tom Zanussi <zanussi@comcast.net>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some drivers have duplicated unlikely() macros. IS_ERR() already has
unlikely() in itself.
This patch cleans up such pointless code.
Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading from/writing to some table, a root, which this table came from,
may affect this table's permissions, depending on who is working with the
table.
The core hunk is at the bottom of this patch. All the rest is just pushing
the ctl_table_root argument up to the sysctl_perm() function.
This will be mostly (only?) used in the net sysctls.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The do_sysctl_strategy isn't used outside kernel/sysctl.c, so this can be
static and without a prototype in header.
Besides, move this one and parse_table() above their callers and drop the
forward declarations of the latter call.
One more "besides" - fix two checkpatch warnings: space before a ( and an
extra space at the end of a line.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Disable sysctl_check.c for embedded targets. This saves about about 11 kB
in .text and another 11 kB in .data on a PXA255 embedded platform.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove proc_root export. Creation and removal works well if parent PDE is
supplied as NULL -- it worked always that way.
So, one useless export removed and consistency added, some drivers created
PDEs with &proc_root as parent but removed them as NULL and so on.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel implements readlink of /proc/pid/exe by getting the file from
the first executable VMA. Then the path to the file is reconstructed and
reported as the result.
Because of the VMA walk the code is slightly different on nommu systems.
This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
walking the VMAs to find the first executable file-backed VMA we store a
reference to the exec'd file in the mm_struct.
That reference would prevent the filesystem holding the executable file
from being unmounted even after unmapping the VMAs. So we track the number
of VM_EXECUTABLE VMAs and drop the new reference when the last one is
unmapped. This avoids pinning the mounted filesystem.
[akpm@linux-foundation.org: improve comments]
[yamamoto@valinux.co.jp: fix dup_mmap]
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Howells <dhowells@redhat.com>
Cc:"Eric W. Biederman" <ebiederm@xmission.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the keyring quotas controllable through /proc/sys files:
(*) /proc/sys/kernel/keys/root_maxkeys
/proc/sys/kernel/keys/root_maxbytes
Maximum number of keys that root may have and the maximum total number of
bytes of data that root may have stored in those keys.
(*) /proc/sys/kernel/keys/maxkeys
/proc/sys/kernel/keys/maxbytes
Maximum number of keys that each non-root user may have and the maximum
total number of bytes of data that each of those users may have stored in
their keys.
Also increase the quotas as a number of people have been complaining that it's
not big enough. I'm not sure that it's big enough now either, but on the
other hand, it can now be set in /etc/sysctl.conf.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <kwc@citi.umich.edu>
Cc: <arunsr@cse.iitk.ac.in>
Cc: <dwalsh@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't generate the per-UID user and user session keyrings unless they're
explicitly accessed. This solves a problem during a login process whereby
set*uid() is called before the SELinux PAM module, resulting in the per-UID
keyrings having the wrong security labels.
This also cures the problem of multiple per-UID keyrings sometimes appearing
due to PAM modules (including pam_keyinit) setuiding and causing user_structs
to come into and go out of existence whilst the session keyring pins the user
keyring. This is achieved by first searching for extant per-UID keyrings
before inventing new ones.
The serial bound argument is also dropped from find_keyring_by_name() as it's
not currently made use of (setting it to 0 disables the feature).
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: <kwc@citi.umich.edu>
Cc: <arunsr@cse.iitk.ac.in>
Cc: <dwalsh@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CLONE_NEWIPC|CLONE_SYSVSEM interaction isn't handled properly. This can cause
a kernel memory corruption. CLONE_NEWIPC must detach from the existing undo
lists.
Fix, part 3: refuse clone(CLONE_SYSVSEM|CLONE_NEWIPC).
With unshare, specifying CLONE_SYSVSEM means unshare the sysvsem. So it seems
reasonable that CLONE_NEWIPC without CLONE_SYSVSEM would just imply
CLONE_SYSVSEM.
However with clone, specifying CLONE_SYSVSEM means *share* the sysvsem. So
calling clone(CLONE_SYSVSEM|CLONE_NEWIPC) is explicitly asking for something
we can't allow. So return -EINVAL in that case.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can
cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing
undo lists.
Fix, part 2: perform an implicit CLONE_SYSVSEM in CLONE_NEWIPC. CLONE_NEWIPC
creates a new IPC namespace, the task cannot access the existing semaphore
arrays after the unshare syscall. Thus the task can/must detach from the
existing undo list entries, too.
This fixes the kernel corruption, because it makes it impossible that
undo records from two different namespaces are in sysvsem.undo_list.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can
cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing
undo lists.
Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM)
The original reason to not support it was the potential (inevitable?)
confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the
inverse meaning of clone(CLONE_SYSVSEM).
Our two most reasonable options then appear to be (1) fully support
CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM,
but always do it anyway on unshare(CLONE_SYSVSEM). This patch does
(1).
Changelog:
Apr 16: SEH: switch to Manfred's alternative patch which
removes the unshare_semundo() function which
always refused CLONE_SYSVSEM.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The enhancement as asked for by Yasunori: if msgmni is set to a negative
value, register it back into the ipcns notifier chain.
A new interface has been added to the notification mechanism:
notifier_chain_cond_register() registers a notifier block only if not already
registered. With that new interface we avoid taking care of the states
changes in procfs.
Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpu_hotplug_begin() must be always called under cpu_add_remove_lock, this
means that only one process can be cpu_hotplug.active_writer. So we don't
need the cpu_hotplug.writer_queue, we can wake up the ->active_writer
directly.
Also, fix the comment.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Acked-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cleanup_workqueue_thread() doesn't need the second argument, remove it.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When cpu_populated_map was introduced, it was supposed that cwq->thread can
survive after CPU_DEAD, that is why we never shrink cpu_populated_map.
This is not very nice, we can safely remove the already dead CPU from the map.
The only required change is that destroy_workqueue() must hold the hotplug
lock until it destroys all cwq->thread's, to protect the cpu_populated_map.
We could make the local copy of cpu mask and drop the lock, but
sizeof(cpumask_t) may be very large.
Also, fix the comment near queue_work(). Unless _cpu_down() happens we do
guarantee the cpu-affinity of the work_struct, and we have users which rely on
this.
[akpm@linux-foundation.org: repair comment]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This flag provides the hardwalling properties of mem_exclusive, without
enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient
to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
nodes.
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the cpusets mem_exclusive flag is overloaded to mean both
"no-overlapping" and "no GFP_KERNEL allocations outside this cpuset".
These patches add a new mem_hardwall flag with just the allocation restriction
part of the mem_exclusive semantics, without breaking backwards-compatibility
for those who continue to use just mem_exclusive. Additionally, the cgroup
control file registration for cpusets is cleaned up to reduce boilerplate.
This patch:
This change tidies up the cpusets control file definitions, and reduces the
amount of boilerplate required to add/change control files in the future.
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the following needlessly global functions static:
- cpuset_test_cpumask()
- cpuset_change_cpumask()
- cpuset_do_move_task()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This field is the maximal value of the usage one since the counter creation
(or since the latest reset).
To reset this to the usage value simply write anything to the appropriate
cgroup file.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the mem_cgroup member from mm_struct and instead adds an owner.
This approach was suggested by Paul Menage. The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined. It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.
A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.
This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.
I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.
This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.
After the thread group leader exits, it's moved to init_css_state by
cgroup_exit(), thus all future charges from runnings threads would be
redirected to the init_css_set's subsystem.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>,
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a read_seq() helper in cftype, which uses seq_file to print out
lists. Use it in the devices cgroup. Also split devices.allow into two
files, so now devices.deny and devices.allow are the ones to use to manipulate
the whitelist, while devices.list outputs the cgroup's current whitelist.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now we can run through the hash table instead of running through the
linked-list.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are at system boot and there is only 1 cgroup group (i,e, init_css_set), so
we don't need to run through the css_set linked list. Neither do we need to
run through the task list, since no processes have been created yet.
Also referring to a comment in cgroup.h:
struct css_set
{
...
/*
* Set of subsystem states, one for each subsystem. This array
* is immutable after creation apart from the init_css_set
* during subsystem registration (at boot time).
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
}
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we attach a process to a different cgroup, the css_set linked-list will
be run through to find a suitable existing css_set to use. This patch
implements a hash table for better performance.
The following benchmarks have been tested:
For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping
task in each, and then move an additional task through each cgroup in
turn.
Here is a test result:
N Loop orig - Time(s) hash - Time(s)
----------------------------------------------
1 10000 1.201231728 1.196311177
5 2000 1.065743872 1.040566424
10 1000 0.991054735 0.986876440
50 200 0.976554203 0.969608733
100 100 0.998504680 0.969218270
500 20 1.157347764 0.962602963
1000 10 1.619521852 1.085140172
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trigger callback can be used to receive a kick-up from the user space. The
string written is ignored.
The cftype->private is used for multiplexing events.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race between create_proc_entry() and the assignment of file ops.
proc_create() is invented to fix it.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is called by cgroup_init() and cgroup_init_early() only, which are
annotated with __init.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes some filesystem boilerplate from the CFS cgroup subsystem.
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These patches add cgroups read_s64 and write_s64 control file methods (the
signed equivalent of read_u64/write_u64) and use them to implement the
cpu.rt_runtime_us control file in the CFS cgroup subsystem.
This patch:
These are the signed equivalents of the read_u64/write_u64 methods
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "releasable" control file provided by the cgroup framework exports the
state of a per-cgroup flag that's related to the notify-on-release feature.
This isn't really generally useful, unless you're trying to debug this
particular feature of cgroups.
This patch moves the "releasable" file to the cgroup_debug subsystem.
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds a new type of supported control file representation, a map from strings
to u64 values.
Each map entry is printed as a line in a similar format to /proc/vmstat, i.e.
"$key $value\n"
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many of the cpusets control files are simple integer values, which don't
require the overhead of memory allocations for reads and writes.
Move the handlers for these control files into cpuset_read_u64() and
cpuset_write_u64().
[akpm@linux-foundation.org: ad dmissing `break']
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the need for people to remember to pass the -n flag to echo when
writing values to cgroup control files.
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds a function for returning the value of a resource counter member, in a
form suitable for use in a cgroup read_u64 control file method.
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several people have justifiably complained that the "_uint" suffix is
inappropriate for functions that handle u64 values, so this patch just renames
all these functions and their users to have the suffic _u64.
[peterz@infradead.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every file should include the headers containing the externs its global
functions (in this case for ns_cgroup_clone()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a code warning: symbol 'p' shadows an earlier one
This is a reincarnation of Harvey Harrison's patch:
cpuset: sparse warnings in cpuset.c
Independently, Cliff Wickman moved the affected code,
from kernel/cpuset.c to kernel/cgroup.c, in his patch:
cpusets: update_cpumask revision
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Cliff Wickman <cpw@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the needlessly global cgroup_enable_task_cg_lists() static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make eCryptfs key module subsystem respect namespaces.
Since I will be removing the netlink interface in a future patch, I just made
changes to the netlink.c code so that it will not break the build. With my
recent patches, the kernel module currently defaults to the device handle
interface rather than the netlink interface.
[akpm@linux-foundation.org: export free_user_ns()]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to the rcupreempt.h WARN_ON trigged, I got 2G syslog file. For some
serious complaining of kernel, we need repeat the warnings, so here I isolate
the ratelimit part of printk.c to a standalone file.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following an experimental deletion of the unnecessary directive
#include <linux/slab.h>
from the header file <linux/percpu.h>, these files under kernel/ were exposed
as needing to include one of <linux/slab.h> or <linux/gfp.h>, so explicit
includes were added where necessary.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From the POV of synchronization, there should be no need to call
wake_up_process() with the 'kthread_create_lock' being held.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warnings:
WARNING: vmlinux.o(.text+0xc60): Section mismatch in reference from the function kvm_init() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0x33869a): Section mismatch in reference from the function xfs_icsb_init_counters() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0x5556a1): Section mismatch in reference from the function acpi_processor_install_hotplug_notify() to the function .cpuinit.text:register_cpu_notifier()
WARNING: vmlinux.o(.text+0xfe6b28): Section mismatch in reference from the function cpufreq_register_driver() to the function .cpuinit.text:register_cpu_notifier()
register_cpu_notifier() are only really defined when HOTPLUG_CPU is enabled.
So references to the function are OK.
Annotate it with __ref so we do not get warnings from callers and do not get
warnings for the functions/data used by register_cpu_notifier().
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warnings:
WARNING: vmlinux.o(.text+0x75c8d): Section mismatch in reference from the function take_cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75d2a): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75d4d): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75de4): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
WARNING: vmlinux.o(.text+0x75e33): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain
cpu_down is only used from code surrounded by HOTPLUG_CPU so any references to
__cpuinit is OK.
Add a few __ref to tech modpost to ignore the references.
This is just papering over the fact that the cpu hotplug code is fragile with
respect to use of HOTPLUG_CPU and in many cases rely on __cpuinit to get rid
of code when HOTPLUG_CPU is not enabled. For now this is the least invasive
change.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warning:
WARNING: vmlinux.o(.text+0x75f4e): Section mismatch in reference from the function unregister_cpu_notifier() to the variable .cpuinit.data:cpu_chain
We know that unregister_cpu_notifier is using HOTPLUG_CPU
stuff - so ignore these references.
Annotating unregister_cpu_notifier had been another option
but this caused far more warnings since not all callers were
annotated __cpuinit.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add the RUSAGE_THREAD option for the getrusage system call. This is
essentially Roland's patch from http://lkml.org/lkml/2008/1/18/589, but the
line about RUSAGE_LWP line has been removed, as suggested by Ulrich and
Christoph.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel is sent to tainted within the warn_on_slowpath() function, and
whenever a warning occurs the new taint flag 'W' is set. This is useful to
know if a warning occurred before a BUG by preserving the warning as a flag
in the taint state.
This does not work on architectures where WARN_ON has its own definition.
These archs are:
1. s390
2. superh
3. avr32
4. parisc
The maintainers of these architectures have been added in the Cc: list
in this email to alert them to the situation.
The documentation in oops-tracing.txt has been updated to include the
new flag.
Signed-off-by: Nur Hussein <nurhussein@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/cpuset.c:1268:52: warning: Using plain integer as NULL pointer
kernel/pid_namespace.c:95:24: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Reviewed-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My recent additions to compat_ptrace_request made it mandatory
for CONFIG_COMPAT arch's to define copy_siginfo_from_user32.
This broke some builds, though they all really should get cleaned
up in that way.
Since all the arch's that actually call compat_ptrace_request have
now been cleaned up to use the generic compat_sys_ptrace, we can
avoid the build problems on the crufty arch's by changing the
conditionals on the definition.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The scheduler hrtimer bits in 2.6.25 introduced a circular lock
dependency in a rare code path:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.25-sched-devel.git-x86-latest.git #19
-------------------------------------------------------
X/2980 is trying to acquire lock:
(&rq->rq_lock_key#2){++..}, at: [<ffffffff80230146>] task_rq_lock+0x56/0xa0
but task is already holding lock:
(&cpu_base->lock){++..}, at: [<ffffffff80257ae1>] lock_hrtimer_base+0x31/0x60
which lock already depends on the new lock.
The scenario which leads to this is:
posix-timer signal is delivered
-> posix-timer is rearmed
timer is already expired in hrtimer_enqueue()
-> softirq is raised
To prevent this we need to move the raise of the softirq out of the
base->lock protected code path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Prior to suspend, we allocate and switch to a new VT; after suspend, we switch
back to the original VT. This can be slow, and is completely unnecessary if
the framebuffer we're using can restore video properly.
This adds a hook that allows drivers to select whether or not to do this vt
switch, and changes the gxfb driver to call this hook. It also adds a module
param to gxfb to allow controlling of the vt switch (defaulting to no switch).
(Note: I'm not convinced that console_sem is the best way to protect this, but
we should probably have some form of locking..)
[akpm@linux-foundation.org: build fix]
Signed-off-by: Andres Salomon <dilinger@debian.org>
Cc: Jordan Crouse <jordan.crouse@amd.com>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce unregister_/register_kprobes() for kprobe batch registration. This
can reduce waiting time for synchronized_sched() when a lot of probes have to
be unregistered at once.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prohibit users from probing preempt_schedule(). One way of prohibiting the
user from probing functions is by marking such functions with __kprobes. But
this method doesn't work for those functions, which are already marked to
different section like preempt_schedule() (belongs to __sched section). So we
use blacklist approach to refuse user from probing these functions.
In blacklist approach we populate the blacklisted function's starting address
and its size in kprobe_blacklist structure. Then we verify the user specified
address against start and end of the blacklisted function. So any attempt to
register probe on blacklisted functions will be rejected.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filesystem capability support makes it possible to do away with (set)uid-0
based privilege and use capabilities instead. That is, with filesystem
support for capabilities but without this present patch, it is (conceptually)
possible to manage a system with capabilities alone and never need to obtain
privilege via (set)uid-0.
Of course, conceptually isn't quite the same as currently possible since few
user applications, certainly not enough to run a viable system, are currently
prepared to leverage capabilities to exercise privilege. Further, many
applications exist that may never get upgraded in this way, and the kernel
will continue to want to support their setuid-0 base privilege needs.
Where pure-capability applications evolve and replace setuid-0 binaries, it is
desirable that there be a mechanisms by which they can contain their
privilege. In addition to leveraging the per-process bounding and inheritable
sets, this should include suppressing the privilege of the uid-0 superuser
from the process' tree of children.
The feature added by this patch can be leveraged to suppress the privilege
associated with (set)uid-0. This suppression requires CAP_SETPCAP to
initiate, and only immediately affects the 'current' process (it is inherited
through fork()/exec()). This reimplementation differs significantly from the
historical support for securebits which was system-wide, unwieldy and which
has ultimately withered to a dead relic in the source of the modern kernel.
With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
all legacy privilege (through uid=0) for itself and all subsequently
fork()'d/exec()'d children with:
prctl(PR_SET_SECUREBITS, 0x2f);
This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
enabled at configure time.
[akpm@linux-foundation.org: fix uninitialised var warning]
[serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Paul Moore <paul.moore@hp.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch renames mpol_copy() to mpol_dup() because, well, that's what it
does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
existing mempolicy, allocates a new one and copies the contents.
In a later patch, I want to use the name mpol_copy() to copy the contents from
one mempolicy to another like, e.g., strcpy() does for strings.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a change that was requested some time ago by Mel Gorman. Makes sense
to me, so here it is.
Note: I retain the name "mpol_free_shared_policy()" because it actually does
free the shared_policy, which is NOT a reference counted object. However, ...
The mempolicy object[s] referenced by the shared_policy are reference counted,
so mpol_put() is used to release the reference held by the shared_policy. The
mempolicy might not be freed at this time, because some task attached to the
shared object associated with the shared policy may be in the process of
allocating a page based on the mempolicy. In that case, the task performing
the allocation will hold a reference on the mempolicy, obtained via
mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
holding such a reference have called mpol_put() for the mempolicy.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add some values of page flags to the vmcoreinfo data.
The vmcoreinfo data has the minimum debugging information only for dump
filtering. makedumpfile (dump filtering command) gets it to distinguish
unnecessary pages, and makedumpfile creates a small dumpfile.
An old makedumpfile (v1.2.4 or before) had assumed some values of page flags
internally, and this implementation could not follow the change of these
values. For example, Christoph Lameter is changing these values by the
follwing patch: http://lkml.org/lkml/2008/2/29/463
So a new makedumpfile (v1.2.5) came to need these values and I created this
patch to let the kernel output them.
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was used to compensate because MAX_NR_ZONES was not available to the
#ifdefs. Export MAX_NR_ZONES via the new mechanism and get rid of
__ZONE_COUNT.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NR_PAGEFLAGS specifies the number of page flags we are using. From that we
can calculate the number of bits leftover that can be used for zone, node (and
maybe the sections id). There is no need anymore for FLAGS_RESERVED if we use
NR_PAGEFLAGS.
Use the new methods to make NR_PAGEFLAGS available via the preprocessor.
NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields.
These field widths have to be available to the preprocessor.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Miller <davem@davemloft.net>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The use of enums create constants that are not available to the preprocessor
when building the kernel (f.e. MAX_NR_ZONES).
Arch code already has a way to export constants calculated to the preprocessor
through the asm-offsets.c file. Generate something similar for the core
kernel through kbuild.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy. As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages() that
takes a nodemask for further filtering. This eliminates the need for
MPOL_BIND to create a custom zonelist.
A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist. I.e., pages will be allocated from the closest allowed node with
available memory.
[Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filtering zonelists requires very frequent use of zone_idx(). This is costly
as it involves a lookup of another structure and a substraction operation. As
the zone_idx is often required, it should be quickly accessible. The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.
This patch introduces a struct zoneref to store a zone pointer and a zone
index. The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary. Helpers are given for accessing the zone index as
well as the node index.
[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Argument is S_IF... | <index>, where index is normally 0 or 1.
Triggers if chosen element of ctx->names[] is present and the
mode of object in question matches the upper bits of argument.
I.e. for things like "is the argument of that chmod a directory",
etc.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The pid to lookup a task by is passed inside audit code via netlink message.
Thanks to Denis Lunev, netlink packets are now (since 2.6.24) _always_
processed in the context of the sending task. So this is correct to lookup
the task with find_task_by_vpid() here.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use msglen as the identifier.
kernel/audit.c:724:10: warning: symbol 'len' shadows an earlier one
kernel/audit.c:575:8: originally declared here
Don't use ino_f to check the inode field at the end of the functions.
kernel/auditfilter.c:429:22: warning: symbol 'f' shadows an earlier one
kernel/auditfilter.c:420:21: originally declared here
kernel/auditfilter.c:542:22: warning: symbol 'f' shadows an earlier one
kernel/auditfilter.c:529:21: originally declared here
i always used as a counter for a for loop and initialized to zero before
use. Eliminate the inner i variables.
kernel/auditsc.c:1295:8: warning: symbol 'i' shadows an earlier one
kernel/auditsc.c:1152:6: originally declared here
kernel/auditsc.c:1320:7: warning: symbol 'i' shadows an earlier one
kernel/auditsc.c:1152:6: originally declared here
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Leave audit_sig_{uid|pid|sid} protected by #ifdef CONFIG_AUDITSYSCALL.
Noticed by sparse:
kernel/audit.c:73:6: warning: symbol 'audit_ever_enabled' was not declared. Should it be static?
kernel/audit.c💯8: warning: symbol 'audit_sig_uid' was not declared. Should it be static?
kernel/audit.c:101:8: warning: symbol 'audit_sig_pid' was not declared. Should it be static?
kernel/audit.c:102:6: warning: symbol 'audit_sig_sid' was not declared. Should it be static?
kernel/audit.c:117:23: warning: symbol 'audit_ih' was not declared. Should it be static?
kernel/auditfilter.c:78:18: warning: symbol 'audit_filter_list' was not declared. Should it be static?
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch standardized the string auditing interfaces. No userspace
changes will be visible and this is all just cleanup and consistancy
work. We have the following string audit interfaces to use:
void audit_log_n_hex(struct audit_buffer *ab, const unsigned char *buf, size_t len);
void audit_log_n_string(struct audit_buffer *ab, const char *buf, size_t n);
void audit_log_string(struct audit_buffer *ab, const char *buf);
void audit_log_n_untrustedstring(struct audit_buffer *ab, const char *string, size_t n);
void audit_log_untrustedstring(struct audit_buffer *ab, const char *string);
This may be the first step to possibly fixing some of the issues that
people have with the string output from the kernel audit system. But we
still don't have an agreed upon solution to that problem.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A deadlock is possible between kauditd and auditd under load if auditd
receives a signal. When auditd receives a signal it sends a netlink
message to the kernel asking for information about the sender of the
signal. In that same context the audit system will attempt to send a
netlink message back to the userspace auditd. If kauditd has already
filled the socket buffer (see netlink_attachskb()) auditd will now put
itself to sleep waiting for room to send the message. Since auditd is
responsible for draining that socket we have a deadlock. The fix, since
the response from the kernel does not need to be synchronous is to send
the signal information back to auditd in a separate thread. And thus
auditd can continue to drain the audit queue normally.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch causes the kernel audit subsystem to store up to
audit_backlog_limit messages for use by auditd if it ever appears
sometime in the future in userspace. This is useful to collect audit
messages during bootup and even when auditd is stopped. This is NOT a
reliable mechanism, it does not ever call audit_panic, nor should it.
audit_log_lost()/audit_panic() are called during the normal delivery
mechanism. The messages are still sent to printk/syslog as usual and if
too many messages appear to be queued they will be silently discarded.
I liked doing it by default, but this patch only uses the queue in
question if it was booted with audit=1 or if the kernel was built
enabling audit by default.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Previously I added sessionid output to all audit messages where it was
available but we still didn't know the sessionid of the sender of
netlink messages. This patch adds that information to netlink messages
so we can audit who sent netlink messages.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A couple of audit printk statements did not have a newline.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When using hrtimer with timer->cb_mode == HRTIMER_CB_SOFTIRQ
in some cases the clockevent is not programmed.
This happens, if:
- a timer is rearmed while it's state is HRTIMER_STATE_CALLBACK
- hrtimer_reprogram() returns -ETIME, when it is called after
CALLBACK is finished. This occurs if the new timer->expires
is in the past when CALLBACK is done.
In this case, the timer needs to be removed from the tree and put
onto the pending list again.
The patch is against 2.6.22.5, but AFAICS, it is relevant
for 2.6.25 also (in run_hrtimer_pending()).
Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process has now a page status extended field after every page table.
Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.
This patch has a small common code hit, namely making dup_mm non-static.
Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Arrgghhh...
Sorry about that, I'd been sure I'd folded that one, but it actually got
lost. Please apply - that breaks execve().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* let unshare_files() give caller the displaced files_struct
* don't bother with grabbing reference only to drop it in the
caller if it hadn't been shared in the first place
* in that form unshare_files() is trivially implemented via
unshare_fd(), so we eliminate the duplicate logics in fork.c
* reset_files_struct() is not just only called for current;
it will break the system if somebody ever calls it for anything
else (we can't modify ->files of somebody else). Lose the
task_struct * argument.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* unshare_files() can fail; doing it after irreversible actions is wrong
and de_thread() is certainly irreversible.
* since we do it unconditionally anyway, we might as well do it in do_execve()
and save ourselves the PITA in binfmt handlers, etc.
* while we are at it, binfmt_som actually leaked files_struct on failure.
As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
become unexported.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is no guarantee that there is physical ram below 4GB, and in
fact many boxes don't have exactly that.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix __aggregate_redistribute_shares() related lockup reported by
David S. Miller.
The problem this code tries to solve is 'accurately' calculating the 'fair'
share of the group weight for each cpu. The current code falls back to a global
group rebalance in case the sched_domain's span it looks at has no shares, but
does have tasks.
The reason it gets stuck here, is because its inherently racy - if someone
steals the last task after we compute the agg->rq_weight, but before we
rebalance, we'll never get out of the loop.
We could of course go fix that, but while looking at this issue I found that
this 'fallback' wasn't nearly as rare as I'd hoped it to be. In fact its quite
common - and given it walks the whole machine, thats very bad.
The new approach is simple (why didn't I think of it before?), we set the
aggregate shares to the full task group weight, and each larger sched domain
that encounters an aggregate shares larger than the weight, clips it (it
already re-distributes anyway).
This nicely converges to the desired global picture where the sum of all
shares equals the task group weight.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
David Miller reported:
|--------------->
the following commit:
| commit 27ec440779
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Thu Feb 28 21:00:21 2008 +0100
|
| sched: make cpu_clock() globally synchronous
|
| Alexey Zaytsev reported (and bisected) that the introduction of
| cpu_clock() in printk made the timestamps jump back and forth.
|
| Make cpu_clock() more reliable while still keeping it fast when it's
| called frequently.
|
| Signed-off-by: Ingo Molnar <mingo@elte.hu>
causes watchdog triggers when a cpu exits NOHZ state when it has been
there for >= the soft lockup threshold, for example here are some
messages from a 128 cpu Niagara2 box:
[ 168.106406] BUG: soft lockup - CPU#11 stuck for 128s! [dd:3239]
[ 168.989592] BUG: soft lockup - CPU#21 stuck for 86s! [swapper:0]
[ 168.999587] BUG: soft lockup - CPU#29 stuck for 91s! [make:4511]
[ 168.999615] BUG: soft lockup - CPU#2 stuck for 85s! [swapper:0]
[ 169.020514] BUG: soft lockup - CPU#37 stuck for 91s! [swapper:0]
[ 169.020514] BUG: soft lockup - CPU#45 stuck for 91s! [sh:4515]
[ 169.020515] BUG: soft lockup - CPU#69 stuck for 92s! [swapper:0]
[ 169.020515] BUG: soft lockup - CPU#77 stuck for 92s! [swapper:0]
[ 169.020515] BUG: soft lockup - CPU#61 stuck for 92s! [swapper:0]
[ 169.112554] BUG: soft lockup - CPU#85 stuck for 92s! [swapper:0]
[ 169.112554] BUG: soft lockup - CPU#101 stuck for 92s! [swapper:0]
[ 169.112554] BUG: soft lockup - CPU#109 stuck for 92s! [swapper:0]
[ 169.112554] BUG: soft lockup - CPU#117 stuck for 92s! [swapper:0]
[ 169.171483] BUG: soft lockup - CPU#40 stuck for 80s! [dd:3239]
[ 169.331483] BUG: soft lockup - CPU#13 stuck for 86s! [swapper:0]
[ 169.351500] BUG: soft lockup - CPU#43 stuck for 101s! [dd:3239]
[ 169.531482] BUG: soft lockup - CPU#9 stuck for 129s! [mkdir:4565]
[ 169.595754] BUG: soft lockup - CPU#20 stuck for 93s! [swapper:0]
[ 169.626787] BUG: soft lockup - CPU#52 stuck for 93s! [swapper:0]
[ 169.626787] BUG: soft lockup - CPU#84 stuck for 92s! [swapper:0]
[ 169.636812] BUG: soft lockup - CPU#116 stuck for 94s! [swapper:0]
It's simple enough to trigger this by doing a 10 minute sleep after a
fresh bootup then starting a parallel kernel build.
I suspect this might be reintroducing a problem we've had and fixed
before, see the thread:
http://marc.info/?l=linux-kernel&m=119546414004065&w=2
<---------------|
touch the softlockup watchdog when exiting NOHZ state - we are
obviously not locked up.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A recent change prevents SGI Altix from booting.
This patch fixes the problem.
The regresson was introduced in commit 434d53b00d
Signed-off-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
iwlwifi: Fix built-in compilation of iwlcore
net: Unexport move_addr_to_{kernel,user}
rt2x00: Select LEDS_CLASS.
iwlwifi: Select LEDS_CLASS.
leds: Do not guard NEW_LEDS with HAS_IOMEM
[IPSEC]: Fix catch-22 with algorithm IDs above 31
time: Export set_normalized_timespec.
tcp: Make use of before macro in tcp_input.c
hamradio: Remove unneeded and deprecated cli()/sti() calls in dmascc.c
[NETNS]: Remove empty ->init callback.
[DCCP]: Convert do_gettimeofday() to getnstimeofday().
[NETNS]: Don't initialize err variable twice.
[NETNS]: The ip6_fib_timer can work with garbage on net namespace stop.
[IPV4]: Convert do_gettimeofday() to getnstimeofday().
[IPV4]: Make icmp_sk_init() static.
[IPV6]: Make struct ip6_prohibit_entry_template static.
tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c
[NET]: Expose netdevice dev_id through sysfs
skbuff: fix missing kernel-doc notation
[ROSE]: Fix soft lockup wrt. rose_node_list_lock
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
[PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct()
[PATCH] proc_readfd_common() race fix
[PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe()
[PATCH] teach seq_file to discard entries
[PATCH] umount_tree() will unhash everything itself
[PATCH] get rid of more nameidata passing in namespace.c
[PATCH] switch a bunch of LSM hooks from nameidata to path
[PATCH] lock exclusively in collect_mounts() and drop_collected_mounts()
[PATCH] move a bunch of declarations to fs/internal.h
The only reason to have separated __...() for those was to keep them inlined
for local users in exit.c. Since Alexey removed the inline on those, there's
no reason whatsoever to keep them around; just collapse with normal variants.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add missing kernel-doc in kernel/sched.c:
Warning(linux-2.6.25-git3//kernel/sched.c:7044): No description found for parameter 'span'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sorry I have just realized set_normalized_timespec() (used in
timespec_sub()) is not exported, and link will fail because of it...
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial: (24 commits)
DOC: A couple corrections and clarifications in USB doc.
Generate a slightly more informative error msg for bad HZ
fix typo "is" -> "if" in Makefile
ext*: spelling fix prefered -> preferred
DOCUMENTATION: Use newer DEFINE_SPINLOCK macro in docs.
KEYS: Fix the comment to match the file name in rxrpc-type.h.
RAID: remove trailing space from printk line
DMA engine: typo fixes
Remove unused MAX_NODES_SHIFT
MAINTAINERS: Clarify access to OCFS2 development mailing list.
V4L: Storage class should be before const qualifier (sn9c102)
V4L: Storage class should be before const qualifier
sonypi: Storage class should be before const qualifier
intel_menlow: Storage class should be before const qualifier
DVB: Storage class should be before const qualifier
arm: Storage class should be before const qualifier
ALSA: Storage class should be before const qualifier
acpi: Storage class should be before const qualifier
firmware_sample_driver.c: fix coding style
MAINTAINERS: Add ati_remote2 driver
...
Fixed up trivial conflicts in firmware_sample_driver.c
This adds support for PTRACE_GETSIGINFO and PTRACE_SETSIGINFO in
compat_ptrace_request. It relies on existing arch definitions for
copy_siginfo_to_user32 and copy_siginfo_from_user32.
On powerpc, this fixes a longstanding regression of 32-bit ptrace
calls on 64-bit kernels vs native calls (64-bit calls or 32-bit
kernels). This can be seen in a 32-bit call using PTRACE_GETSIGINFO
to examine e.g. siginfo_t.si_addr from a signal that sets it.
(This was broken as of 2.6.24 and, I presume, many or all prior versions.)
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
hrtimer: optimize the softirq time optimization
hrtimer: reduce calls to hrtimer_get_softirq_time()
clockevents: fix typo in tick-broadcast.c
jiffies: add time_is_after_jiffies and others which compare with jiffies
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel: (62 commits)
sched: build fix
sched: better rt-group documentation
sched: features fix
sched: /debug/sched_features
sched: add SCHED_FEAT_DEADLINE
sched: debug: show a weight tree
sched: fair: weight calculations
sched: fair-group: de-couple load-balancing from the rb-trees
sched: fair-group scheduling vs latency
sched: rt-group: optimize dequeue_rt_stack
sched: debug: add some debug code to handle the full hierarchy
sched: fair-group: SMP-nice for group scheduling
sched, cpuset: customize sched domains, core
sched, cpuset: customize sched domains, docs
sched: prepatory code movement
sched: rt: multi level group constraints
sched: task_group hierarchy
sched: fix the task_group hierarchy for UID grouping
sched: allow the group scheduler to have multiple levels
sched: mix tasks and groups
...
These are small cleanups all over the tree.
Trivial style and comment changes to
fs/select.c, kernel/signal.c, kernel/stop_machine.c & mm/pdflush.c
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
The previous optimization did not take the case into account where a
clock provides its own softirq_get_time() function.
Check for the availablitiy of the clock get time function first and
then check if we need to retrieve the time for both clocks via
hrtimer_softirq_gettime() to avoid a double evaluation of time in that
case as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It seems that hrtimer_run_queues() is calling hrtimer_get_softirq_time() more
often than it needs to. This can cause frequent contention on systems with
large numbers of processors/cores.
With this patch, hrtimer_run_queues only calls hrtimer_get_softirq_time() if
there is a pending timer in one of the hrtimer bases, and only once.
This also combines hrtimer_run_queues() and the inline run_hrtimer_queue()
into one function.
[ tglx@linutronix.de: coding style ]
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
braodcast -> broadcast
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Done per Linus' request and suggestions. Linus has explained that
better than I'll be able to explain:
On Thu, Mar 27, 2008 at 10:12:10AM -0700, Linus Torvalds wrote:
> Actually, before we go any further, there might be a less intrusive
> alternative: add just a couple of flags to the resource flags field (we
> still have something like 8 unused bits on 32-bit), and use those to
> implement a generic "resource_alignment()" routine.
>
> Two flags would do it:
>
> - IORESOURCE_SIZEALIGN: size indicates alignment (regular PCI device
> resources)
>
> - IORESOURCE_STARTALIGN: start field is alignment (PCI bus resources
> during probing)
>
> and then the case of both flags zero (or both bits set) would actually be
> "invalid", and we would also clear the IORESOURCE_STARTALIGN flag when we
> actually allocate the resource (so that we don't use the "start" field as
> alignment incorrectly when it no longer indicates alignment).
>
> That wouldn't be totally generic, but it would have the nice property of
> automatically at least add sanity checking for that whole "res->start has
> the odd meaning of 'alignment' during probing" and remove the need for a
> new field, and it would allow us to have a generic "resource_alignment()"
> routine that just gets a resource pointer.
Besides, I removed IORESOURCE_BUS_HAS_VGA flag which was unused for ages.
Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gary Hade <garyhade@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
provide a text based interface to the scheduler features; this saves the
'user' from setting bits using decimal arithmetic.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to level the hierarchy, we need to calculate load based on the
root view. That is, each task's load is in the same unit.
A
/ \
B 1
/ \
2 3
To compute 1's load we do:
weight(1)
--------------
rq_weight(A)
To compute 2's load we do:
weight(2) weight(B)
------------ * -----------
rq_weight(B) rw_weight(A)
This yields load fractions in comparable units.
The consequence is that it changes virtual time. We used to have:
time_{i}
vtime_{i} = ------------
weight_{i}
vtime = \Sum vtime_{i} = time / rq_weight.
But with the new way of load calculation we get that vtime equals time.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
De-couple load-balancing from the rb-trees, so that I can change their
organization.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently FAIR_GROUP sched grows the scheduler latency outside of
sysctl_sched_latency, invert this so it stays within.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the group hierarchy can have an arbitrary depth the O(n^2) nature
of RT task dequeues will really hurt. Optimize this by providing space to
store the tree path, so we can walk it the other way.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add some extra debug output so we can get a better overview of the
full hierarchy.
We print the cgroup path after each cfs_rq, so we can see what group
we're looking at.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement SMP nice support for the full group hierarchy.
On each load-balance action, compile a sched_domain wide view of the full
task_group tree. We compute the domain wide view when walking down the
hierarchy, and readjust the weights when walking back up.
After collecting and readjusting the domain wide view, we try to balance the
tasks within the task_groups. The current approach is a naively balance each
task group until we've moved the targeted amount of load.
Inspired by Srivatsa Vaddsgiri's previous code and Abhishek Chandra's H-SMP
paper.
XXX: there will be some numerical issues due to the limited nature of
SCHED_LOAD_SCALE wrt to representing a task_groups influence on the
total weight. When the tree is deep enough, or the task weight small
enough, we'll run out of bits.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Abhishek Chandra <chandra@cs.umn.edu>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for sched-devel/latest]
- Add a new cpuset file, having levels:
sched_relax_domain_level
- Modify partition_sched_domains() and build_sched_domains()
to take attributes parameter passed from cpuset.
- Fill newidle_idx for node domains which currently unused but
might be required if sched_relax_domain_level become higher.
- We can change the default level by boot option 'relax_domain_level='.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the full parent<->child relation thing into task_groups as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
UID grouping doesn't actually have a task_group representing the root of
the task_group tree. Add one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch makes the group scheduler multi hierarchy aware.
[a.p.zijlstra@chello.nl: rt-parts and assorted fixes]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch allows tasks and groups to exist in the same cfs_rq. With this
change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
fairness model where M tasks and N groups exist at the cfs_rq level.
[a.p.zijlstra@chello.nl: rt bits and assorted fixes]
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a new function that accepts a pointer to the "newly allowed cpus"
cpumask argument.
int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
The current set_cpus_allowed() function is modified to use the above
but this does not result in an ABI change. And with some compiler
optimization help, it may not introduce any additional overhead.
Additionally, to enforce the read only nature of the new_mask arg, the
"const" property is migrated to sub-functions called by set_cpus_allowed.
This silences compiler warnings.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the setting of nr_cpu_ids from sched_init() to start_kernel()
so that it's available as early as possible.
Note that an arch has the option of setting it even earlier if need be,
but it should not result in a different value than the setup_nr_cpu_ids()
function.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Remove another cpumask_t variable from stack that was missed in the
last kernel_sched_c updates.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Cleaned up references to cpumask_scnprintf() and added new
cpulist_scnprintf() interfaces where appropriate.
* Fix some small bugs (or code efficiency improvments) for various uses
of cpumask_scnprintf.
* Clean up some checkpatch errors.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Remove empty cpumask_t (and all non-zero/non-null) variables
in SD_*_INIT macros. Use memset(0) to clear. Also, don't
inline the initializer functions to save on stack space in
build_sched_domains().
* Merge change to include/linux/topology.h that uses the new
node_to_cpumask_ptr function in the nr_cpus_node macro into
this patch.
Depends on:
[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Use new node_to_cpumask_ptr. This creates a pointer to the
cpumask for a given node. This definition is in mm patch:
asm-generic-add-node_to_cpumask_ptr-macro.patch
* Use new set_cpus_allowed_ptr function.
Depends on:
[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
[sched-devel]: sched: add new set_cpus_allowed_ptr function
[x86/latest]: x86: add cpus_scnprintf function
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Greg Banks <gnb@melbourne.sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Modify sched_affinity functions to pass cpumask_t variables by reference
instead of by value.
* Use new set_cpus_allowed_ptr function.
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Cc: Paul Jackson <pj@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Modify cpuset_cpus_allowed to return the currently allowed cpuset
via a pointer argument instead of as the function return value.
* Use new set_cpus_allowed_ptr function.
* Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Use new set_cpus_allowed_ptr() function added by previous patch,
which instead of passing the "newly allowed cpus" cpumask_t arg
by value, pass it by pointer:
-int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
+int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
* Modify CPU_MASK_ALL
Depends on:
[sched-devel]: sched: add new set_cpus_allowed_ptr function
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Change fixed size arrays to per_cpu variables or dynamically allocated
arrays in sched_init() and sched_init_smp().
(1) static struct sched_entity *init_sched_entity_p[NR_CPUS];
(1) static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
(1) static struct sched_rt_entity *init_sched_rt_entity_p[NR_CPUS];
(1) static struct rt_rq *init_rt_rq_p[NR_CPUS];
static struct sched_group **sched_group_nodes_bycpu[NR_CPUS];
(1) - these arrays are allocated via alloc_bootmem_low()
* Change sched_domain_debug_one() to use cpulist_scnprintf instead of
cpumask_scnprintf. This reduces the output buffer required and improves
readability when large NR_CPU count machines arrive.
* In sched_create_group() we allocate new arrays based on nr_cpu_ids.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Replace usages of CPU_MASK_NONE, CPU_MASK_ALL, NODE_MASK_NONE,
NODE_MASK_ALL to reduce stack requirements for large NR_CPUS
and MAXNODES counts.
* In some cases, the cpumask variable was initialized but then overwritten
with another value. This is the case for changes like this:
- cpumask_t oldmask = CPU_MASK_ALL;
+ cpumask_t oldmask;
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the schedstats implementation does not allow the statistics
to be reset. This patch aims to allow that.
echo 0 > cpuacct.usage
resets the usage. Any other value is not allowed and returns -EINVAL.
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change the variable names to the common convention for the cpuacct
subsystem.
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I noticed this when looking at an openswan issue. Openswan (ab?)uses the
tasklet API to defer processing of packets in some situations, with one
packet per tasklet_action(). I started noticing sequences of
backwards-ordered sequence numbers coming over the wire, since new tasklets
are always queued at the head of the list but processed sequentially.
Convert it to instead append new entries to the tail of the list. As an
extra bonus, the splicing code in takeover_tasklets() no longer has to
iterate over the list.
Signed-off-by: Olof Johansson <olof@lixom.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the rt group scheduling does a per cpu runtime limit, however
the rt load balancer makes no guarantees about an equal spread of real-
time tasks, just that at any one time, the highest priority tasks run.
Solve this by making the runtime limit a global property by borrowing
excessive runtime from the other cpus once the local limit runs out.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Various SMP balancing algorithms require that the bandwidth period
run in sync.
Possible improvements are moving the rt_bandwidth thing into root_domain
and keeping a span per rt_bandwidth which marks throttled cpus.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Balbir Singh reported:
> 1:mon> t
> [c0000000e7677da0] c000000000067de0 .sys_sched_yield+0x6c/0xbc
> [c0000000e7677e30] c000000000008748 syscall_exit+0x0/0x40
> --- Exception: c01 (System Call) at 00000400001d09e4
> SP (4000664cb10) is in userspace
> 1:mon> r
> cpu 0x1: Vector: 300 (Data Access) at [c0000000e7677aa0]
> pc: c000000000068e50: .yield_task_fair+0x94/0xc4
> lr: c000000000067de0: .sys_sched_yield+0x6c/0xbc
the check that should have avoided that is:
/*
* Are we the only task in the tree?
*/
if (unlikely(rq->load.weight == curr->se.load.weight))
return;
But I guess that overlooks rt tasks, they also increase the load.
So I guess something like this ought to fix it..
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no need to loop any longer when 'same == 0'.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
turn off sync wakeups by default. They are not needed anymore - the
buddy logic should be smart enough to keep the system from
overscheduling.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The wakeup buddy logic didn't use the same wakeup granularity logic as the
wakeup preemption did, this might cause the ->next buddy to be selected past
the point where we would have preempted had the task been a single running
instance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When using CONFIG_NO_HZ, rq->tick_timestamp is not updated every TICK_NSEC.
We check that the number of skipped ticks matches the clock jump seen in
__update_rq_clock().
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kernel/sched.c:506: erreur: implicit declaration of function tick_get_tick_sched
kernel/sched.c:506: erreur: invalid type argument of ->
kernel/sched.c:506: erreur: NOHZ_MODE_INACTIVE undeclared (first use in this function)
kernel/sched.c:506: erreur: (Each undeclared identifier is reported only once
kernel/sched.c:506: erreur: for each function it appears in.)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Alexey Zaytsev reported (and bisected) that the introduction of
cpu_clock() in printk made the timestamps jump back and forth.
Make cpu_clock() more reliable while still keeping it fast when it's
called frequently.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Split the FPU save area from the task struct. This allows easy migration
of FPU context, and it's generally cleaner. It also allows the following
two optimizations:
1) only allocate when the application actually uses FPU, so in the first
lazy FPU trap. This could save memory for non-fpu using apps. Next patch
does this lazy allocation.
2) allocate the right size for the actual cpu rather than 512 bytes always.
Patches enabling xsave/xrstor support (coming shortly) will take advantage
of this.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We already catch most of the TSC problems by sanity checks, but there
is a subtle bug which has been in the code forever. This can cause
time jumps in the range of hours.
This was reported in:
http://lkml.org/lkml/2007/8/23/96
and
http://lkml.org/lkml/2008/3/31/23
I was able to reproduce the problem with a gettimeofday loop test on a
dual core and a quad core machine which both have sychronized
TSCs. The TSCs seems not to be perfectly in sync though, but the
kernel is not able to detect the slight delta in the sync check. Still
there exists an extremly small window where this delta can be observed
with a real big time jump. So far I was only able to reproduce this
with the vsyscall gettimeofday implementation, but in theory this
might be observable with the syscall based version as well.
CPU 0 updates the clock source variables under xtime/vyscall lock and
CPU1, where the TSC is slighty behind CPU0, is reading the time right
after the seqlock was unlocked.
The clocksource reference data was updated with the TSC from CPU0 and
the value which is read from TSC on CPU1 is less than the reference
data. This results in a huge delta value due to the unsigned
subtraction of the TSC value and the reference value. This algorithm
can not be changed due to the support of wrapping clock sources like
pm timer.
The huge delta is converted to nanoseconds and added to xtime, which
is then observable by the caller. The next gettimeofday call on CPU1
will show the correct time again as now the TSC has advanced above the
reference value.
To prevent this TSC specific wreckage we need to compare the TSC value
against the reference value and return the latter when it is larger
than the actual TSC value.
I pondered to mark the TSC unstable when the readout is smaller than
the reference value, but this would render an otherwise good and fast
clocksource unusable without a real good reason.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds prctl commands that make it possible
to deny the execution of timestamp counters in userspace.
If this is not implemented on a specific architecture,
prctl will return -EINVAL.
ned-off-by: Erik Bosman <ejbosman@cs.vu.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Rename the se_str and se_rule audit fields elements to
lsm_str and lsm_rule to avoid confusion.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Setup the new Audit LSM hooks for SELinux.
Remove the now redundant exported SELinux Audit interface.
Audit: Export 'audit_krule' and 'audit_field' to the public
since their internals are needed by the implementation of the
new LSM hook 'audit_rule_known'.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Convert Audit to use the new LSM Audit hooks instead of
the exported SELinux interface.
Basically, use:
security_audit_rule_init
secuirty_audit_rule_free
security_audit_rule_known
security_audit_rule_match
instad of (respectively) :
selinux_audit_rule_init
selinux_audit_rule_free
audit_rule_has_selinux
selinux_audit_rule_match
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Stop using the following exported SELinux interfaces:
selinux_get_inode_sid(inode, sid)
selinux_get_ipc_sid(ipcp, sid)
selinux_get_task_sid(tsk, sid)
selinux_sid_to_string(sid, ctx, len)
kfree(ctx)
and use following generic LSM equivalents respectively:
security_inode_getsecid(inode, secid)
security_ipc_getsecid*(ipcp, secid)
security_task_getsecid(tsk, secid)
security_sid_to_secctx(sid, ctx, len)
security_release_secctx(ctx, len)
Call security_release_secctx only if security_secid_to_secctx
succeeded.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Reviewed-by: Paul Moore <paul.moore@hp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
clocksource: make clocksource watchdog cycle through online CPUs
Documentation: move timer related documentation to a single place
clockevents: optimise tick_nohz_stop_sched_tick() a bit
locking: remove unused double_spin_lock()
hrtimers: simplify lockdep handling
timers: simplify lockdep handling
posix-timers: fix shadowed variables
timer_list: add annotations to workqueue.c
hrtimer: use nanosleep specific restart_block fields
hrtimer: add nanosleep specific restart_block member
* 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc:
Remove DEBUG_SEMAPHORE from Kconfig
Improve semaphore documentation
Simplify semaphore implementation
Add down_timeout and change ACPI to use it
Introduce down_killable()
Generic semaphore implementation
Add semaphore.h to kernel_lock.c
Fix quota.h includes
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (36 commits)
[S390] Remove code duplication from monreader / dcssblk.
[S390] kernel: show last breaking-event-address on oops
[S390] lowcore: Change type of lowcores softirq_pending to __u32.
[S390] zcrypt: Comments and kernel-doc cleanup
[S390] uaccess: Always access the correct address space.
[S390] Fix a lot of sparse warnings.
[S390] Convert s390 to GENERIC_CLOCKEVENTS.
[S390] genirq/clockevents: move irq affinity prototypes/inlines to interrupt.h
[S390] Convert monitor calls to function calls.
[S390] qdio (new feature): enhancing info-retrieval from QDIO-adapters
[S390] replace remaining __FUNCTION__ occurrences
[S390] remove redundant display of free swap space in show_mem()
[S390] qdio: remove outdated developerworks link.
[S390] Add debug_register_mode() function to debug feature API
[S390] crypto: use more descriptive function names for init/exit routines.
[S390] switch sched_clock to store-clock-extended.
[S390] zcrypt: add support for large random numbers
[S390] hw_random: allow rng_dev_read() to return hardware errors.
[S390] Vertical cpu management.
[S390] cpu topology support for s390.
...
This breaks out the ptrace handling from get_signal_to_deliver into a
new subroutine. The actual code there doesn't change, and it gets
inlined into nearly identical compiled code. This makes the function
substantially shorter and thus easier to read, and it nicely isolates
the ptrace magic.
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I ran a test program to fork mass processes and at the same time
'cat /cgroup/tasks', I got the following oops:
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:72!
invalid opcode: 0000 [#1] SMP
Pid: 4178, comm: a.out Not tainted (2.6.25-rc9 #72)
...
Call Trace:
[<c044a5f9>] ? cgroup_exit+0x55/0x94
[<c0427acf>] ? do_exit+0x217/0x5ba
[<c0427ed7>] ? do_group_exit+0.65/0x7c
[<c0427efd>] ? sys_exit_group+0xf/0x11
[<c0404842>] ? syscall_call+0x7/0xb
[<c05e0000>] ? init_cyrix+0x2fa/0x479
...
EIP: [<c04df671>] list_del+0x35/0x53 SS:ESP 0068:ebc7df4
---[ end trace caffb7332252612b ]---
Fixing recursive fault but reboot is needed!
After digging into the code and debugging, I finlly found out a race
situation:
do_exit()
->cgroup_exit()
->if (!list_empty(&tsk->cg_list))
list_del(&tsk->cg_list);
cgroup_iter_start()
->cgroup_enable_task_cg_list()
->list_add(&tsk->cg_list, ..);
In this case the list won't be deleted though the process has exited.
We got two bug reports in the past, which seem to be the same bug as
this one:
http://lkml.org/lkml/2008/3/5/332http://lkml.org/lkml/2007/10/17/224
Actually sometimes I got oops on list_del, sometimes oops on list_add.
And I can change my test program a bit to trigger other oops.
The patch has been tested both on x86_32 and x86_64.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On the ppc 4xx architecture the instruction cache must be flushed as
well as the data cache. This patch just makes it generic for all
architectures where CACHE_FLUSH_IS_SAFE is set to 1.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the problem of protecting the kgdb handle_exception exit
which had an NMI race condition, while trying to restore
normal system operation.
There was a small window after the master processor sets cpu_in_debug
to zero but before it has set kgdb_active to zero where a
non-master processor in an SMP system could receive an NMI and
re-enter the kgdb_wait() loop.
As long as the master processor sets the cpu_in_debug before sending
the cpu roundup the cpu_in_debug variable can also be used to guard
against the race condition.
The kgdb_wait() function no longer needs to check
kgdb_active because it is done in the arch specific code
and handled along with the nmi traps at the low level.
This also allows kgdb_wait() to exit correctly if it was
entered for some unknown reason due to a spurious NMI that
could not be handled by the arch specific code.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kgdb core fixes:
- Check to see that mm->mmap_cache is not null before calling
flush_cache_range(), else on arch=ARM it will cause a fatal
fault.
- Breakpoints should only be restored if they are in the BP_ACTIVE
state.
- Fix a typo in comments to "kgdb_register_io_module"
x86 kgdb fixes:
- Fix the x86 arch handler such that on a kill or detach that the
appropriate cleanup on the single stepping flags gets run.
- Add in the DIE_NMIWATCHDOG call for x86_64
- Touch the nmi watchdog before returning the system to normal
operation after performing any kind of kgdb operation, else
the possibility exists to trigger the watchdog.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix two regressions dealing with the kgdb core.
1) kgdb_skipexception and kgdb_post_primary_code are optional
functions that are only required on archs that need special exception
fixups.
2) The kernel address space scope must be set on any probe_kernel_*
function or archs such as ARCH=arm will not allow access to the kernel
memory space. As an example, it is required to allow the full kernel
address space is when you the kernel debugger to inspect a system
call.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add HW breakpoints into the arch specific portion of x86 kgdb. In the
current x86 kernel.org kernels HW breakpoints are changed out in lazy
fashion because there is no infrastructure around changing them when
changing to a kernel task or entering the kernel mode via a system
call. This lazy approach means that if a user process uses HW
breakpoints the kgdb will loose out. This is an acceptable trade off
because the developer debugging the kernel is assumed to know what is
going on system wide and would be aware of this trade off.
There is a minor bug fix to the kgdb core so as to correctly call the
hw breakpoint functions with a valid value from the enum.
There is also a minor change to the x86_64 startup code when using
early HW breakpoints. When the debugger is connected, the cpu startup
code must not zero out the HW breakpoint registers or you cannot hit
the breakpoints you are interested in, in the first place.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If kgdb does remove a breakpoint that had a problem on the recursion
check, it should also print the address of the breakpoint.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to not trip the clocksource watchdog, kgdb must touch the
clocksource watchdog on the return to normal system run state.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kgdb core code. Handles the protocol and the arch details.
[ mingo@elte.hu: heavily modified, simplified and cleaned up. ]
[ xemul@openvz.org: use find_task_by_pid_ns ]
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Move documentation from semaphore.h to semaphore.c as requested by
Andrew Morton. Also reformat to kernel-doc style and add some more
notes about the implementation.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
By removing the negative values of 'count' and relying on the wait_list to
indicate whether we have any waiters, we can simplify the implementation
by removing the protection against an unlikely race condition. Thanks to
David Howells for his suggestions.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
ACPI currently emulates a timeout for semaphores with calls to
down_trylock and sleep. This produces horrible behaviour in terms of
fairness and excessive wakeups. Now that we have a unified semaphore
implementation, adding a real down_trylock is almost trivial.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Semaphores are no longer performance-critical, so a generic C
implementation is better for maintainability, debuggability and
extensibility. Thanks to Peter Zijlstra for fixing the lockdep
warning. Thanks to Harvey Harrison for pointing out that the
unlikely() was unnecessary.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
This way it checks if the clocks are synchronized between CPUs too.
This might be able to detect slowly drifting TSCs which only
go wrong over longer time.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Call
ts = &per_cpu(tick_cpu_sched, cpu);
and
cpu = smp_processor_id();
once instead of twice.
No functional change done, as changed code runs with local irq off.
Reduces source lines and text size (20bytes on x86_64).
[ akpm@linux-foundation.org: Build fix ]
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In order to avoid the false positive from lockdep, each per-cpu base->lock has
the separate lock class and migrate_hrtimers() uses double_spin_lock().
This is overcomplicated: except for migrate_hrtimers() we never take 2 locks
at once, and migrate_hrtimers() can use spin_lock_nested().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In order to avoid the false positive from lockdep, each per-cpu base->lock has
the separate lock class and migrate_timers() uses double_spin_lock().
This all is overcomplicated: except for migrate_timers() we never take 2 locks
at once, and migrate_timers() can use spin_lock_nested().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix sparse warnings like this:
kernel/posix-cpu-timers.c:1090:25: warning: symbol 't' shadows an earlier one
kernel/posix-cpu-timers.c:1058:21: originally declared here
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add timer list annotations to workqueue.c so we can see the call site
in the timer stats.
Signed-off-by: Pavel Machek <Pavel@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Convert all the nanosleep related users of restart_block to the
new nanosleep specific restart_block fields.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Generic code is not supposed to include irq.h. Replace this include
> by linux/hardirq.h instead and add/replace an include of linux/irq.h
> in asm header files where necessary.
> This change should only matter for architectures that make use of
> GENERIC_CLOCKEVENTS.
> Architectures in question are mips, x86, arm, sh, powerpc, uml and sparc64.
>
> I did some cross compile tests for mips, x86_64, arm, powerpc and sparc64.
> This patch fixes also build breakages caused by the include replacement in
> tick-common.h.
I generally dislike adding optional linux/* includes in asm/* includes -
I'm nervous about this causing include loops.
However, there's a separate point to be discussed here.
That is, what interfaces are expected of every architecture in the kernel.
If generic code wants to be able to set the affinity of interrupts, then
that needs to become part of the interfaces listed in linux/interrupt.h
rather than linux/irq.h.
So what I suggest is this approach instead (against Linus' tree of a
couple of days ago) - we move irq_set_affinity() and irq_can_set_affinity()
to linux/interrupt.h, change the linux/irq.h includes to linux/interrupt.h
and include asm/irq_regs.h where needed (asm/irq_regs.h is supposed to be
rarely used include since not much touches the stacked parent context
registers.)
Build tested on ARM PXA family kernels and ARM's Realview platform
kernels which both use genirq.
[ tglx@linutronix.de: add GENERIC_HARDIRQ dependencies ]
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
When I cleaned up printk() and split up the printk locking logic in
commit 266c2e0abe ("Make printk() console
semaphore accesses sensible") I had incorrectly moved the call to
have_callable_console() outside of the console semaphore.
That was buggy. The console semaphore protects the console_drivers list
that is used by have_callable_console().
Thanks go to Bongani Hlope who saw this as a hang on shutdown and reboot
and bisected the bug to the right commit, and tested this patch. See
http://lkml.org/lkml/2008/4/11/315
Bisected-and-tested-by: Bongani Hlope <bonganilinux@mweb.co.za>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
AFAICT pm_send_all is a nop when noone uses pm_register...
Hmm.. can we just force CONFIG_PM_LEGACY=n, and see what happens?
Or maybe this is better idea? It may break build somewhere, but it
should be easy to fix... (it builds here, i386 and x86-64).
Signed-off-by: Pavel Machek <pavel@suse.cz>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
revert "sched: fix fair sleepers" (e22ecef1d2),
because it is causing audio skipping, see:
http://bugzilla.kernel.org/show_bug.cgi?id=10428
the patch is correct and the real cause of the skipping is not
understood (tracing makes it go away), but time has run out so we'll
revert it and re-try in 2.6.26.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Extend the /proc/<pid>/cgroup file to include the appropriate hierarchy ID on
each line.
Currently this ID isn't really needed since a hierarchy can be completely
identified by the set of subsystems bound to it, but this is likely to change
in the near future in order to support stateless subsystems and
merging/rebinding of subsystems. Getting this change into 2.6.25 reduces the
need for an API change later.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The prevent_tail_call() macro works around the problem of the compiler
clobbering argument words on the stack, which for asmlinkage functions
is the caller's (user's) struct pt_regs. The tail/sibling-call
optimization is not the only way that the compiler can decide to use
stack argument words as scratch space, which we have to prevent.
Other optimizations can do it too.
Until we have new compiler support to make "asmlinkage" binding on the
compiler's own use of the stack argument frame, we have work around all
the manifestations of this issue that crop up.
More cases seem to be prevented by also keeping the incoming argument
variables live at the end of the function. This makes their original
stack slots attractive places to leave those variables, so the compiler
tends not clobber them for something else. It's still no guarantee, but
it handles some observed cases that prevent_tail_call() did not.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The effects of cgroup_disable=foo are:
- foo isn't auto-mounted if you mount all cgroups in a single hierarchy
- foo isn't visible as an individually mountable subsystem
As a result there will only ever be one call to foo->create(), at init time;
all processes will stay in this group, and the group will never be mounted on
a visible hierarchy. Any additional effects (e.g. not allocating metadata)
are up to the foo subsystem.
This doesn't handle early_init subsystems (their "disabled" bit isn't set be,
but it could easily be extended to do so if any of the early_init systems
wanted it - I think it would just involve some nastier parameter processing
since it would occur before the command-line argument parser had been run.
Hugh said:
Ballpark figures, I'm trying to get this question out rather than
processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
to the affected paths, booting with cgroup_disable=memory cuts that back to
1% overhead (due to slightly bigger struct page).
I'm no expert on distros, they may have no interest whatever in
CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
without it, or apply the cgroup_disable=memory patches.
Unix bench's execl test result on x86_64 was
== just after boot without mounting any cgroup fs.==
mem_cgorup=off : Execl Throughput 43.0 3150.1 732.6
mem_cgroup=on : Execl Throughput 43.0 2932.6 682.0
==
[lizf@cn.fujitsu.com: fix boot option parsing]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Markers do not mix well with CONFIG_PREEMPT_RCU because it uses
preempt_disable/enable() and not rcu_read_lock/unlock for minimal
intrusiveness. We would need call_sched and sched_barrier primitives.
Currently, the modification (connection and disconnection) of probes
from markers requires changes to the data structure done in RCU-style :
a new data structure is created, the pointer is changed atomically, a
quiescent state is reached and then the old data structure is freed.
The quiescent state is reached once all the currently running
preempt_disable regions are done running. We use the call_rcu mechanism
to execute kfree() after such quiescent state has been reached.
However, the new CONFIG_PREEMPT_RCU version of call_rcu and rcu_barrier
does not guarantee that all preempt_disable code regions have finished,
hence the race.
The "proper" way to do this is to use rcu_read_lock/unlock, but we don't
want to use it to minimize intrusiveness on the traced system. (we do
not want the marker code to call into much of the OS code, because it
would quickly restrict what can and cannot be instrumented, such as the
scheduler).
The temporary fix, until we get call_rcu_sched and rcu_barrier_sched in
mainline, is to use synchronize_sched before each call_rcu calls, so we
wait for the quiescent state in the system call code path. It will slow
down batch marker enable/disable, but will make sure the race is gone.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Silence two kerneldoc warnings.
Warning(kernel/audit.c:1276): No description found for parameter 'string'
Warning(kernel/audit.c:1276): No description found for parameter 'len'
[also fix a typo for bonus points]
Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The futex init function is called init(). This is a pain in the neck
when debugging when you code dies in ... init :-)
This renames it to futex_init().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
NOHZ: reevaluate idle sleep length after add_timer_on()
clocksource: revert: use init_timer_deferrable for clocksource_watchdog
relay doesn't reference the pages it adds, however we need a non-NULL
hook or splice_to_pipe() can oops.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
I found that relay files can be read by pread(2). I fix it,
for relay files are not capable of seeking.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
add_timer_on() can add a timer on a CPU which is currently in a long
idle sleep, but the timer wheel is not reevaluated by the nohz code on
that CPU. So a timer can be delayed for quite a long time. This
triggered a false positive in the clocksource watchdog code.
To avoid this we need to wake up the idle CPU and enforce the
reevaluation of the timer wheel for the next timer event.
Add a function, which checks a given CPU for idle state, marks the
idle task with NEED_RESCHED and sends a reschedule IPI to notify the
other CPU of the change in the timer wheel.
Call this function from add_timer_on().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
--
include/linux/sched.h | 6 ++++++
kernel/sched.c | 43 +++++++++++++++++++++++++++++++++++++++++++
kernel/timer.c | 10 +++++++++-
3 files changed, 58 insertions(+), 1 deletion(-)
Revert
commit 1077f5a917
Author: Parag Warudkar <parag.warudkar@gmail.com>
Date: Wed Jan 30 13:30:01 2008 +0100
clocksource.c: use init_timer_deferrable for clocksource_watchdog
clocksource_watchdog can use a deferrable timer - reduces wakeups from
idle per second.
The watchdog timer needs to run with the specified interval. Otherwise
it will miss the possible wrap of the watchdog clocksource.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
The printk() logic on when/how to get the console semaphore was
unreadable, this splits the code up into a few helper functions and
makes it easier to follow what is going on.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case we're accounting from a sub-namespace, the tgids reported will not
refer to the right namespace.
Save the pid_namespace we're accounting in on the acct_glbs and use it in
do_acct_process.
Two less :) places using the task_struct.tgid member.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is minor, but dereferencing even current real_parent is not safe on debug
kernels, since the memory, this points to, can be unmapped - RCU protection is
required.
Besides, the tgid field is deprecated and is to be replaced with task_tgid_xxx
call (the 2nd patch), so RCU will be required anyway.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Paul pointed out, the ACCESS_ONCE are not needed because we already have
the explicit surrounding memory barriers.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Adrian Bunk <adrian.bunk@movial.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add comments requested by Andrew.
Updated comments about synchronize_sched(). Since we use call_rcu and
rcu_barrier now, these comments were out of sync with the code.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Adrian Bunk <adrian.bunk@movial.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The printk() can deadlock because it can wake up klogd(), and
task enqueueing will try to read the time in order to set a hrtimer.
Reported-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Debugged-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will be called each time the scheduling domains are rebuild.
Needed for architectures that don't have a static cpu topology.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Needed so it can be called from outside of sched.c.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
TREE_AVG and APPROX_AVG are initial task placement policies that have been
disabled for a long while.. time to remove them.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Pavel Emelyanov <xemul@openvz.org>
This patch is based on the one from Thomas.
The kauditd_thread() calls the netlink_unicast() and passes
the audit_pid to it. The audit_pid, in turn, is received from
the user space and the tool (I've checked the audit v1.6.9)
uses getpid() to pass one in the kernel. Besides, this tool
doesn't bind the netlink socket to this id, but simply creates
it allowing the kernel to auto-bind one.
That's the preamble.
The problem is that netlink_autobind() _does_not_ guarantees
that the socket will be auto-bound to the current pid. Instead
it uses the current pid as a hint to start looking for a free
id. So, in case of conflict, the audit messages can be sent
to a wrong socket. This can happen (it's unlikely, but can be)
in case some task opens more than one netlink sockets and then
the audit one starts - in this case the audit's pid can be busy
and its socket will be bound to another id.
The proposal is to introduce an audit_nlk_pid in audit subsys,
that will point to the netlink socket to send packets to. It
will most often be equal to audit_pid. The socket id can be
got from the skb's netlink CB right in the audit_receive_msg.
The audit_nlk_pid reset to 0 is not required, since all the
decisions are taken based on audit_pid value only.
Later, if the audit tools will bind the socket themselves, the
kernel will have to provide a way to setup the audit_nlk_pid
as well.
A good side effect of this patch is that audit_pid can later
be converted to struct pid, as it is not longer safe to use
pid_t-s in the presence of pid namespaces. But audit code still
uses the tgid from task_struct in the audit_signal_info and in
the audit_filter_syscall.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Revert commit 1ada5cba6a ("clocksource:
make clocksource watchdog cycle through online CPUs") due to the
regression reported by Gabriel C at
http://lkml.org/lkml/2008/2/24/281
(short vesion: it makes TSC be marked as always unstable on his
machine).
Cc: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Robert Hancock <hancockr@shaw.ca>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Gabriel C <nix.or.die@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wakeup-buddy tasks are cache-hot - this makes it a bit harder
for the load-balancer to tear them apart. (but it's still possible,
if the load is sufficiently assymetric)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
improve affine wakeups. Maintain the 'overlap' metric based on CFS's
sum_exec_runtime - which means the amount of time a task executes
after it wakes up some other task.
Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
it means there's strong workload coupling between this task and the
woken up task. If the 'overlap' is large then the workload is decoupled
and the scheduler will move them to separate CPUs more easily.
( Also slightly move the preempt_check within try_to_wake_up() - this has
no effect on functionality but allows 'early wakeups' (for still-on-rq
tasks) to be correctly accounted as well.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
split out the affine-wakeup bits.
No code changed:
kernel/sched.o:
text data bss dec hex filename
42521 2858 232 45611 b22b sched.o.before
42521 2858 232 45611 b22b sched.o.after
md5:
9d76738f1272aa82f0b7affd2f51df6b sched.o.before.asm
09b31c44e9aff8666f72773dc433e2df sched.o.after.asm
(the md5's changed because stack slots changed and some registers
get scheduled by gcc in a different order - but otherwise the before
and after assembly is instruction for instruction equivalent.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If subbuf_pages was larger than the max number of pages the pipe
buffer will hold, subbuf_splice_actor() would happily go beyond
the array size.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Use the existing calc_delta_mine() calculation for sched_slice(). This
saves a divide and simplifies the code because we share it with the
other /cfs_rq->load users.
It also improves code size:
text data bss dec hex filename
42659 2740 144 45543 b1e7 sched.o.before
42093 2740 144 44977 afb1 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Fair sleepers need to scale their latency target down by runqueue
weight. Otherwise busy systems will gain ever larger sleep bonus.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Currently we schedule to the leftmost task in the runqueue. When the
runtimes are very short because of some server/client ping-pong,
especially in over-saturated workloads, this will cycle through all
tasks trashing the cache.
Reduce cache trashing by keeping dependent tasks together by running
newly woken tasks first. However, by not running the leftmost task first
we could starve tasks because the wakee can gain unlimited runtime.
Therefore we only run the wakee if its within a small
(wakeup_granularity) window of the leftmost task. This preserves
fairness, but does alternate server/client task groups.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clear the cached inverse value when updating load. This is needed for
calc_delta_mine() to work correctly when using the rq load.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Current min_vruntime tracking is incorrect and will cause serious
problems when we don't run the leftmost task for some reason.
min_vruntime does two things; 1) it's used to determine a forward
direction when the u64 vruntime wraps, 2) it's used to track the
leftmost vruntime to position newly enqueued tasks from.
The current logic advances min_vruntime whenever the current task's
vruntime advance. Because the current task may pass the leftmost task
still waiting we're failing the second goal. This causes new tasks to be
placed too far ahead and thus penalizes their runtime.
Fix this by making min_vruntime the min_vruntime of the waiting tasks by
tracking it in enqueue/dequeue, and compare against current's vruntime
to obtain the absolute minimum when placing new tasks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a hard to trigger crash seen in the -rt kernel that also affects
the vanilla scheduler.
There is a race condition between schedule() and some dequeue/enqueue
functions; rt_mutex_setprio(), __setscheduler() and sched_move_task().
When scheduling to idle, idle_balance() is called to pull tasks from
other busy processor. It might drop the rq lock. It means that those 3
functions encounter on_rq=0 and running=1. The current task should be
put when running.
Here is a possible scenario:
CPU0 CPU1
| schedule()
| ->deactivate_task()
| ->idle_balance()
| -->load_balance_newidle()
rt_mutex_setprio() |
| --->double_lock_balance()
*get lock *rel lock
* on_rq=0, ruuning=1 |
* sched_class is changed |
*rel lock *get lock
: |
:
->put_prev_task_rt()
->pick_next_task_fair()
=> panic
The current process of CPU1(P1) is scheduling. Deactivated P1, and the
scheduler looks for another process on other CPU's runqueue because CPU1
will be idle. idle_balance(), load_balance_newidle() and
double_lock_balance() are called and double_lock_balance() could drop
the rq lock. On the other hand, CPU0 is trying to boost the priority of
P1. The result of boosting only P1's prio and sched_class are changed to
RT. The sched entities of P1 and P1's group are never put. It makes
cfs_rq invalid, because the cfs_rq has curr and no leaf, but
pick_next_task_fair() is called, then the kernel panics.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move 00-INDEX entries to power/00-INDEX (and add entry for
pm_qos_interface.txt).
Update references to moved filenames.
Fix some trailing whitespace.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
There is a problem in the hibernation code that triggers on some NUMA
systems on which pfn_valid() returns 'true' for some PFNs that don't
belong to any zone. Namely, there is a BUG_ON() in
memory_bm_find_bit() that triggers for PFNs not belonging to any
zone and passing the pfn_valid() test. On the affected systems it
triggers when we mark PFNs reported by the platform as not saveable,
because the PFNs in question belong to a region mapped directly using
iorepam() (i.e. the ACPI data area) and they pass the pfn_valid()
test.
Modify memory_bm_find_bit() so that it returns an error if given PFN
doesn't belong to any zone instead of crashing the kernel and ignore
the result returned by it in mark_nosave_pages(), while marking the
"nosave" memory regions.
This doesn't affect the hibernation functionality, as we won't touch
the PFNs in question anyway.
http://bugzilla.kernel.org/show_bug.cgi?id=9966 .
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map. This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.
The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.
So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING intead of
CPU_DOWN_PREPARE.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The original preemptible-RCU patch put the choice between classic and
preemptible RCU into kernel/Kconfig.preempt, which resulted in build failures
on machines not supporting CONFIG_PREEMPT. This choice was therefore moved to
init/Kconfig, which worked, but placed the choice between classic and
preemptible RCU at the top level, a very obtuse choice indeed.
This patch changes from the Kconfig "choice" mechanism to a pair of booleans,
only one of which (CONFIG_PREEMPT_RCU) is user-visible, and is located in
kernel/Kconfig.preempt, where one would expect it to be. The other
(CONFIG_CLASSIC_RCU) is in init/Kconfig so that it is available to all
architectures, hopefully avoiding build breakage. Thanks to Roman Zippel for
suggesting this approach.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Return value convention of module's init functions is 0/-E. Sometimes,
e.g. during forward-porting mistakes happen and buggy module created,
where result of comparison "workqueue != NULL" is propagated all the way up
to sys_init_module. What happens is that some other module created
workqueue in question, our module created it again and module was
successfully loaded.
Or it could be some other bug.
Let's make such mistakes much more visible. In retrospective, such
messages would noticeably shorten some of my head-scratching sessions.
Note, that dump_stack() is just a way to get attention from user. Sample
message:
sys_init_module: 'foo'->init suspiciously returned 1, it should follow 0/-E convention
sys_init_module: loading module anyway...
Pid: 4223, comm: modprobe Not tainted 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #5
Call Trace:
[<ffffffff80254b05>] sys_init_module+0xe5/0x1d0
[<ffffffff8020b39b>] system_call_after_swapgs+0x7b/0x80
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c9a3ba55 (module: wait for dependent modules doing init.) didn't quite
work because the waiter holds the module lock, meaning that the state of the
module it's waiting for cannot change.
Fortunately, it's fairly simple to update the state outside the lock and do
the wakeup.
Thanks to Jan Glauber for tracking this down and testing (qdio and qeth).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
time: remove obsolete CLOCK_TICK_ADJUST
time: don't touch an offlined CPU's ts->tick_stopped in tick_cancel_sched_timer()
time: prevent the loop in timespec_add_ns() from being optimised away
ntp: use unsigned input for do_div()
We currently set the root-domain online span automatically when the
domain is added to the cpu if the cpu is already a member of
cpu_online_map.
This was done as a hack/bug-fix for s2ram, but it also causes a problem
with hotplug CPU_DOWN transitioning. The right way to fix the original
problem is to actually respond to CPU_UP events, instead of CPU_ONLINE,
which is already too late.
This solves the hung reboot regression reported by Andrew Morton and
others.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first version of the ntp_interval/tick_length inconsistent usage patch was
recently merged as bbe4d18ac2http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bbe4d18ac2e058c56adb0cd71f49d9ed3216a405
While the fix did greatly improve the situation, it was correctly pointed out
by Roman that it does have a small bug: If the users change clocksources after
the system has been running and NTP has made corrections, the correctoins made
against the old clocksource will be applied against the new clocksource,
causing error.
The second attempt, which corrects the issue in the NTP_INTERVAL_LENGTH
definition has also made it up-stream as commit
e13a2e61ddhttp://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e13a2e61dd5152f5499d2003470acf9c838eab84
Roman has correctly pointed out that CLOCK_TICK_ADJUST is calculated
based on the PIT's frequency, and isn't really relevant to non-PIT
driven clocksources (that is, clocksources other then jiffies and pit).
This patch reverts both of those changes, and simply removes
CLOCK_TICK_ADJUST.
This does remove the granularity error correction for users of PIT and Jiffies
clocksource users, but the granularity error but for the majority of users, it
should be within the 500ppm range NTP can accommodate for.
For systems that have granularity errors greater then 500ppm, the
"ntp_tick_adj=" boot option can be used to compensate.
[johnstul@us.ibm.com: provided changelog]
[mattilinnanvuori@yahoo.com: maek ntp_tick_adj static]
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Matti Linnanvuori <mattilinnanvuori@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Silences WARN_ONs in rcu_enter_nohz() and rcu_exit_nohz(), which appeared
before caused by (repeated) calls to:
$ echo 0 > /sys/devices/system/cpu/cpu1/online
$ echo 1 > /sys/devices/system/cpu/cpu1/online
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Cc: johnstul@us.ibm.com
Cc: Rafael Wysocki <rjw@sisk.pl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The kernel NTP code shouldn't hand 64-bit *signed* values to do_div(). Make it
instead hand 64-bit unsigned values. This gets rid of a couple of warnings.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In commit ee7c82da83 ("wait_task_stopped:
simplify and fix races with SIGCONT/SIGKILL/untrace"), the magic (short)
cast when storing si_code was lost in wait_task_stopped. This leaks the
in-kernel CLD_* values that do not match what userland expects.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch checks if we can set the rt_runtime_us to 0. If there is a
realtime task in the group, we don't want to set the rt_runtime_us as 0
or bad things will happen. (that task wont get any CPU time despite
being TASK_RUNNNG)
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
it was only possible to configure the rt-group scheduling parameters
beyond the default value in a very small range.
that's because div64_64() has a different calling convention than
do_div() :/
fix a few untidies while we are here; sysctl_sched_rt_period may overflow
due to that multiplication, so cast to u64 first. Also that RUNTIME_INF
juggling makes little sense although its an effective NOP.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Function sys_sched_rr_get_interval returns wrong time slice value for
SCHED_FIFO tasks. The time slice for SCHED_FIFO tasks should be 0.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Sripathi Kodi reported a crash in the -rt kernel:
https://bugzilla.redhat.com/show_bug.cgi?id=435674
this is due to a place that can reschedule a task without holding
the tasks runqueue lock. This was caused by the RT balancing code
that pulls RT tasks to the current run queue and will reschedule the
current task.
There's a slight chance that the pulling of the RT tasks will release
the current runqueue's lock and retake it (in the double_lock_balance).
During this time that the runqueue is released, the current task can
migrate to another runqueue.
In the prio_changed_rt code, after the pull, if the current task is of
lesser priority than one of the RT tasks pulled, resched_task is called
on the current task. If the current task had migrated in that small
window, resched_task will be called without holding the runqueue lock
for the runqueue that the task is on.
This race condition also exists in the mainline kernel and this patch
adds a check to make sure the task hasn't migrated before calling
resched_task.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Tested-by: Sripathi Kodi <sripathik@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kei Tokunaga reported an interactivity problem when moving tasks
between control groups.
Tasks would retain their old vruntime when moved between groups, this
can cause funny lags. Re-set the vruntime on group move to fit within
the new tree.
Reported-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mm migration is no longer done in cpuset_update_task_memory_state() so it
can no longer take current->mm->mmap_sem, so fix the obsolete comment.
[ This changed in commit 04c19fa6f1
("cpuset: migrate all tasks in cpuset at once") when the mm migration
was moved from cpuset_update_task_memory_state() to update_nodemask() ]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A change after 2.6.24 broke ndiswrapper by accidentally removing its
access to GPL-only symbols. Revert that change and add comments about
the reasons why ndiswrapper and driverloader are treated in a special
way.
Signed-off-by: Pavel Roskin <proski@gnu.org>
Acked-by: Greg KH <gregkh@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jon Masters <jonathan@jonmasters.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug in regiseter_kretprobe() which does not check rp->kp.symbol_name ==
NULL before calling kprobe_lookup_name.
For maintainability, this introduces kprobe_addr helper function which
resolves addr field. It is used by register_kprobe and register_kretprobe.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_marker() may return NULL, so test for it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add CONFIG_HAVE_KRETPROBES to the arch/<arch>/Kconfig file for relevant
architectures with kprobes support. This facilitates easy handling of
in-kernel modules (like samples/kprobes/kretprobe_example.c) that depend on
kretprobes being present in the kernel.
Thanks to Sam Ravnborg for helping make the patch more lean.
Per Mathieu's suggestion, added CONFIG_KRETPROBES and fixed up dependencies.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory controller has a requirement that while writing values, we need
to use echo -n. This patch fixes the problem and makes the UI more consistent.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The documentation says the default value of notify_on_release of a child
cgroup is inherited from its parent, which is reasonable, but the
implementation just sets the flag disabled.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following commits cause a number of regressions:
commit 58e2d4ca58
Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Date: Fri Jan 25 21:08:00 2008 +0100
sched: group scheduling, change how cpu load is calculated
commit 6b2d770026
Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Date: Fri Jan 25 21:08:00 2008 +0100
sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
Namely:
- very frequent wakeups on SMP, reported by PowerTop users.
- cacheline trashing on (large) SMP
- some latencies larger than 500ms
While there is a mergeable patch to fix the latter, the former issues
are not fixable in a manner suitable for .25 (we're at -rc3 now).
Hence we revert them and try again in v2.6.26.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This changes the "freezer" code used by suspend/hibernate in its treatment
of tasks in TASK_STOPPED (job control stop) and TASK_TRACED (ptrace) states.
As I understand it, the intent of the "freezer" is to hold all tasks
from doing anything significant. For this purpose, TASK_STOPPED and
TASK_TRACED are "frozen enough". It's possible the tasks might resume
from ptrace calls (if the tracer were unfrozen) or from signals
(including ones that could come via timer interrupts, etc). But this
doesn't matter as long as they quickly block again while "freezing" is
in effect. Some minor adjustments to the signal.c code make sure that
try_to_freeze() very shortly follows all wakeups from both kinds of
stop. This lets the freezer code safely leave stopped tasks unmolested.
Changing this fixes the longstanding bug of seeing after resuming from
suspend/hibernate your shell report "[1] Stopped" and the like for all
your jobs stopped by ^Z et al, as if you had freshly fg'd and ^Z'd them.
It also removes from the freezer the arcane special case treatment for
ptrace'd tasks, which relied on intimate knowledge of ptrace internals.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. exit_notify() always calls kill_orphaned_pgrp(). This is wrong, we
should do this only when the whole process exits.
2. exit_notify() uses "current" as "ignored_task", obviously wrong.
Use ->group_leader instead.
Test case:
void hup(int sig)
{
printf("HUP received\n");
}
void *tfunc(void *arg)
{
sleep(2);
printf("sub-thread exited\n");
return NULL;
}
int main(int argc, char *argv[])
{
if (!fork()) {
signal(SIGHUP, hup);
kill(getpid(), SIGSTOP);
exit(0);
}
pthread_t thr;
pthread_create(&thr, NULL, tfunc, NULL);
sleep(1);
printf("main thread exited\n");
syscall(__NR_exit, 0);
return 0;
}
output:
main thread exited
HUP received
Hangup
With this patch the output is:
main thread exited
sub-thread exited
HUP received
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
p->exit_state != 0 doesn't mean this process is dead, it may have
sub-threads. Change the code to use "p->exit_state && thread_group_empty(p)"
instead.
Without this patch, ^Z doesn't deliver SIGTSTP to the foreground process
if the main thread has exited.
However, the new check is not perfect either. There is a window when
exit_notify() drops tasklist and before release_task(). Suppose that
the last (non-leader) thread exits. This means that entire group exits,
but thread_group_empty() is not true yet.
As Eric pointed out, is_global_init() is wrong as well, but I did not
dare to do other changes.
Just for the record, has_stopped_jobs() is absolutely wrong too. But we
can't fix it now, we should first fix SIGNAL_STOP_STOPPED issues.
Even with this patch ^Z doesn't play well with the dead main thread.
The task is stopped correctly but do_wait(WSTOPPED) won't see it. This
is another unrelated issue, will be (hopefully) fixed separately.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Factor out the common code in reparent_thread() and exit_notify().
No functional changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hi,
While we are looking at the printk issue, I see that its printk'ing the EOE
(end of event) records which is really not something that we need in syslog.
Its really intended for the realtime audit event stream handled by the audit
daemon. So, lets avoid printk'ing that record type.
Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On the latest kernels if one was to load about 15 rules, set the failure
state to panic, and then run service auditd stop the kernel will panic.
This is because auditd stops, then the script deletes all of the rules.
These deletions are sent as audit messages out of the printk kernel
interface which is already known to be lossy. These will overun the
default kernel rate limiting (10 really fast messages) and will call
audit_panic(). The same effect can happen if a slew of avc's come
through while auditd is stopped.
This can be fixed a number of ways but this patch fixes the problem by
just not panicing if auditd is not running. We know printk is lossy and
if the user chooses to set the failure mode to panic and tries to use
printk we can't make any promises no matter how hard we try, so why try?
At least in this way we continue to get lost message accounting and will
eventually know that things went bad.
The other change is to add a new call to audit_log_lost() if auditd
disappears. We already pulled the skb off the queue and couldn't send
it so that message is lost. At least this way we will account for the
last message and panic if the machine is configured to panic. This code
path should only be run if auditd dies for unforeseen reasons. If
auditd closes correctly audit_pid will get set to 0 and we won't walk
this code path.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix the following compiler warning by using "%zu" as defined in C99.
CC kernel/auditsc.o
kernel/auditsc.c: In function 'audit_log_single_execve_arg':
kernel/auditsc.c:1074: warning: format '%ld' expects type 'long int', but
argument 4 has type 'size_t'
Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch fixes a potentially invalid access to a per-CPU variable in
rcu_process_callbacks().
This per-CPU access needs to be done in such a way as to guarantee that
the code using it cannot move to some other CPU before all uses of the
value accessed have completed. Even though this code is currently only
invoked from softirq context, which currrently cannot migrate to some
other CPU, life would be better if this code did not silently make such
an assumption.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This fixes a oops encountered when doing hibernate/resume in presence of
PREEMPT_RCU.
The problem was that the code failed to disable preemption when
accessing a per-CPU variable. This is OK when called from code that
already has preemption disabled, but such is not the case from the
suspend/resume code path.
Reported-by: Dave Young <hidave.darkstar@gmail.com>
Tested-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kthread_stop() can be called when a 'watchdog' thread is executing after
kthread_should_stop() but before set_task_state(TASK_INTERRUPTIBLE).
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The
idle CPU will not progress the RCU through its grace period and a
synchronize_rcu my get stuck. Without this patch I have a box that will
not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine
with this patch.
This patch comes from the -rt kernel where it has been tested for
several months.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'v2.6.25-rc3-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep:
Subject: lockdep: include all lock classes in all_lock_classes
lockdep: increase MAX_LOCK_DEPTH
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
latencytop: change /proc task_struct access method
latencytop: fix memory leak on latency proc file
latencytop: fix kernel panic while reading latency proc file
sched: add declaration of sched_tail to sched.h
sched: fix signedness warnings in sched.c
sched: clean up __pick_last_entity() a bit
sched: remove duplicate code from sched_fair.c
sched: make early bootup sched_clock() use safer
printk recursion detection prepends message to printk_buf and offsets
printk_buf when actual message is printed but it forgets to trim buffer
length accordingly. This can result in overrun in extreme cases. Fix it.
[ mingo@elte.hu:
bug was introduced by me via:
commit 32a7600668
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 25 21:07:58 2008 +0100
printk: make printk more robust by not allowing recursion
]
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add each lock class to the all_lock_classes list when it is
first registered.
Previously, lock classes were added to all_lock_classes when
the lock class was first used. Since one of the uses of the
list is to find unused locks, this didn't work well.
Signed-off-by: Dale Farnsworth <dale@farnsworth.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unsigned long values are always assigned to switch_count,
make it unsigned long.
kernel/sched.c:3897:15: warning: incorrect type in assignment (different signedness)
kernel/sched.c:3897:15: expected long *switch_count
kernel/sched.c:3897:15: got unsigned long *<noident>
kernel/sched.c:3921:16: warning: incorrect type in assignment (different signedness)
kernel/sched.c:3921:16: expected long *switch_count
kernel/sched.c:3921:16: got unsigned long *<noident>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
pick_task_entity() duplicates existing code. This functionality can be
easily obtained using rb_last(). Avoid code duplication by using rb_last().
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do not call sched_clock() too early. Not only might rq->idle
not be set up - but pure per-cpu data might not be accessible
either.
this solves an ia64 early bootup hang with CONFIG_PRINTK_TIME=y.
Tested-by: Tony Luck <tony.luck@gmail.com>
Acked-by: Tony Luck <tony.luck@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Oleg Nesterov and others have pointed out that on some architectures,
the traditional sequence of
set_current_state(TASK_INTERRUPTIBLE);
if (CONDITION)
return;
schedule();
is racy wrt another CPU doing
CONDITION = 1;
wake_up_process(p);
because while set_current_state() has a memory barrier separating
setting of the TASK_INTERRUPTIBLE state from reading of the CONDITION
variable, there is no such memory barrier on the wakeup side.
Now, wake_up_process() does actually take a spinlock before it reads and
sets the task state on the waking side, and on x86 (and many other
architectures) that spinlock is in fact equivalent to a memory barrier,
but that is not generally guaranteed. The write that sets CONDITION
could move into the critical region protected by the runqueue spinlock.
However, adding a smp_wmb() to before the spinlock should now order the
writing of CONDITION wrt the lock itself, which in turn is ordered wrt
the accesses within the spinlock (which includes the reading of the old
state).
This should thus close the race (which probably has never been seen in
practice, but since smp_wmb() is a no-op on x86, it's not like this will
make anything worse either on the most common architecture where the
spinlock already gave the required protection).
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The list head res->tasks gets initialized twice in find_css_set().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cgroup uses unsigned long for subsys bitops, not unsigned long long.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
opts.release_agent is not kfree()ed in all necessary places.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fix:
- comments about need_forkexit_callback
- comments about release agent
- typo and comment style, etc.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kprobes makes use of preempt_disable(),preempt_enable_noresched() and these
functions inturn call add/sub_preempt_count(). So we need to refuse user from
inserting probe in to these functions.
This patch disallows user from probing add/sub_preempt_count().
Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Not all architectures implement futex_atomic_cmpxchg_inatomic(). The default
implementation returns -ENOSYS, which is currently not handled inside of the
futex guts.
Futex PI calls and robust list exits with a held futex result in an endless
loop in the futex code on architectures which have no support.
Fixing up every place where futex_atomic_cmpxchg_inatomic() is called would
add a fair amount of extra if/else constructs to the already complex code. It
is also not possible to disable the robust feature before user space tries to
register robust lists.
Compile time disabling is not a good idea either, as there are already
architectures with runtime detection of futex_atomic_cmpxchg_inatomic support.
Detect the functionality at runtime instead by calling
cmpxchg_futex_value_locked() with a NULL pointer from the futex initialization
code. This is guaranteed to fail, but the call of
futex_atomic_cmpxchg_inatomic() happens with pagefaults disabled.
On architectures, which use the asm-generic implementation or have a runtime
CPU feature detection, a -ENOSYS return value disables the PI/robust features.
On architectures with a working implementation the call returns -EFAULT and
the PI/robust features are enabled.
The relevant syscalls return -ENOSYS and the robust list exit code is blocked,
when the detection fails.
Fixes http://lkml.org/lkml/2008/2/11/149
Originally reported by: Lennart Buytenhek
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Riku Voipio <riku.voipio@movial.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the futex init code fails to initialize the futex pseudo file system it
returns early without initializing the hash queues. Should the boot succeed
then a futex syscall which tries to enqueue a waiter on the hashqueue will
crash due to the unitilialized plist heads.
Initialize the hash queues before the filesystem.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Riku Voipio <riku.voipio@movial.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During the last step of hibernation in the "platform" mode (with the
help of ACPI) we use the suspend code, including the devices'
->suspend() methods, to prepare the system for entering the ACPI S4
system sleep state.
But at least for some devices the operations performed by the
->suspend() callback in that case must be different from its operations
during regular suspend.
For this reason, introduce the new PM event type PM_EVENT_HIBERNATE and
pass it to the device drivers' ->suspend() methods during the last phase
of hibernation, so that they can distinguish this case and handle it as
appropriate. Modify the drivers that handle PM_EVENT_SUSPEND in a
special way and need to handle PM_EVENT_HIBERNATE in the same way.
These changes are necessary to fix a hibernation regression related
to the i915 driver (ref. http://lkml.org/lkml/2008/2/22/488).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thanks to Alexey for the testing and the fix of the fix.
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Make hibernation work with CONFIG_DEBUG_PAGEALLOC set on x86, by
checking if the pages to be copied are marked as present in the
kernel mapping and temporarily marking them as present if that's not
the case. No functional modifications are introduced if
CONFIG_DEBUG_PAGEALLOC is unset.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
The default_disable() function was changed in commit:
76d2160147
genirq: do not mask interrupts by default
It removed the mask function in favour of the default delayed
interrupt disabling. Unfortunately this also broke the shutdown in
free_irq() when the last handler is removed from the interrupt for
those architectures which rely on the default implementations. Now we
can end up with a enabled interrupt line after the last handler was
removed, which can result in spurious interrupts.
Fix this by adding a default_shutdown function, which is only
installed, when the irqchip implementation does provide neither a
shutdown nor a disable function.
[@stable: affected versions: .21 - .24 ]
Pointed-out-by: Michael Hennerich <Michael.Hennerich@analog.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Tested-by: Michael Hennerich <Michael.Hennerich@analog.com>
The functions time_before, time_before_eq, time_after, and
time_after_eq are more robust for comparing jiffies against other
values.
So following patch implements usage of the time_after() macro, defined
at linux/jiffies.h, which deals with wrapping correctly
Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Clearly this was supposed to be an == not an = in the if statement.
This patch also causes us to stop processing execve args once we have
failed rather than continuing to loop on failure over and over and over.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Relative expiry time can get negative, so it should be signed.
Signed-off-by: Pavel Machek <Pavel@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
d_path() is used on a <dentry,vfsmount> pair. Lets use a struct path to
reflect this.
[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Bryan Wu <bryan.wu@analog.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
audit_log_d_path() is a d_path() wrapper that is used by the audit code. To
use a struct path in audit_log_d_path() I need to embed it into struct
avc_audit_data.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Use struct path in fs_struct.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Add path_put() functions for releasing a reference to the dentry and
vfsmount of a struct path in the right order
* Switch from path_release(nd) to path_put(&nd->path)
* Rename dput_path() to path_put_conditional()
[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.
Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
<dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
struct path in every place where the stack can be traversed
- it reduces the overall code size:
without patch series:
text data bss dec hex filename
5321639 858418 715768 6895825 6938d1 vmlinux
with patch series:
text data bss dec hex filename
5320026 858418 715768 6894212 693284 vmlinux
This patch:
Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This test seems to be unnecessary since we always have rootfs mounted before
calling a usermodehelper.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A CLOCK_REALTIME timer, which has an absolute expiry time less than
the clock realtime offset calls with a negative delta into the clock
events code and triggers the WARN_ON() there.
This is a false positive and needs to be prevented. Check the result
of timer->expires - timer->base->offset right away and return -ETIME
right away.
Thanks to Frans Pop, who reported the problem and tested the fixes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Frans Pop <elendil@planet.nl>
Various user space callers ask for relative timeouts. While we fixed
that overflow issue in hrtimer_start(), the sites which convert
relative user space values to absolute timeouts themself were uncovered.
Instead of putting overflow checks into each place add a function
which does the sanity checking and convert all affected callers to use
it.
Thanks to Frans Pop, who reported the problem and tested the fixes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Frans Pop <elendil@planet.nl>
RCU style multiple probes support for the Linux Kernel Markers. Common case
(one probe) is still fast and does not require dynamic allocation or a
supplementary pointer dereference on the fast path.
- Move preempt disable from the marker site to the callback.
Since we now have an internal callback, move the preempt disable/enable to the
callback instead of the marker site.
Since the callback change is done asynchronously (passing from a handler that
supports arguments to a handler that does not setup the arguments is no
arguments are passed), we can safely update it even if it is outside the
preempt disable section.
- Move probe arm to probe connection. Now, a connected probe is automatically
armed.
Remove MARK_MAX_FORMAT_LEN, unused.
This patch modifies the Linux Kernel Markers API : it removes the probe
"arm/disarm" and changes the probe function prototype : it now expects a
va_list * instead of a "...".
If we want to have more than one probe connected to a marker at a given
time (LTTng, or blktrace, ssytemtap) then we need this patch. Without it,
connecting a second probe handler to a marker will fail.
It allow us, for instance, to do interesting combinations :
Do standard tracing with LTTng and, eventually, to compute statistics
with SystemTAP, or to have a special trigger on an event that would call
a systemtap script which would stop flight recorder tracing.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
hold hugetlb_lock over the call. Use a dummy variable to store the sysctl
result, like in hugetlb_sysctl_handler(), then grab the lock to update
nr_overcommit_huge_pages.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Reported-by: Miles Lane <miles.lane@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fastcall always expands to empty, remove it.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This comment caused some consternation during fastcall removal. Make it
truthful.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
sched: rt-group: refure unrunnable tasks
sched: rt-group: clean up the ifdeffery
sched: rt-group: make rt groups scheduling configurable
sched: rt-group: interface
sched: rt-group: deal with PI
sched: fix incorrect irq lock usage in normalize_rt_tasks()
sched: fair-group: separate tg->shares from task_group_lock
hrtimer: more hrtimer_init_sleeper() fallout.
Refuse to accept or create RT tasks in groups that can't run them.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up some of the excessive ifdeffery introduces in the last patch.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the rt group scheduler compile time configurable.
Keep it experimental for now.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change the rt_ratio interface to rt_runtime_us, to match rt_period_us.
This avoids picking a granularity for the ratio.
Extend the /sys/kernel/uids/<uid>/ interface to allow setting
the group's rt_runtime.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven mentioned the fun case where a lock holding task will be throttled.
Simple fix: allow groups that have boosted tasks to run anyway.
If a runnable task in a throttled group gets boosted the dequeue/enqueue
done by rt_mutex_setprio() is enough to unthrottle the group.
This is ofcourse not quite correct. Two possible ways forward are:
- second prio array for boosted tasks
- boost to a prio ceiling (this would also work for deadline scheduling)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
lockdep spotted this bogus irq locking. normalize_rt_tasks() can be called
from hardirq context through sysrq-n
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The USEC_TO_HZ and HZ_TO_USEC constant sets were mislabelled, with
seriously incorrect results. This among other things manifested
itself as cpufreq not working when a tickless kernel was configured.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Tested-by: Carlos R. Mafra <crmafra@ift.unesp.br>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hrtimer_nanosleep_restart() clears/restores restart_block->fn. This is
pointless and complicates its usage. Note that if sys_restart_syscall()
doesn't actually happen, we have a bogus "pending" restart->fn anyway,
this is harmless.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Spotted by Pavel Emelyanov and Alexey Dobriyan.
compat_sys_nanosleep() implicitly uses hrtimer_nanosleep_restart(), this can't
work. Make a suitable compat_nanosleep_restart() helper.
Introduced by commit c70878b4e0
hrtimer: hook compat_sys_nanosleep up to high res timer code
Also, set ->addr_limit = KERNEL_DS before doing hrtimer_nanosleep(), this func
was changed by the previous patch and now takes the "__user *" parameter.
Thanks to Ingo Molnar for fixing the bug in this patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Spotted by Pavel Emelyanov and Alexey Dobriyan.
hrtimer_nanosleep() sets restart_block->arg1 = rmtp, but this rmtp points to
the local variable which lives in the caller's stack frame. This means that
if sys_restart_syscall() actually happens and it is interrupted as well, we
don't update the user-space variable, but write into the already dead stack
frame.
Introduced by commit 04c227140f
hrtimer: Rework hrtimer_nanosleep to make sys_compat_nanosleep easier
Change the callers to pass "__user *rmtp" to hrtimer_nanosleep(), and change
hrtimer_nanosleep() to use copy_to_user() to actually update *rmtp.
Small problem remains. man 2 nanosleep states that *rtmp should be written if
nanosleep() was interrupted (it says nothing whether it is OK to update *rmtp
if nanosleep returns 0), but (with or without this patch) we can dirty *rem
even if nanosleep() returns 0.
NOTE: this patch doesn't change compat_sys_nanosleep(), because it has other
bugs. Fixed by the next patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
include/linux/hrtimer.h | 2 -
kernel/hrtimer.c | 51 +++++++++++++++++++++++++-----------------------
kernel/posix-timers.c | 14 +------------
3 files changed, 30 insertions(+), 37 deletions(-)
clocksource initialization and error accumulation. This corrects a 280ppm
drift seen on some systems using acpi_pm, and affects other clocksources as
well (likely to a lesser degree).
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'for-2.6.25' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
[POWERPC] Add arch-specific walk_memory_remove() for 64-bit powerpc
[POWERPC] Enable hotplug memory remove for 64-bit powerpc
[POWERPC] Add remove_memory() for 64-bit powerpc
[POWERPC] Make cell IOMMU fixed mapping printk more useful
[POWERPC] Fix potential cell IOMMU bug when switching back to default DMA ops
[POWERPC] Don't enable cell IOMMU fixed mapping if there are no dma-ranges
[POWERPC] Fix cell IOMMU null pointer explosion on old firmwares
[POWERPC] spufs: Fix timing dependent false return from spufs_run_spu
[POWERPC] spufs: No need to have a runnable SPU for libassist update
[POWERPC] spufs: Update SPU_Status[CISHP] in backing runcntl write
[POWERPC] spufs: Fix state_mutex leaks
[POWERPC] Disable G5 NAP mode during SMU commands on U3
Probing non-ISA interrupts using the handle_percpu_irq as their handle_irq
method may crash the system because handle_percpu_irq does not check
IRQ_WAITING. This for example hits the MIPS Qemu configuration.
This patch provides two helper functions set_irq_noprobe and set_irq_probe to
set rsp. clear the IRQ_NOPROBE flag. The only current caller is MIPS code
but this really belongs into generic code.
As an aside, interrupt probing these days has become a mostly obsolete if not
dangerous art. I think Linux interrupts should be changed to default to
non-probing but that's subject of this patch.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Acked-and-tested-by: Rob Landley <rob@landley.net>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, for every sysfs node, the callers will be responsible for
implementing store operation, so many many callers are doing duplicate
things to validate input, they have the same mistakes because they are
calling simple_strtol/ul/ll/uul, especially for module params, they are
just numeric, but you can echo such values as 0x1234xxx, 07777888 and
1234aaa, for these cases, module params store operation just ignores
succesive invalid char and converts prefix part to a numeric although input
is acctually invalid.
This patch tries to fix the aforementioned issues and implements
strict_strtox serial functions, kernel/params.c uses them to strictly
validate input, so module params will reject such values as 0x1234xxxx and
returns an error:
write error: Invalid argument
Any modules which export numeric sysfs node can use strict_strtox instead of
simple_strtox to reject any invalid input.
Here are some test results:
Before applying this patch:
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000g > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000gggggggg > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 010000 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0100008 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 010000aaaaa > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]#
After applying this patch:
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000g > /sys/module/e1000/parameters/copybreak
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000gggggggg > /sys/module/e1000/parameters/copybreak
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# echo 010000 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# echo 0100008 > /sys/module/e1000/parameters/copybreak
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# echo 010000aaaaa > /sys/module/e1000/parameters/copybreak
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo -n 4096 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]#
[akpm@linux-foundation.org: fix compiler warnings]
[akpm@linux-foundation.org: fix off-by-one found by tiwai@suse.de]
Signed-off-by: Yi Yang <yi.y.yang@intel.com>
Cc: Greg KH <greg@kroah.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following warning:
WARNING: o-x86_64/kernel/built-in.o(.text+0x36d8b): Section mismatch in reference from the function enable_nonboot_cpus() to the function .cpuinit.text:_cpu_up()
enable_nonboot_cpus() are used solely from CONFIG_CONFIG_PM_SLEEP_SMP=y
and PM_SLEEP_SMP imply HOTPLUG_CPU therefore the reference
to _cpu_up() is valid.
Annotate enable_nonboot_cpus() with __ref to silence modpost.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The proper behavior to store task's pid and get this task later is to get the
struct pid pointer and get the task with the pid_task() call.
Make it for rt_mutex_waiter->deadlock_task_pid field.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All the functions that need to lookup a task by pid in posix timers obtain
this pid from a user space, and thus this value refers to a task in the same
namespace, as the current task lives in.
So the proper behavior is to call find_task_by_vpid() here.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the conversion factor between jiffies and milli- or microseconds is
not a single multiply or divide, as for the case of HZ == 300, we currently
do a multiply followed by a divide. The intervening result, however, is
subject to overflows, especially since the fraction is not simplified (for
HZ == 300, we multiply by 300 and divide by 1000).
This is exposed to the user when passing a large timeout to poll(), for
example.
This patch replaces the multiply-divide with a reciprocal multiplication on
32-bit platforms. When the input is an unsigned long, there is no portable
way to do this on 64-bit platforms there is no portable way to do this
since it requires a 128-bit intermediate result (which gcc does support on
64-bit platforms but may generate libgcc calls, e.g. on 64-bit s390), but
since the output is a 32-bit integer in the cases affected, just simplify
the multiply-divide (*3/10 instead of *300/1000).
The reciprocal multiply used can have off-by-one errors in the upper half
of the valid output range. This could be avoided at the expense of having
to deal with a potential 65-bit intermediate result. Since the intent is
to avoid overflow problems and most of the other time conversions are only
semiexact, the off-by-one errors were considered an acceptable tradeoff.
At Ralf Baechle's suggestion, this version uses a Perl script to compute
the necessary constants. We already have dependencies on Perl for kernel
compiles. This does, however, require the Perl module Math::BigInt, which
is included in the standard Perl distribution starting with version 5.8.0.
In order to support older versions of Perl, include a table of canned
constants in the script itself, and structure the script so that
Math::BigInt isn't required if pulling values from said table.
Running the script requires that the HZ value is available from the
Makefile. Thus, this patch also adds the Kconfig variable CONFIG_HZ to the
architectures which didn't already have it (alpha, cris, frv, h8300, m32r,
m68k, m68knommu, sparc, v850, and xtensa.) It does *not* touch the sh or
sh64 architectures, since Paul Mundt has dealt with those separately in the
sh tree.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Ralf Baechle <ralf@linux-mips.org>,
Cc: Sam Ravnborg <sam@ravnborg.org>,
Cc: Paul Mundt <lethal@linux-sh.org>,
Cc: Richard Henderson <rth@twiddle.net>,
Cc: Michael Starvik <starvik@axis.com>,
Cc: David Howells <dhowells@redhat.com>,
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>,
Cc: Hirokazu Takata <takata@linux-m32r.org>,
Cc: Geert Uytterhoeven <geert@linux-m68k.org>,
Cc: Roman Zippel <zippel@linux-m68k.org>,
Cc: William L. Irwin <sparclinux@vger.kernel.org>,
Cc: Chris Zankel <chris@zankel.net>,
Cc: H. Peter Anvin <hpa@zytor.com>,
Cc: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Makes an embedded image a bit smaller.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
delayed_work_timer_fn() is a timer function, make it static.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The scheduled removal of the 'time' option.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the duplicate inclusion of linux/jiffies.h from kernel/printk.c
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remains the question whether it is intended that many, perhaps even large,
tables are compiled in without ever having a chance to get used, i.e.
whether there shouldn't #ifdef CONFIG_xxx get added.
[akpm@linux-foundation.org: fix cut-n-paste error]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix typo in comments.
BTW: I have to fix coding style in arch/ia64/kernel/time.c also, otherwise
checkpatch.pl will be complaining.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Function timekeeping_is_continuous() no longer checks flag
CLOCK_IS_CONTINUOUS, and it checks CLOCK_SOURCE_VALID_FOR_HRES now. So rename
the function accordingly.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Flag CLOCK_SOURCE_WATCHDOG is cleared twice. Note clocksource_change_rating()
won't do anyting with the cs flag.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's only one caller left - the kill_pgrp one - so merge these two
functions and forget the kill_pgrp_info one.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the first step (of two) in removing the kill_pgrp_info.
All the users of this function are in kernel/signal.c, but all they need is to
call __kill_pgrp_info() with the tasklist_lock read-locked.
Fortunately, one of its users is the kill_something_info(), which already
needs this lock in one of its branches, so clean these branches up and call
the __kill_pgrp_info() directly.
Based on Oleg's view of how this function should look.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
_all_ converted to operate on the current pid namespace. After this each call
like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
one.
Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
appropriate.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
signal_struct->tsk points to the ->group_leader and thus we have the nasty
code in de_thread() which has to change it and restart ->real_timer if the
leader is changed.
Use "struct pid *leader_pid" instead. This also allows us to kill now
unneeded send_group_sig_info().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kill_pid_info()->pid_task() could be the old leader of the execing process.
In that case it is possible that the leader will be released before we take
siglock. This means that kill_pid_info() (and thus sys_kill()) can return a
false -ESRCH.
Change the code to retry when lock_task_sighand() fails. The endless loop is
not possible, __exit_signal() both clears ->sighand and does detach_pid().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the new semantics of find_vpid() we don't need to play with ->nsproxy
explicitely, _vxx() do the right things.
Also s/tasklist/rcu/.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pid_vnr returns the user space pid with respect to the pid namespace the
struct pid was allocated in. What we want before we return a pid to user
space is the user space pid with respect to the pid namespace of current.
pid_vnr is a very nice optimization but because it isn't quite what we want
it is easy to use pid_vnr at times when we aren't certain the struct pid
was allocated in our pid namespace.
Currently this describes at least tiocgpgrp and tiocgsid in ttyio.c the
parent process reported in the core dumps and the parent process in
get_signal_to_deliver.
So unless the performance impact is huge having an interface that does what
we want instead of always what we want should be much more reliable and
much less error prone.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This modifies do_wait and eligible child to take a pair of enum pid_type
and struct pid *pid to precisely specify what set of processes are eligible
to be waited for, instead of the raw pid_t value from sys_wait4.
This fixes a bug in sys_waitid where you could not wait for children in
just process group 1.
This fixes a pid namespace crossing case in eligible_child. Allowing us to
wait for a processes in our current process group even if our current
process group == 0.
This allows the no child with this pid case to be optimized. This allows
us to optimize the pid membership test in eligible child to be optimized.
This even closes a theoretical pid wraparound race where in a threaded
parent if two threads are waiting for the same child and one thread picks
up the child and the pid numbers wrap around and generate another child
with that same pid before the other thread is scheduled (teribly insanely
unlikely) we could end up waiting on the second child with the same pid#
and not discover that the specific child we were waiting for has exited.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous bugfix was not optimal, we shouldn't care about group stop
when we are the only thread or the group stop is in progress. In that case
nothing special is needed, just set PF_EXITING and return.
Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify().
So, from the performance POV the only difference is that we don't trust
!signal_pending() until we take ->siglock. But this in fact fixes another
___pure___ theoretical minor race. __group_complete_signal() finds the
task without PF_EXITING and chooses it as the target for signal_wake_up().
But nothing prevents this task from exiting in between without noticing the
pending signal and thus unpredictably delaying the actual delivery.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric's "fix clone(CLONE_NEWPID)" eliminated the last reason for this hack.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_signal_stop() counts all sub-thread and sets ->group_stop_count
accordingly. Every thread should decrement ->group_stop_count and stop,
the last one should notify the parent.
However a sub-thread can exit before it notices the signal_pending(), or it
may be somewhere in do_exit() already. In that case the group stop never
finishes properly.
Note: this is a minimal fix, we can add some optimizations later. Say we
can return quickly if thread_group_empty(). Also, we can move some signal
related code from exit_notify() to exit_signals().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Eric pointed out, there is no problem with init starting with sid == pgid
== 0, and this was historical linux behavior changed in 2.6.18.
Remove kernel_init()->__set_special_pids(), this is unneeded and complicates
the rules for sys_setsid().
This change and the previous change in daemonize() mean that /sbin/init does
not need the special "session != 1" hack in sys_setsid() any longer. We can't
remove this check yet, we should cleanup copy_process(CLONE_NEWPID) first, so
update the comment only.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daemonized kernel threads run in the init's session. This doesn't match the
behaviour of kthread_create()'ed threads, and this is one of the 2 reasons
why we need a special hack in sys_setsid().
Now that set_special_pids() was changed to use struct pid, not pid_t, we can
use init_struct_pid and set 0,0 special pids.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change set_special_pids() to work with struct pid, not pid_t from global name
space. This again speedups and imho cleanups the code, also a preparation for
the next patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_setsid() still deals with pid_t's from the global namespace. This means
that the "session > 1" check can't help for sub-namespace init, setsid() can't
succeed because copy_process(CLONE_NEWPID) populates PIDTYPE_PGID/SID links.
Remove the usage of task_struct->pid and convert the code to use "struct pid".
This also simplifies and speedups the code, saves one find_pid().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sys_setpgid() does unneeded conversions from pid_t to "struct pid" and vice
versa. Use "struct pid" more consistently. Saves one find_vpid() and
eliminates the explicit usage of ->nsproxy->pid_ns. Imho, cleanups the
code.
Also use the same_thread_group() helper.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The first "p->exit_state != EXIT_ZOMBIE" check doesn't make too much sense.
The exit_state was EXIT_ZOMBIE when the function was called, and another
thread can change it to EXIT_DEAD right after the check.
The second condition is not possible, detached non-traced threads were already
filtered out by eligible_child(), we didn't drop tasklist since then.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Surprise, the other two wait_task_*() functions also abuse the
task_pid_nr_ns() function, and may cause read-after-free or report nr == 0
in wait_task_continued(). wait_task_zombie() doesn't have this problem,
but it is still better to cache pid_t rather than call task_pid_nr_ns()
three times on the saved pid_namespace.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Imho, the current usage of security_task_wait() is not logical.
Suppose we have the single child p, and security_task_wait(p) return
-EANY. In that case waitpid(-1) returns this error. Why? Isn't it
better to return ECHLD? We don't really have reapable children.
Now suppose that child was stolen by gdb. In that case we find this
child on ->ptrace_children and set flag = 1, but we don't check that the
child was denied. So, do_wait(..., WNOHANG) returns 0, this doesn't
match the behaviour above. Without WNOHANG do_wait() blocks only to
return the error later, when the child will be untraced. Inho, really
strange.
I think eligible_child() should return the error only if the child's pid
was requested explicitly, otherwise we should silently ignore the tasks
which were nacked by security_task_wait().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
eligible_child() == 2 means delay_group_leader(). With the previous patch
this only matters for EXIT_ZOMBIE task, we can move that special check to
the only place it is really needed.
Also, with this patch we don't skip security_task_wait() for the group
leaders in a non-empty thread group. I don't really understand the exact
semantics of security_task_wait(), but imho this change is a bugfix.
Also rearrange the code a bit to kill an ugly "check_continued" backdoor.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_task_stopped() doesn't need the "delay_group_leader" parameter. If
the child is not traced it must be a group leader. With or without
subthreads ->group_stop_count == 0 when the whole task is stopped.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Mika Penttila <mika.penttila@kolumbus.fi>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the tracer is gone and we are not going to stop, ptrace_stop() sets
->exit_code = nostop_code. However, the tracer could actually clear the
exit code before detaching. In that case get_signal_to_deliver() "resends"
the signal which was cancelled by the debugger. For example, it is
possible that a quick PTRACE_ATTACH + PTRACE_DETACH can leave the tracee in
STOPPED state.
Change the behaviour of ptrace_stop(). If the caller is ptrace notify(),
we should always clear ->exit_code. If the caller is
get_signal_to_deliver(), we should not touch it at all. To do so, change
the nonstop_code parameter to "bool clear_code" and change the callers
accordingly.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every branch if the main "if" statement does the same code at the end. Move
it down. Also, fix the indentation.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_task_stopped() has multiple races with SIGCONT/SIGKILL. tasklist_lock
does not pin the child in TASK_TRACED/TASK_STOPPED stated, almost all info
reported (including exit_code) may be wrong.
In fact, the code under write_lock_irq(tasklist_lock) is not safe. The child
may be PTRACE_DETACH'ed at this time by another subthread, in that case it is
possible we are no longer its ->parent.
Change wait_task_stopped() to take ->siglock before inspecting the task. This
guarantees that the child can't resume and (for example) clear its
->exit_code, so we don't need to use xchg(&p->exit_code) and re-check. The
only exception is ptrace_stop() which changes ->state and ->exit_code without
->siglock held during abort. But this can only happen if both the tracer and
the tracee are dying (coredump is in progress), we don't care.
With this patch wait_task_stopped() doesn't move the child to the end of
the ->parent list on success. This optimization could be restored, but
in that case we have to take write_lock(tasklist) and do some nasty
checks.
Also change the do_wait() since we don't return EAGAIN any longer.
[akpm@linux-foundation.org: fix up after Willy renamed everything]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the tracer went away (may_ptrace_stop() failed), ptrace_stop() drops
tasklist and then changes the ->state from TASK_TRACED to TASK_RUNNING.
This can fool another tracer which attaches to us in between. Change the
->state under tasklist_lock to ensure that ptrace_check_attach() can't wrongly
succeed. Also, remove the unnecessary mb().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not possible to see the PT_PTRACED task without ->signal/sighand under
tasklist_lock, release_task() does ptrace_unlink() first. If the task was
already released before, ptrace_attach() can't succeed and set PT_PTRACED.
Remove this check.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that my_ptrace_child() is trivial we can use the "p->ptrace & PT_PTRACED"
inline and simplify the corresponding logic in do_wait: we can't find the
child in TASK_TRACED state without PT_PTRACED flag set, ptrace_untrace()
either sets TASK_STOPPED or wakes up the tracee.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the patch
"Fix ptrace_attach()/ptrace_traceme()/de_thread() race"
commit f5b40e363a
we set PT_ATTACHED and change child->parent "atomically" wrt task_list lock.
This means we can remove the checks like "PT_ATTACHED && ->parent != ptracer"
which were needed to catch the "ptrace attach is in progress" case. We can
also remove the flag itself since nobody else uses it.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
module.c should not define linker variables on its own. We have an include
file for that.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The module subsystem cannot handle symbols that are zero. If symbols are
present that have a zero value then the module resolver prints out a
message that these symbols are unresolved.
[akinobu.mita@gmail.com: fix __find_symbl() error checks]
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Kay Sievers <kay.sievers@vrfy.org
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we possibly lookup the pid in the wrong pid namespace. So
seq_file convert proc_pid_status which ensures the proper pid namespaces is
passed in.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: another build fix]
[akpm@linux-foundation.org: s390 build fix]
[akpm@linux-foundation.org: fix task_name() output]
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andrew Morgan <morgan@kernel.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just like with the user namespaces, move the namespace management code into
the separate .c file and mark the (already existing) PID_NS option as "depend
on NAMESPACES"
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the user_namespace.o compilation depend on this option and move the
init_user_ns into user.c file to make the kernel compile and work without the
namespaces support. This make the user namespace code be organized similar to
other namespaces'.
Also mask the USER_NS option as "depend on NAMESPACES".
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the IPC namespace management code is spread over the ipc/*.c files.
I moved this code into ipc/namespace.c file which is compiled out when needed.
The linux/ipc_namespace.h file is used to store the prototypes of the
functions in namespace.c and the stubs for NAMESPACES=n case. This is done
so, because the stub for copy_ipc_namespace requires the knowledge of the
CLONE_NEWIPC flag, which is in sched.h. But the linux/ipc.h file itself in
included into many many .c files via the sys.h->sem.h sequence so adding the
sched.h into it will make all these .c depend on sched.h which is not that
good. On the other hand the knowledge about the namespaces stuff is required
in 4 .c files only.
Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
msg.c and shm.c files. It turned out that moving these functions into
namespaces.c is not that easy because they use many other calls and macros
from the original file. Moving them would make this patch complicated. On
the other hand all these functions can be consolidated, so I will send a
separate patch doing this a bit later.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently all the namespace management code is in the kernel/utsname.c file,
so just compile it out and make stubs in the appropriate header.
The init namespace itself is in init/version.c and is in the kernel all the
time.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules
require that all counter modifications occur under the hugetlb_lock. Add a
callback into the hugetlb code similar to the one for nr_hugepages. Grab
the lock around the manipulation of nr_overcommit_hugepages in
proc_doulongvec_minmax().
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
walk_memory_resource() verifies if there are holes in a given memory
range, by checking against /proc/iomem. On x86/ia64 system memory is
represented in /proc/iomem. On powerpc, we don't show system memory as
IO resource in /proc/iomem - instead it's maintained in
/proc/device-tree.
This provides a way for an architecture to provide its own
walk_memory_resource() function. On powerpc, the memory region is
small (16MB), contiguous and non-overlapping. So extra checking
against the device-tree is not needed.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Kumar Gala <galak@gate.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
For readability, all the calls to vmcoreinfo_append_str() are changed to macros
having a prefix "VMCOREINFO_".
This discussion is the following:
http://www.ussg.iu.edu/hypermail/linux/kernel/0709.3/0584.html
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Acked-by: Simon Horman <horms@verge.net.au>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset is for the vmcoreinfo data.
The vmcoreinfo data has the minimum debugging information only for dump
filtering. makedumpfile (dump filtering command) gets it to distinguish
unnecessary pages, and makedumpfile creates a small dumpfile.
This patch:
VMCOREINFO_SIZE() should be renamed VMCOREINFO_STRUCT_SIZE() since it's always
returning the size of the struct with a given name. This change would allow
VMCOREINFO_TYPEDEF_SIZE() to simply become VMCOREINFO_SIZE() since it need not
be used exclusively for typedefs.
This discussion is the following:
http://www.ussg.iu.edu/hypermail/linux/kernel/0709.3/0582.html
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's one place that works with task pids - its the "tasks" file in cgroups.
The read/write handlers assume, that the pid values go to/come from the user
space and thus it is a virtual pid, i.e. the pid as it is seen from inside a
namespace.
Tune the code accordingly.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Narrow the scope of callback_mutex in scan_for_empty_cpusets().
- Avoid rewriting the cpus, mems of cpusets except when it is likely that
we'll be changing them.
- Have remove_tasks_in_empty_cpuset() also check for empty mems.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Cliff Wickman <cpw@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various minor formatting and comment tweaks to Cliff Wickman's
[PATCH_3_of_3]_cpusets__update_cpumask_revision.patch
I had had "iff", meaning "if and only if" in a comment. However, except for
ancient mathematicians, the abbreviation "iff" was a tad too cryptic. Cliff
changed it to "if", presumably figuring that the "iff" was a typo. However,
it was the "only if" half of the conjunction that was most interesting.
Reword to emphasis the "only if" aspect.
The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
callback_mutex had to be held on entry. The opposite is true.
Several mentions of attach_task() in comments needed to be
changed to cgroup_attach_task().
A comment about notify_on_release was no longer relevant,
as the line of code it had commented, namely:
set_bit(CS_RELEASED_RESOURCE, &parent->flags);
is no longer present in that place in the cpuset.c code.
Similarly a comment about notify_on_release before the
scan_for_empty_cpusets() routine was no longer relevant.
Removed extra parentheses and unnecessary return statement.
Renamed attach_task() to cpuset_attach() in various comments.
Removed comment about not needing memory migration, as it seems the migration
is done anyway, via the cpuset_attach() callback from cgroup_attach_task().
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Cliff Wickman <cpw@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of the comments in kernel/cpuset.c were stale following the
transition to control groups; this patch updates them to more closely
match reality.
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the new function cgroup_scan_tasks() to step through all tasks in a
cpuset.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch corrects a situation that occurs when one disables all the cpus in
a cpuset.
Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
which is incorrect because it may then overlap its cpu-exclusive sibling.
Tasks of an empty cpuset should be moved to the cpuset which is the parent of
their current cpuset. Or if the parent cpuset has no cpus, to its parent,
etc.
And the empty cpuset should be released (if it is flagged notify_on_release).
Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
iterate through all tasks in the cpu-less cpuset. We are deliberately
avoiding a walk of the tasklist.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide cgroup_scan_tasks(), which iterates through every task in a cgroup,
calling a test function and a process function for each. And call the process
function without holding the css_set_lock lock.
The idea is David Rientjes', predicting that such a function will make it much
easier in the future to extend things that require access to each task in a
cgroup without holding the lock,
[akpm@linux-foundation.org: cleanup]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a handler "pre_destroy" to cgroup_subsys. It is called before
cgroup_rmdir() checks all subsys's refcnt.
I think this is useful for subsys which have some extra refs even if there
are no tasks in cgroup. By adding pre_destroy(), the kernel keeps the rule
"destroy() against subsystem is called only when refcnt=0." and allows css
ref to be used by other objects than tasks.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Paul Menage <menage@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
dump of all system tasks (excluding kernel threads) when performing an
OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
oom_adj score, and name.
This is helpful for determining why there was an OOM condition and which
rogue task caused it.
It is configurable so that large systems, such as those with several
thousand tasks, do not incur a performance penalty associated with dumping
data they may not desire.
If an OOM was triggered as a result of a memory controller, the tasklist
shall be filtered to exclude tasks that are not a member of the same
cgroup.
Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the interface to use bytes instead of pages. Page sizes can vary
across platforms and configurations. A new strategy routine has been added
to the resource counters infrastructure to format the data as desired.
Suggested by David Rientjes, Andrew Morton and Herbert Poetzl
Tested on a UML setup with the config for memory control enabled.
[kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Basic setup routines, the mm_struct has a pointer to the cgroup that
it belongs to and the the page has a page_cgroup associated with it.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With fixes from David Rientjes <rientjes@google.com>
Introduce generic structures and routines for resource accounting.
Each resource accounting cgroup is supposed to aggregate it,
cgroup_subsystem_state and its resource-specific members within.
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cgroup_is_releasable() and notify_on_release() should be static,
not global inline.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the calls to the cgroup subsystem destroy() methods from
cgroup_rmdir() to cgroup_diput(). This allows control file reads and
writes to access their subsystem state without having to be concerned with
locking against cgroup destruction - the control file dentry will keep the
cgroup and its subsystem state objects alive until the file is closed.
The documentation is updated to reflect the changed semantics of destroy();
additionally the locking comments for destroy() and some other methods were
clarified and decrustified.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simplify the space stripping code in cgroup file write.
[akpm@linux-foundation.org: s/BUG_ON/BUILD_BUG_ON/]
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Coding style fix - one line conditionals don't get braces.
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes dead code spotted by the Coverity checker
(look at the "(nbytes >= PATH_MAX)" check).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The gl_owner_pid field is used to get the lock owning task by its pid, so make
it in a proper manner, i.e. by using the struct pid pointer and pid_task()
function.
The pid_task() becomes exported for the gfs2 module.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When an ACPI table is overridden (for now this can happen only for DSDT)
display a big warning and taint the kernel with flag A.
Signed-off-by: Eric Piel <eric.piel@tremplin-utc.net>
Signed-off-by: Len Brown <len.brown@intel.com>
Provide support to add an optional user defined callback to be run at
function entry of a kretprobe'd function. Also modify the kprobe smoke
tests to include an entry-handler during the kretprobe sanity test.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel has a divide by zero crash when trying to run the system timer
less than 100Hz. The problem is x/(HZ/USER_HZ) and related. Now
x*(USER_HZ/HZ) will be used if HZ<USER_HZ.
I'm running the Linux kernel under qemu and went to run a slower system
timer to take less CPU (and battery) on the host. I found that the kernel
paniced under emulation because of a divide by zero in three places. Here
is the patch. The base git was updated today 01-05-2008. I went for a
20Hz system time by adding config HZ_20 etc to kernel/Kconfig.hz. With
this patch I verified the system timer by looking at /proc/interrupts.
[akpm@linux-foundation.org: partially clean up the macro maze]
Signed-off-by: David Fries <david@fries.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
groups_sort() can be quite long if user loads a large gid table.
This is because GROUP_AT(group_info, some_integer) uses an integer divide.
So having to do XXX thousand divides during one syscall can lead to very
high latencies. (NGROUPS_MAX=65536)
In the past (25 Mar 2006), an analog problem was found in groups_search()
(commit d74beb9f33 ) and at that time I
changed some variables to unsigned int.
I believe that a more generic fix is to make sure NGROUPS_PER_BLOCK is
unsigned.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support to allow asm/ptrace.h to define two new macros,
arch_ptrace_stop_needed and arch_ptrace_stop. These control special
machine-specific actions to be done before a ptrace stop. The new code
compiles away to nothing when the new macros are not defined. This is the
case on all machines to begin with.
On ia64, these macros will be defined to solve the long-standing issue of
ptrace vs register backing store.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert relay from nopage to fault.
Remove redundant vma range checks.
Switch from OOM to SIGBUS if the resource is not available.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NR_OPEN (historically set to 1024*1024) actually forbids processes to open
more than 1024*1024 handles.
Unfortunatly some production servers hit the not so 'ridiculously high
value' of 1024*1024 file descriptors per process.
Changing NR_OPEN is not considered safe because of vmalloc space potential
exhaust.
This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to
1024*1024, so that admins can decide to change this limit if their workload
needs it.
[akpm@linux-foundation.org: export it for sparc64]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minor cleanup. We can remove one "else if" branch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stop using unsigned _longs_ for printk buffer indexes. Log buffer is way
smaller than 2 gigabytes and unsigned ints will work too . Indeed, they do
work nicely on all 32-bit platforms where longs and ints are the same.
With this patch, we have following size savings on amd64:
text data bss dec hex filename
5997 313 17736 24046 5dee 2.6.23.1.t64/kernel/printk.o
5858 313 17700 23871 5d3f 2.6.23.1.printk.t64/kernel/printk.o
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I found that there is a buffer overflow problem in the following code.
Version: 2.6.24-rc2,
File: kernel/time/clocksource.c:417-432
--------------------------------------------------------------------
static ssize_t
sysfs_show_available_clocksources(struct sys_device *dev, char *buf)
{
struct clocksource *src;
char *curr = buf;
spin_lock_irq(&clocksource_lock);
list_for_each_entry(src, &clocksource_list, list) {
curr += sprintf(curr, "%s ", src->name);
}
spin_unlock_irq(&clocksource_lock);
curr += sprintf(curr, "\n");
return curr - buf;
}
-----------------------------------------------------------------------
sysfs_show_current_clocksources() also has the same problem though in practice
the size of current clocksource's name won't exceed PAGE_SIZE.
I fix the bug by using snprintf according to the specification of the kernel
(Version:2.6.24-rc2,File:Documentation/filesystems/sysfs.txt)
Fix sysfs_show_available_clocksources() and sysfs_show_current_clocksources()
buffer overflow problem with snprintf().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every file should include the headers containing the prototypes for its global
functions (in this case {,un}register_reboot_notifier()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the needlessly global srcu_readers_active() static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every file should include the headers containing the prototypes for its global
functions (in this case sys_ptrace()).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When passing a zero address to kallsyms_lookup(), the kernel thought it was
a valid kernel address, even if it is not. This is because is_ksym_addr()
called is_kernel_extratext() and checked against labels that don't exist on
many archs (which default as zero). Since PPC was the only kernel which
defines _extra_text, (in 2005), and no longer needs it, this patch removes
_extra_text support.
For some history (provided by Jon):
http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019734.htmlhttp://ozlabs.org/pipermail/linuxppc-dev/2005-September/019736.htmlhttp://ozlabs.org/pipermail/linuxppc-dev/2005-September/019751.html
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Jon Loeliger <jdl@freescale.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. It is much easier to grep for ->state change if __set_task_state() is used
instead of the direct assignment.
2. ptrace_stop() and handle_group_stop() use set_task_state() which adds the
unneeded mb() (btw even if we use mb() it is still possible that do_wait()
sees the new ->state but not ->exit_code, but this is ok).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the ability to scale cputime into generic code. This allows us
to fix the issue in kernel/timer.c (noticed by Balbir) where we could only
add an unscaled value to the scaled utime/stime.
This adds a cputime_to_scaled function. As before, the POWERPC version
does the scaling based on the last SPURR/PURR ratio calculated. The
generic and s390 (only other arch to implement asm/cputime.h) versions are
both NOPs.
Also moves the SPURR and PURR snapshots closer.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace latency.c use with pm_qos_params use.
Signed-off-by: mark gross <mgross@linux.intel.com>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following patch is a generalization of the latency.c implementation done
by Arjan last year. It provides infrastructure for more than one parameter,
and exposes a user mode interface for processes to register pm_qos
expectations of processes.
This interface provides a kernel and user mode interface for registering
performance expectations by drivers, subsystems and user space applications on
one of the parameters.
Currently we have {cpu_dma_latency, network_latency, network_throughput} as
the initial set of pm_qos parameters.
The infrastructure exposes multiple misc device nodes one per implemented
parameter. The set of parameters implement is defined by pm_qos_power_init()
and pm_qos_params.h. This is done because having the available parameters
being runtime configurable or changeable from a driver was seen as too easy to
abuse.
For each parameter a list of performance requirements is maintained along with
an aggregated target value. The aggregated target value is updated with
changes to the requirement list or elements of the list. Typically the
aggregated target value is simply the max or min of the requirement values
held in the parameter list elements.
>From kernel mode the use of this interface is simple:
pm_qos_add_requirement(param_id, name, target_value):
Will insert a named element in the list for that identified PM_QOS
parameter with the target value. Upon change to this list the new target is
recomputed and any registered notifiers are called only if the target value
is now different.
pm_qos_update_requirement(param_id, name, new_target_value):
Will search the list identified by the param_id for the named list element
and then update its target value, calling the notification tree if the
aggregated target is changed. with that name is already registered.
pm_qos_remove_requirement(param_id, name):
Will search the identified list for the named element and remove it, after
removal it will update the aggregate target and call the notification tree
if the target was changed as a result of removing the named requirement.
>From user mode:
Only processes can register a pm_qos requirement. To provide for
automatic cleanup for process the interface requires the process to register
its parameter requirements in the following way:
To register the default pm_qos target for the specific parameter, the
process must open one of /dev/[cpu_dma_latency, network_latency,
network_throughput]
As long as the device node is held open that process has a registered
requirement on the parameter. The name of the requirement is
"process_<PID>" derived from the current->pid from within the open system
call.
To change the requested target value the process needs to write a s32
value to the open device node. This translates to a
pm_qos_update_requirement call.
To remove the user mode request for a target value simply close the device
node.
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix build again]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: mark gross <mgross@linux.intel.com>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Adam Belay <abelay@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel_shutdown_prepare() can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
resume_file[] and create_image() can become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The capability bounding set is a set beyond which capabilities cannot grow.
Currently cap_bset is per-system. It can be manipulated through sysctl,
but only init can add capabilities. Root can remove capabilities. By
default it includes all caps except CAP_SETPCAP.
This patch makes the bounding set per-process when file capabilities are
enabled. It is inherited at fork from parent. Noone can add elements,
CAP_SETPCAP is required to remove them.
One example use of this is to start a safer container. For instance, until
device namespaces or per-container device whitelists are introduced, it is
best to take CAP_MKNOD away from a container.
The bounding set will not affect pP and pE immediately. It will only
affect pP' and pE' after subsequent exec()s. It also does not affect pI,
and exec() does not constrain pI'. So to really start a shell with no way
of regain CAP_MKNOD, you would do
prctl(PR_CAPBSET_DROP, CAP_MKNOD);
cap_t cap = cap_get_proc();
cap_value_t caparray[1];
caparray[0] = CAP_MKNOD;
cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
cap_set_proc(cap);
cap_free(cap);
The following test program will get and set the bounding
set (but not pI). For instance
./bset get
(lists capabilities in bset)
./bset drop cap_net_raw
(starts shell with new bset)
(use capset, setuid binary, or binary with
file capabilities to try to increase caps)
************************************************************
cap_bound.c
************************************************************
#include <sys/prctl.h>
#include <linux/capability.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef PR_CAPBSET_READ
#define PR_CAPBSET_READ 23
#endif
#ifndef PR_CAPBSET_DROP
#define PR_CAPBSET_DROP 24
#endif
int usage(char *me)
{
printf("Usage: %s get\n", me);
printf(" %s drop <capability>\n", me);
return 1;
}
#define numcaps 32
char *captable[numcaps] = {
"cap_chown",
"cap_dac_override",
"cap_dac_read_search",
"cap_fowner",
"cap_fsetid",
"cap_kill",
"cap_setgid",
"cap_setuid",
"cap_setpcap",
"cap_linux_immutable",
"cap_net_bind_service",
"cap_net_broadcast",
"cap_net_admin",
"cap_net_raw",
"cap_ipc_lock",
"cap_ipc_owner",
"cap_sys_module",
"cap_sys_rawio",
"cap_sys_chroot",
"cap_sys_ptrace",
"cap_sys_pacct",
"cap_sys_admin",
"cap_sys_boot",
"cap_sys_nice",
"cap_sys_resource",
"cap_sys_time",
"cap_sys_tty_config",
"cap_mknod",
"cap_lease",
"cap_audit_write",
"cap_audit_control",
"cap_setfcap"
};
int getbcap(void)
{
int comma=0;
unsigned long i;
int ret;
printf("i know of %d capabilities\n", numcaps);
printf("capability bounding set:");
for (i=0; i<numcaps; i++) {
ret = prctl(PR_CAPBSET_READ, i);
if (ret < 0)
perror("prctl");
else if (ret==1)
printf("%s%s", (comma++) ? ", " : " ", captable[i]);
}
printf("\n");
return 0;
}
int capdrop(char *str)
{
unsigned long i;
int found=0;
for (i=0; i<numcaps; i++) {
if (strcmp(captable[i], str) == 0) {
found=1;
break;
}
}
if (!found)
return 1;
if (prctl(PR_CAPBSET_DROP, i)) {
perror("prctl");
return 1;
}
return 0;
}
int main(int argc, char *argv[])
{
if (argc<2)
return usage(argv[0]);
if (strcmp(argv[1], "get")==0)
return getbcap();
if (strcmp(argv[1], "drop")!=0 || argc<3)
return usage(argv[0]);
if (capdrop(argv[2])) {
printf("unknown capability\n");
return 1;
}
return execl("/bin/bash", "/bin/bash", NULL);
}
************************************************************
[serue@us.ibm.com: fix typo]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>a
Signed-off-by: "Serge E. Hallyn" <serue@us.ibm.com>
Tested-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch supports legacy (32-bit) capability userspace, and where possible
translates 32-bit capabilities to/from userspace and the VFS to 64-bit
kernel space capabilities. If a capability set cannot be compressed into
32-bits for consumption by user space, the system call fails, with -ERANGE.
FWIW libcap-2.00 supports this change (and earlier capability formats)
http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/
[akpm@linux-foundation.org: coding-syle fixes]
[akpm@linux-foundation.org: use get_task_comm()]
[ezk@cs.sunysb.edu: build fix]
[akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
[akpm@linux-foundation.org: unused var]
[serue@us.ibm.com: export __cap_ symbols]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add vm.highmem_is_dirtyable toggle
A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
approximately 2Gb size which contains a hash format that is written
randomly by the dbclean process. On 2.6.16 this process took a few
minutes. With lowmem only accounting of dirty ratios, this takes about 12
hours of 100% disk IO, all random writes.
Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add the highmem back to the total available memory count.
[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
Signed-off-by: Bron Gondwana <brong@fastmail.fm>
Cc: Ethan Solomita <solo@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(with Martin Schwidefsky <schwidefsky@de.ibm.com>)
The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument. The free functions do not get the mm_struct argument. This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.
[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-syle fixes]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add comments explaing how drain_pages() works.
- Eliminate useless functions
- Rename drain_all_local_pages to drain_all_pages(). It does drain
all pages not only those of the local processor.
- Eliminate useless interrupt off / on sequences. drain_pages()
disables interrupts on its own. The execution thread is
pinned to processor by the caller. So there is no need to
disable interrupts.
- Put drain_all_pages() declaration in gfp.h and remove the
declarations from suspend.h and from mm/memory_hotplug.c
- Make software suspend call drain_all_pages(). The draining
of processor local pages is may not the right approach if
software suspend wants to support SMP. If they call drain_all_pages
then we can make drain_pages() static.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the new timerfd API as it is implemented by the following patch:
int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);
The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not
NULL).
The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or
{0, 0} if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are
supported (with the same interface). Here's a simple test program I used to
exercise the new timerfd APIs:
http://www.xmailserver.org/timerfd-test2.c
[akpm@linux-foundation.org: coding-style cleanups]
[akpm@linux-foundation.org: fix ia64 build]
[akpm@linux-foundation.org: fix m68k build]
[akpm@linux-foundation.org: fix mips build]
[akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
[heiko.carstens@de.ibm.com: fix s390]
[akpm@linux-foundation.org: fix powerpc build]
[akpm@linux-foundation.org: fix sparc64 more]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Roland pointed out, we have the very old problem with exec. de_thread()
sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
clears signal->flags. All signals (even fatal ones) sent in this window
(which is not too small) will be lost.
With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(),
the new helper, should be used to detect exit_group() or exec() in progress.
It can have more users, but this patch does only strictly necessary changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Holt <holt@sgi.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every time we set SIGNAL_GROUP_EXIT or clear SIGNAL_STOP_DEQUEUED we also
reset ->group_stop_count.
This means that the SIGNAL_GROUP_EXIT check in handle_group_stop() is not
needed, and do_signal_stop() should check SIGNAL_STOP_DEQUEUED only when
->group_stop_count == 0. With these changes handle_group_stop() becomes the
subset of do_signal_stop(), we can kill it and use do_signal_stop() instead.
Also, a preparation for the next patch.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Holt <holt@sgi.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When __group_complete_signal() sees sig_kernel_coredump() signal, it starts
the group stop, but sets ->group_exit_task = t in a hope that "t" will
actually dequeue this signal and invoke do_coredump(). However, by the
time "t" enters get_signal_to_deliver() it is possible that the signal was
blocked/ignored or we have another pending !SIG_KERNEL_COREDUMP_MASK signal
which will be dequeued first. This means the task could be stopped but not
killed.
Remove this code from __group_complete_signal(). Note also this patch
removes the bogus signal_wake_up(t, 1). This thread can't be
STOPPED/TRACED, note the corresponding check in wants_signal().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Holt <holt@sgi.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ulrich says that we never used this clone flags and that nothing should be
using it.
As we're down to only a single bit left in clone's flags argument, let's add a
warning to check that no userspace is actually using it. Hopefully we will
be able to recycle it.
Roland said:
CLONE_STOPPED was previously used by some NTPL versions when under
thread_db (i.e. only when being actively debugged by gdb), but not for a
long time now, and it never worked reliably when it was used. Removing it
seems fine to me.
[akpm@linux-foundation.org: it looks like CLONE_DETACHED is being used]
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild:
scsi: fix dependency bug in aic7 Makefile
kbuild: add svn revision information to setlocalversion
kbuild: do not warn about __*init/__*exit symbols being exported
Move Kconfig.instrumentation to arch/Kconfig and init/Kconfig
Add HAVE_KPROBES
Add HAVE_OPROFILE
Create arch/Kconfig
Fix ARM to play nicely with generic Instrumentation menu
kconfig: ignore select of unknown symbol
kconfig: mark config as changed when loading an alternate config
kbuild: Spelling/grammar fixes for config DEBUG_SECTION_MISMATCH
Remove __INIT_REFOK and __INITDATA_REFOK
kbuild: print only total number of section mismatces found
Drivers that register a ->fault handler, but do not range-check the
offset argument, must set VM_DONTEXPAND in the vm_flags in order to
prevent an expanding mremap from overflowing the resource.
I've audited the tree and attempted to fix these problems (usually by
adding VM_DONTEXPAND where it is not obvious).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The file exit.c contains one useless extern declaration of sem_exit().
Moreover, it refers to nothing.
This trivial patch removes it.
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Linus:
On the per-architecture side, I do think it would be better to *not* have
internal architecture knowledge in a generic file, and as such a line like
depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32
really shouldn't exist in a file like kernel/Kconfig.instrumentation.
It would be much better to do
depends on ARCH_SUPPORTS_KPROBES
in that generic file, and then architectures that do support it would just
have a
bool ARCH_SUPPORTS_KPROBES
default y
in *their* architecture files. That would seem to be much more logical,
and is readable both for arch maintainers *and* for people who have no
clue - and don't care - about which architecture is supposed to support
which interface...
Changelog:
Actually, I know I gave this as the magic incantation, but now that I see
it, I realize that I should have told you to just use
config KPROBES_SUPPORT
def_bool y
instead, which is a bit denser.
We seem to use both kinds of syntax for these things, but this is really
what "def_bool" is there for...
- Use HAVE_KPROBES
- Use a select
- Yet another update :
Moving to HAVE_* now.
- Update ARM for kprobes support.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Linus:
On the per-architecture side, I do think it would be better to *not* have
internal architecture knowledge in a generic file, and as such a line like
depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32
really shouldn't exist in a file like kernel/Kconfig.instrumentation.
It would be much better to do
depends on ARCH_SUPPORTS_KPROBES
in that generic file, and then architectures that do support it would just
have a
bool ARCH_SUPPORTS_KPROBES
default y
in *their* architecture files. That would seem to be much more logical,
and is readable both for arch maintainers *and* for people who have no
clue - and don't care - about which architecture is supposed to support
which interface...
Changelog:
Actually, I know I gave this as the magic incantation, but now that I see
it, I realize that I should have told you to just use
config ARCH_SUPPORTS_KPROBES
def_bool y
instead, which is a bit denser.
We seem to use both kinds of syntax for these things, but this is really
what "def_bool" is there for...
Changelog :
- Moving to HAVE_*.
- Add AVR32 oprofile.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
The conflicting commit for
move-kconfiginstrumentation-to-arch-kconfig-and-init-kconfig.patch
is the ARM fix from Linus :
commit 38ad9aebe7
He just seemed to agree that my approach (just putting the missing ARM
config options in arch/arm/Kconfig) works too. The main advantage it has
is that it is smaller, does not need a cleanup in the future and does
not break the following patches unnecessarily.
It's just been discussed here
http://lkml.org/lkml/2008/1/15/267
However, Linus might prefer to stay with his own patch and I would
totally understand it that late in the release cycle. Therefore I submit
this for the next release cycle.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
CC: Russell King <rmk+lkml@arm.linux.org.uk>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
kernel/ksysfs.c seems to be a random dumping group for misc globals
that the rest of the tree depend on. This has caused problems with
exports in the past when sysfs is disabled, which can already be
observed in commit-id 51107301b6.
The latest one is the kernel_kobj usage, which presently results in:
fs/built-in.o: In function `debugfs_init':
inode.c:(.init.text+0xc34): undefined reference to `kernel_kobj'
make: *** [.tmp_vmlinux1] Error 1
kernel/ksysfs.c itself at this point only contains globals and some
basic sysfs initialization, the sysfs initialization code is optimized
out when we build with sysfs disabled. Given that, it's easier to just
build in unconditionally, rather than trying to find some other random
place to dump and initialize the globals.
Additionally, the current trend seems to be decoupling of kobjects from
sysfs, in which case it still makes sense to perform the kernel_kobj
initialization that happens here even if sysfs is disabled, as
lib/kobject.o is built-in unconditionally.
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'suspend' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (38 commits)
suspend: cleanup reference to swsusp_pg_dir[]
PM: Remove obsolete /sys/devices/.../power/state docs
Hibernation: Invoke suspend notifications after console switch
Suspend: Invoke suspend notifications after console switch
Suspend: Clean up suspend_64.c
Suspend: Add config option to disable the freezer if architecture wants that
ACPI: Print message before calling _PTS
ACPI hibernation: Call _PTS before suspending devices
Hibernation: Introduce begin() and end() callbacks
ACPI suspend: Call _PTS before suspending devices
ACPI: Separate disabling of GPEs from _PTS
ACPI: Separate invocations of _GTS and _BFS from _PTS and _WAK
Suspend: Introduce begin() and end() callbacks
suspend: fix ia64 allmodconfig build
ACPI: clear GPE earily in resume to avoid warning
Suspend: Clean up Kconfig (V2)
Hibernation: Clean up Kconfig (V2)
Hibernation: Update messages
Suspend: Use common prefix in messages
Hibernation: Remove unnecessary variable declaration
...
Rafael J. Wysocki reported weird, multi-seconds delays during
suspend/resume and bisected it back to:
commit 82a1fcb902
Author: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 25 21:08:02 2008 +0100
softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks
fix it:
- restore the old wakeup mechanism
- fix break usage in do_each_thread() { } while_each_thread().
- fix the hotplug switch stmt, a fall-through case was broken.
Bisected-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following the recent change in the suspend code path, switch consoles before
calling PM notifiers during hibernation.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
In order to fix APM emulation it is necessary to enable apm-emulation
notifications for suspends triggered in various ways via the suspend
notifiers. However, this will cause the systems using APM emulation
to lock up between X being needed to switch away from the VT and X
already waiting for resume in the APM ioctl.
This patch moves the console switch (if enabled) before the suspend
notification (and after the resume notification) to avoid this issue.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
This patch makes the freezer optional for suspend to allow the
system to work (or not work) like the original PMU suspend.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Introduce global hibernation callback .end() and rename global
hibernation callback .start() to .begin(), in analogy with the
recent modifications of the global suspend callbacks.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
On ACPI systems the target state set by acpi_pm_set_target() is
reset by acpi_pm_finish(), but that need not be called if the
suspend fails. All platforms that use the .set_target() global
suspend callback are affected by analogous issues.
For this reason, we need an additional global suspend callback that
will reset the target state regardless of whether or not the suspend
is successful. Also, it is reasonable to rename the .set_target()
callback, since it will be used for a different purpose on ACPI
systems (due to ACPI 1.0x code ordering requirements).
Introduce the global suspend callback .end() to be executed at the
end of the suspend sequence and rename the .set_target() global
suspend callback to .begin().
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
kernel/power/main.c:488: error: ‘pm_test_attr’ undeclared here (not in a function)
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
This cleans up the suspend Kconfig and removes the need to
declare centrally which architectures support suspend. All
architectures that currently support suspend are modified
accordingly.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
This cleans up the hibernation Kconfig and removes the need to
declare centrally which architectures support hibernation. All
architectures that currently support hibernation are modified
accordingly.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Make hibernation messages start with one common prefix "PM: " and use
the word "hibernation" in the messages as a synonym of "suspend to
disk".
Turn some KERN_INFO messages into debug ones.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Make suspend messages start with one common prefix "PM: ".
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Remove the unnecessary extern declaration of resume_file[]
from kernel/power/swap.c .
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Fix a comment in kernel/power/disk.c so that it doesn't contain lines
longer that 80 characters.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Fix a comment in kernel/power/main.c so that it doesn't contain lines
longer that 80 characters.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Move the low level restore code to kernel/power/disk.c , since the
corresponding low level hibernation code is already there.
Make restore fail if device_power_down(PMSG_PRETHAW) returns an
error.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Suspend: Make debug facility depend on CONFIG_SUSPEND
Make the new suspend debug facility code depend on CONFIG_SUSPEND,
as appropriate, to remove the compiler warning printed when CONFIG_PM is set
and CONFIG_SUSPEND is not set.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
This patch (as1008b) converts the PM notifier routines from inline
calls to out-of-line code. It also prevents pm_chain_head from
being created when CONFIG_PM_SLEEP isn't enabled, and EXPORTs the
notifier registration and unregistration routines.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
When trying to debug a suspend failure I started implementing
PM_TRACE for powerpc. I then noticed that I'm debugging a suspend
failure and so PM_TRACE isn't useful at all, but thought that
nonetheless this could be useful in the future.
Basically, to support PM_TRACE, you add a Kconfig option that
selects PM_TRACE and provides the infrastructure as per the
help text of PM_TRACE.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Make it possible to test the hibernation core code with the help of the
/sys/power/pm_test attribute introduced for suspend testing in the previous
patch.
Writing an appropriate string to this file causes the hibernation code to work
in one of the test modes defined as follows:
freezer
- test the freezing of processes
devices
- test the freezing of processes and suspending of devices
platform
- test the freezing of processes, suspending of devices and platform global
control methods(*)
processors
- test the freezing of processes, suspending of devices, platform global
control methods(*) and the disabling of nonboot CPUs
core
- test the freezing of processes, suspending of devices, platform global
control methods(*), the disabling of nonboot CPUs and suspending of
platform/system devices
(*) - the platform global control methods are only available on ACPI systems
and are only tested if the hibernation mode is set to "platform"
Then, if a hibernation is started by normal means, the hibernation core will
perform its normal operations up to the point indicated by given test level.
Next, it will wait for 5 seconds and carry out the resume operations needed to
transition the system back to the fully functional state.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Introduce sysfs attribute /sys/power/pm_test allowing one to test the suspend
core code. Namely, writing one of the strings:
freezer
devices
platform
processors
core
to this file causes the suspend code to work in one of the test modes defined as
follows:
freezer
- test the freezing of processes
devices
- test the freezing of processes and suspending of devices
platform
- test the freezing of processes, suspending of devices and platform global
control methods(*)
processors
- test the freezing of processes, suspending of devices, platform global
control methods and the disabling of nonboot CPUs
core
- test the freezing of processes, suspending of devices, platform global
control methods, the disabling of nonboot CPUs and suspending of
platform/system devices
(*) These are ACPI global control methods on ACPI systems
Then, if a suspend is started by normal means, the suspend core will perform
its normal operations up to the point indicated by given test level. Next, it
will wait for 5 seconds and carry out the resume operations needed to transition
the system back to the fully functional state.
Writing "none" to /sys/power/pm_test turns the testing off.
When open for reading, /sys/power/pm_test contains a space-separated list of all
available tests (including "none" that represents the normal functionality) in
which the current test level is indicated by square brackets.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Add PM_RESTORE_PREPARE and PM_POST_RESTORE notifiers to the PM core, to be used
in analogy with the existing PM_HIBERNATION_PREPARE and PM_POST_HIBERNATION
notifiers.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
This patch moves the prototypes of count_highmem_pages() and
restore_highmem() to kernel/power/power.h
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Move the definitions of hibernation ioctls to a separate header file in
include/linux, which can be exported to the user space.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Three ioctl numbers belonging to the hibernation userland interface,
SNAPSHOT_ATOMIC_SNAPSHOT, SNAPSHOT_AVAIL_SWAP, SNAPSHOT_GET_SWAP_PAGE,
are defined in a wrong way (eg. not portable). Provide new ioctl numbers for
these ioctls and mark the existing ones as deprecated.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Mark the SNAPSHOT_SET_SWAP_FILE ioctl belonging to the hibernation userland
interface as deprecated.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Modify the hibernation userland interface by adding two new ioctls to it,
SNAPSHOT_PLATFORM_SUPPORT and SNAPSHOT_POWER_OFF, that can be used,
respectively, to switch the hibernation platform support on/off and to make the
kernel transition the system to the hibernation state (eg. ACPI S4) using the
platform (eg. ACPI) driver.
These ioctls are intended to replace the misdesigned SNAPSHOT_PMOPS ioctl,
which from now is regarded as obsolete and will be removed in the future.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Add a new ioctl, SNAPSHOT_GET_IMAGE_SIZE, returning the size of the (just
created) hibernation image, to the hibernation userland interface.
This ioctl is necessary so that the userland utilities using the interface need
not access the hibernation image header, owned by the kernel, in order to obtain
the size of the image.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'audit.b46' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
[AUDIT] Add uid, gid fields to ANOM_PROMISCUOUS message
[AUDIT] ratelimit printk messages audit
[patch 2/2] audit: complement va_copy with va_end()
[patch 1/2] kernel/audit.c: warning fix
[AUDIT] create context if auditing was ever enabled
[AUDIT] clean up audit_receive_msg()
[AUDIT] make audit=0 really stop audit messages
[AUDIT] break large execve argument logging into smaller messages
[AUDIT] include audit type in audit message when using printk
[AUDIT] do not panic on exclude messages in audit_log_pid_context()
[AUDIT] Add End of Event record
[AUDIT] add session id to audit messages
[AUDIT] collect uid, loginuid, and comm in OBJ_PID records
[AUDIT] return EINTR not ERESTART*
[PATCH] get rid of loginuid races
[PATCH] switch audit_get_loginuid() to task_struct *
some printk messages from the audit system can become excessive. This
patch ratelimits those messages. It was found that messages, such as
the audit backlog lost printk message could flood the logs to the point
that a machine could take an nmi watchdog hit or otherwise become
unresponsive.
Signed-off-by: Eric Paris <eparis@redhat.com>
Complement va_copy() with va_end().
Signed-off-by: Richard Knutsson <ricknu-0@student.ltu.se>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kernel/audit.c: In function 'audit_log_start':
kernel/audit.c:1133: warning: 'serial' may be used uninitialized in this function
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Disabling audit at runtime by auditctl doesn't mean that we can
stop allocating contexts for new processes; we don't want to miss them
when that sucker is reenabled.
(based on work from Al Viro in the RHEL kernel series)
Signed-off-by: Eric Paris <eparis@redhat.com>
generally clean up audit_receive_msg() don't free random memory if
selinux_sid_to_string fails for some reason. Move generic auditing
to a helper function
Signed-off-by: Eric Paris <eparis@redhat.com>
Some audit messages (namely configuration changes) are still emitted even if
the audit subsystem has been explicitly disabled. This patch turns those
messages off as well.
Signed-off-by: Eric Paris <eparis@redhat.com>
execve arguments can be quite large. There is no limit on the number of
arguments and a 4G limit on the size of an argument.
this patch prints those aruguments in bite sized pieces. a userspace size
limitation of 8k was discovered so this keeps messages around 7.5k
single arguments larger than 7.5k in length are split into multiple records
and can be identified as aX[Y]=
Signed-off-by: Eric Paris <eparis@redhat.com>
Currently audit drops the audit type when an audit message goes through
printk instead of the audit deamon. This is a minor annoyance in
that the audit type is no longer part of the message and the information
the audit type conveys needs to be carried in, or derived from the
message data.
The attached patch includes the type number as part of the printk.
Admittedly it isn't the type name that the audit deamon provides but I
think this is better than dropping the type completely.
Signed-pff-by: John Johansen <jjohansen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
If we fail to get an ab in audit_log_pid_context this may be due to an exclude
rule rather than a memory allocation failure. If it was due to a memory
allocation failue we would have already paniced and no need to do it again.
Signed-off-by: Eric Paris <eparis@redhat.com>
This patch adds an end of event record type. It will be sent by the kernel as
the last record when a multi-record event is triggered. This will aid realtime
analysis programs since they will now reliably know they have the last record
to complete an event. The audit daemon filters this and will not write it to
disk.
Signed-off-by: Steve Grubb <sgrubb redhat com>
Signed-off-by: Eric Paris <eparis@redhat.com>
In order to correlate audit records to an individual login add a session
id. This is incremented every time a user logs in and is included in
almost all messages which currently output the auid. The field is
labeled ses= or oses=
Signed-off-by: Eric Paris <eparis@redhat.com>
Add uid, loginuid, and comm collection to OBJ_PID records. This just
gives users a little more information about the task that received a
signal. pid is rather meaningless after the fact, and even though comm
isn't great we can't collect exe reasonably on this code path for
performance reasons.
Signed-off-by: Eric Paris <eparis@redhat.com>
The syscall exit code will change ERESTART* kernel internal return codes
to EINTR if it does not restart the syscall. Since we collect the audit
info before that point we should fix those in the audit log as well.
Signed-off-by: Eric Paris <eparis@redhat.com>
Keeping loginuid in audit_context is racy and results in messier
code. Taken to task_struct, out of the way of ->audit_context
changes.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To allow the implementation of optimized rw-locks in user space, glibc
needs a possibility to select waiters for wakeup depending on a bitset
mask.
This requires two new futex OPs: FUTEX_WAIT_BITS and FUTEX_WAKE_BITS
These OPs are basically the same as FUTEX_WAIT and FUTEX_WAKE plus an
additional argument - a bitset. Further the FUTEX_WAIT_BITS OP is
expecting an absolute timeout value instead of the relative one, which
is used for the FUTEX_WAIT OP.
FUTEX_WAIT_BITS calls into the kernel with a bitset. The bitset is
stored in the futex_q structure, which is used to enqueue the waiter
into the hashed futex waitqueue.
FUTEX_WAKE_BITS also calls into the kernel with a bitset. The wakeup
function logically ANDs the bitset with the bitset stored in each
waiters futex_q structure. If the result is zero (i.e. none of the set
bits in the bitsets is matching), then the waiter is not woken up. If
the result is not zero (i.e. one of the set bits in the bitsets is
matching), then the waiter is woken.
The bitset provided by the caller must be non zero. In case the
provided bitset is zero the kernel returns EINVAL.
Internaly the new OPs are only extensions to the existing FUTEX_WAIT
and FUTEX_WAKE functions. The existing OPs hand a bitset with all bits
set into the futex_wait() and futex_wake() functions.
Signed-off-by: Thomas Gleixner <tgxl@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The WARN_ON() in the fixup return path of futex_lock_pi() can
trigger with false positives.
The following scenario happens:
t1 holds the futex and t2 and t3 are blocked on the kernel side rt_mutex.
t1 releases the futex (and the rt_mutex) and assigned t2 to be the next
owner of the futex.
t2 is interrupted and returns w/o acquiring the rt_mutex, before t1 can
release the rtmutex.
t1 releases the rtmutex and t3 becomes the pending owner of the rtmutex.
t2 notices that it is the designated owner (user space variable) and
fails to acquire the rt_mutex via trylock, because it is not allowed to
steal the rt_mutex from t3. Now it looks at the rt_mutex pending owner (t3)
and assigns the futex and the pi_state to it.
During the fixup t4 steals the rtmutex from t3.
t2 returns from the fixup and the owner of the rt_mutex has changed from
t3 to t4.
There is no need to do another round of fixups from t2. The important
part (t2 is not returning as the user space visible owner) is
done. The further fixups are done, before either t3 or t4 return to
user space.
For the user space it is not relevant which task (t3 or t4) is the real
owner, as long as those are both in the kernel, which is guaranteed by
the serialization of the hash bucket lock. Both tasks (which ever returns
first to userspace - t4 because it locked the rt_mutex or t3 due to a signal)
are going through the lock_futex_pi() return path where the ownership is
fixed before the return to user space.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To allow better diagnosis of tick-sched related, especially NOHZ
related problems, we need to know when the last wakeup via an irq
happened and when the CPU left the idle state.
Add two fields (idle_waketime, idle_exittime) to the tick_sched
structure and add them to the timer_list output.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
xtime_cache needs to be updated whenever xtime and or wall_to_monotic
are changed. Otherwise users of xtime_cache might see a stale (and in
the case of timezone changes utterly wrong) value until the next
update happens.
Fixup the obvious places, which miss this update.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <johnstul@us.ibm.com>
Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this patch:
commit 37bb6cb409
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Jan 25 21:08:32 2008 +0100
hrtimer: unlock hrtimer_wakeup
Broke hrtimer_init_sleeper() users. It forgot to fix up the futex
caller of this function to detect the failed queueing and messed up
the do_nanosleep() caller in that it could leak a TASK_INTERRUPTIBLE
state.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The recent UDP patch exposed this bug in the audit code. It
was calling pskb_expand_head without increasing skb->truesize.
The caller of pskb_expand_head needs to do so because that function
is designed to be called in places where truesize is already fixed
and therefore it doesn't update its value.
Because the audit system is using it in a place where the truesize
has not yet been fixed, it needs to update its value manually.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
It may be used by the modules nfs.ko and sunrpc.ko
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ Made it a regular export rather than GPL-only - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits)
Remove commented-out code copied from NFS
NFS: Switch from intr mount option to TASK_KILLABLE
Add wait_for_completion_killable
Add wait_event_killable
Add schedule_timeout_killable
Use mutex_lock_killable in vfs_readdir
Add mutex_lock_killable
Use lock_page_killable
Add lock_page_killable
Add fatal_signal_pending
Add TASK_WAKEKILL
exit: Use task_is_*
signal: Use task_is_*
sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
ptrace: Use task_is_*
power: Use task_is_*
wait: Use TASK_NORMAL
proc/base.c: Use task_is_*
proc/array.c: Use TASK_REPORT
perfmon: Use task_is_*
...
Fixed up conflicts in NFS/sunrpc manually..
i was debugging early crashes and wondered where all the printks
went. The reason: ignore_loglevel_setup() was not called yet ...
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This removes the extra struct task_struct *p parameter in inc_nr_running
and dec_nr_running functions.
Signed-off by: Jerry Stralko <gerb.stralko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Michel Dänzr has bisected an interactivity problem with
plus-reniced tasks back to this commit:
810e95ccd5 is first bad commit
commit 810e95ccd5
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Oct 15 17:00:14 2007 +0200
sched: another wakeup_granularity fix
unit mis-match: wakeup_gran was used against a vruntime
fix this by assymetrically scaling the vtime of positive reniced
tasks.
Bisected-by: Michel Dänzer <michel@tungstengraphics.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The reason why we are getting better wakeup latencies for
!FAIR_USER_SCHED is because of this snippet of code in place_entity():
if (!initial) {
/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
^^^^^^^^^^^^^^^^^^
vruntime -= sysctl_sched_latency;
/* ensure we never gain time by being placed backwards. */
vruntime = max_vruntime(se->vruntime, vruntime);
}
NEW_FAIR_SLEEPERS feature gives credit for sleeping only to tasks and
not group-level entities. With the patch attached, I could see that
wakeup latencies with FAIR_USER_SCHED are restored to the same level as
!FAIR_USER_SCHED.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
these bugs are harder to find than they seem, a stackdump helps.
make it dependent on CONFIG_DEBUG_SHIRQ so that people can turn it off
if it annoys them.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
During the work on the x86 32 and 64 bit backtrace code I found it useful
to have a simple test module to test a process and irq context backtrace.
Since the existing backtrace code was buggy, I figure it might be useful
to have such a test module in the kernel so that maybe we can even
detect such bugs earlier..
[ mingo@elte.hu: build fix ]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Here is a quick and naive smoke test for kprobes. This is intended to
just verify if some unrelated change broke the *probes subsystem. It is
self contained, architecture agnostic and isn't of any great use by itself.
This needs to be built in the kernel and runs a basic set of tests to
verify if kprobes, jprobes and kretprobes run fine on the kernel. In case
of an error, it'll print out a message with a "BUG" prefix.
This is a start; we intend to add more tests to this bucket over time.
Thanks to Jim Keniston and Masami Hiramatsu for comments and suggestions.
Tested on x86 (32/64) and powerpc.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unlike oopses, WARN_ON() currently does't print the loaded modules list.
This makes it harder to take action on certain bug reports. For example,
recently there were a set of WARN_ON()s reported in the mac80211 stack,
which were just signalling a driver bug. It takes then anther round trip
to the bug reporter (if he responds at all) to find out which driver
is at fault.
Another issue is that, unlike oopses, WARN_ON() doesn't currently printk
the helpful "cut here" line, nor the "end of trace" marker.
Now that WARN_ON() is out of line, the size increase due to this is
minimal and it's worth adding.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A quick grep shows that there are currently 1145 instances of WARN_ON
in the kernel. Currently, WARN_ON is pretty much entirely inlined,
which makes it hard to enhance it without growing the size of the kernel
(and getting Andrew unhappy).
This patch build on top of Olof's patch that introduces __WARN,
and places the slowpath out of line. It also uses Ingo's suggestion
to not use __FUNCTION__ but to use kallsyms to do the lookup;
this saves a ton of extra space since gcc doesn't need to store the function
string twice now:
3936367 833603 624736 5394706 525112 vmlinux.before
3917508 833603 624736 5375847 520767 vmlinux-slowpath
15Kb savings...
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Olof Johansson <olof@lixom.net>
Acked-by: Matt Meckall <mpm@selenic.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is useful to debug problems with interrupt handlers that return
sometimes IRQ_NONE.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This allows to change them at runtime using sysfs. No need to
reboot to set them.
I only added aliases (kernel.noirqdebug etc.) so the old options
still work.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This adds a generic definition of compat_sys_ptrace that calls
compat_arch_ptrace, parallel to sys_ptrace/arch_ptrace. Some
machines needing this already define a function by that name.
The new generic function is defined only on machines that
put #define __ARCH_WANT_COMPAT_SYS_PTRACE into asm/ptrace.h.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This adds a compat_ptrace_request that is the analogue of ptrace_request
for the things that 32-on-64 ptrace implementations can share in common.
So far there are just a couple of requests handled generically.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This makes ptrace_request handle {PEEK,POKE}{TEXT,DATA} directly.
Every arch_ptrace that could call generic_ptrace_peekdata already
has a default case calling ptrace_request, so this keeps things
simpler for the arch code.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The break_lock data structure and code for spinlocks is quite nasty.
Not only does it double the size of a spinlock but it changes locking to
a potentially less optimal trylock.
Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
__raw_spin_is_contended that uses the lock data itself to determine whether
there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
not set.
Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
decouple it from the spinlock implementation, and make it typesafe (rwlocks
do not have any need_lockbreak sites -- why do they even get bloated up
with that break_lock then?).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We have a lot of code which differs only by the naming of specific
members of structures that contain registers. In order to enable
additional unifications, this patch drops the e- or r- size prefix
from the register names in struct pt_regs, and drops the x- prefixes
for segment registers on the 32-bit side.
This patch also performs the equivalent renames in some additional
places that might be candidates for unification in the future.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This makes ptrace_request handle PTRACE_SINGLEBLOCK along with
PTRACE_CONT et al. The new generic code makes use of the
arch_has_block_step macro and generic entry points on machines
that define them.
[ mingo@elte.hu: bugfix ]
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This makes ptrace_request handle all the ptrace requests that wake
up the traced task. These do low-level ptrace implementation magic
that is not arch-specific and should be kept out of arch code. The
implementations on each arch usually do the same thing. The new
generic code makes use of the arch_has_single_step macro and generic
entry points to handle PTRACE_SINGLESTEP.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
various changes to the in_p/out_p delay details:
- add the io_delay=none method
- make each method selectable from the kernel config
- simplify the delay code a bit by getting rid of an indirect function call
- add the /proc/sys/kernel/io_delay_type sysctl
- change 'io_delay=standard|alternate' to io_delay=0x80 and io_delay=0xed
- make the io delay config not depend on CONFIG_DEBUG_KERNEL
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: "David P. Reed" <dpreed@reed.com>
Current idle time in kstat is based on jiffies and is coarse grained.
tick_sched.idle_sleeptime is making some attempt to keep track of idle time
in a fine grained manner. But, it is not handling the time spent in
interrupts fully.
Make tick_sched.idle_sleeptime accurate with respect to time spent on
handling interrupts and also add tick_sched.idle_lastupdate, which keeps
track of last time when idle_sleeptime was updated.
This statistics will be crucial for cpufreq-ondemand governor, which can
shed some conservative gaurd band that is uses today while setting the
frequency. The ondemand changes that uses the exact idle time is coming
soon.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I recently noticed on one of my boxes that when synched with an NTP
server, the drift value reported for the system was ~283ppm. While in
some cases, clock hardware can be that bad, it struck me as unusual as
the system was using the acpi_pm clocksource, which is one of the more
trustworthy and accurate clocksources on x86 hardware.
I brought up another system and let it sync to the same NTP server, and
I noticed a similar 280some ppm drift.
In looking at the code, I found that the acpi_pm's constant frequency
was being computed correctly at boot-up, however once the system was up,
even without the ntp daemon running, the clocksource's frequency was
being modified by the clocksource_adjust() function.
Digging deeper, I realized that in the code that keeps track of how much
the clocksource is skewing from the ntp desired time, we were using
different lengths to establish how long an time interval was.
The clocksource was being setup with the following interval:
NTP_INTERVAL_LENGTH = NSEC_PER_SEC/NTP_INTERVAL_FREQ
While the ntp code was using the tick_length_base value:
tick_length_base ~= (tick_usec * NSEC_PER_USEC * USER_HZ)
/NTP_INTERVAL_FREQ
The subtle difference is:
(tick_usec * NSEC_PER_USEC * USER_HZ) != NSEC_PER_SEC
This difference in calculation was causing the clocksource correction
code to apply a correction factor to the clocksource so the two
intervals were the same, however this results in the actual frequency of
the clocksource to be made incorrect. I believe this difference would
affect all clocksources, although to differing degrees depending on the
clocksource resolution.
The issue was introduced when my HZ free ntp patch landed in 2.6.21-rc1,
so my apologies for the mistake, and for not noticing it until now.
The following patch, corrects the clocksource's initialization code so
it uses the same interval length as the code in ntp.c. After applying
this patch, the drift value for the same system went from ~283ppm to
only 2.635ppm.
I believe this patch to be good, however it does affect all arches and
I've only tested on x86, so some caution is advised. I do think it would
be a likely candidate for a stable 2.6.24.x release.
Any thoughts or feedback would be appreciated.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
detect zero event-device multiplicators - they then cause
division-by-zero crashes if a clockevent has been initialized
incorrectly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On x86 the PIT might become an unusable clocksource. Add an unregister
function to provide a possibilty to remove the PIT from the list of
available clock sources.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This way it checks if the clocks are synchronized between CPUs too.
This might be able to detect slowly drifting TSCs which only
go wrong over longer time.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
clocksource_watchdog can use a deferrable timer - reduces wakeups from
idle per second.
Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
- getnstimeofday() was just a wrapper around __get_realtime_clock_ts()
- Replace calls to __get_realtime_clock_ts() by calls to getnstimeofday()
- Fix bogus reference to get_realtime_clock_ts(), which never existed
Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I was confused by FSEC = 10^15 NSEC statement, plus small whitespace
fixes. When there's copyright, there should be GPL.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Small cleanups to tick-related code. Wrong preempt count is followed
by BUG(), so it is hardly KERN_WARNING.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Clean up hungarian notation from timer code.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.25: (1470 commits)
[IPV6] ADDRLABEL: Fix double free on label deletion.
[PPP]: Sparse warning fixes.
[IPV4] fib_trie: remove unneeded NULL check
[IPV4] fib_trie: More whitespace cleanup.
[NET_SCHED]: Use nla_policy for attribute validation in ematches
[NET_SCHED]: Use nla_policy for attribute validation in actions
[NET_SCHED]: Use nla_policy for attribute validation in classifiers
[NET_SCHED]: Use nla_policy for attribute validation in packet schedulers
[NET_SCHED]: sch_api: introduce constant for rate table size
[NET_SCHED]: Use typeful attribute parsing helpers
[NET_SCHED]: Use typeful attribute construction helpers
[NET_SCHED]: Use NLA_PUT_STRING for string dumping
[NET_SCHED]: Use nla_nest_start/nla_nest_end
[NET_SCHED]: Propagate nla_parse return value
[NET_SCHED]: act_api: use PTR_ERR in tcf_action_init/tcf_action_get
[NET_SCHED]: act_api: use nlmsg_parse
[NET_SCHED]: act_api: fix netlink API conversion bug
[NET_SCHED]: sch_netem: use nla_parse_nested_compat
[NET_SCHED]: sch_atm: fix format string warning
[NETNS]: Add namespace for ICMP replying code.
...
When trying to load a module with the same name as a built-in one, a
scary kobject backtrace comes up. Prevent that from checking for this
condition and warning the user as to what exactly is going on.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The struct module taints member is supposed to store per-module taint
data. The kernel knows about certain specific external modules that will
taint the kernel, such as ndiswrapper. Use of ndiswrapper possibly
should set the per-module taint in addition to the global kernel
taint flag, unless we're arguing not because wrapper module itself
is not what actually causes the kernel to be tainted as such?
Signed-off-by: Jon Masters <jcm@jonmasters.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
the original code use KOBJ_NAME_LEN for built-in module name length,
that's defined to 20 in linux/kobject.h, but this is not enough appearntly,
many module names are longer than this;
#define KOBJ_NAME_LEN 20
another macro is MODULE_NAME_LEN defined in linux/module.h, I think this is
enough for module names:
#define MODULE_NAME_LEN (64 - sizeof(unsigned long))
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
module_address_lookup releases preemption then returns a pointer into
the module space. The only user (kallsyms) copies the result, so just
do that under the preempt disable.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we put the module in the linked list *before* calling into to, we
get the module name and functions in the OOPS (is_module_address can
find the module). It also helps lockdep in a similar way.
Acked-and-tested-by: Joern Engel <joern@lazybastard.org>
Tested-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
There have been reports of modules failing to load because the modules
they depend on are still loading. This changes the modules to wait
for a reasonable length of time in that case. We time out eventually,
because there can be module loops or broken modules.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Current code could cause a bug in symbol_put_addr() if an arch used
kmalloc module text: we might think the symbol belongs to the core
kernel.
The downside is that this might make backtraces through (discarded)
init functions harder to read on some archs, but we already have that
issue for modules and noone has complained.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I have removed all the entries from this table (core_table,
ipv4_table and tr_table), so now we can safely drop it.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the basic infrastructure for per namespace sysctls.
A list of lists of sysctl headers is added, allowing each namespace to have
it's own list of sysctl headers.
Each list of sysctl headers has a lookup function to find the first
sysctl header in the list, allowing the lists to have a per namespace
instance.
register_sysct_root is added to tell sysctl.c about additional
lists of sysctl_headers. As all of the users are expected to be in
kernel no unregister function is provided.
sysctl_head_next is updated to walk through the list of lists.
__register_sysctl_paths is added to add a new sysctl table on
a non-default sysctl list.
The only intrusive part of this patch is propagating the information
to decided which list of sysctls to use for sysctl_check_table.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
By doing this we allow users of register_sysctl_paths that build
and dynamically allocate their ctl_table to be simpler. This allows
them to just remember the ctl_table_header returned from
register_sysctl_paths from which they can now find the
ctl_table array they need to free.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a number of modules that register a sysctl table
somewhere deeply nested in the sysctl hierarchy, such as
fs/nfs, fs/xfs, dev/cdrom, etc.
They all specify several dummy ctl_tables for the path name.
This patch implements register_sysctl_path that takes
an additional path name, and makes up dummy sysctl nodes
for each component.
This patch was originally written by Olaf Kirch and
brought to my attention and reworked some by Olaf Hering.
I have changed a few additional things so the bugs are mine.
After converting all of the easy callers Olaf Hering observed
allyesconfig ARCH=i386, the patch reduces the final binary size by 9369 bytes.
.text +897
.data -7008
text data bss dec hex filename
26959310 4045899 4718592 35723801 2211a19 ../vmlinux-vanilla
26960207 4038891 4718592 35717690 221023a ../O-allyesconfig/vmlinux
So this change is both a space savings and a code simplification.
CC: Olaf Kirch <okir@suse.de>
CC: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
syslets (or other threads/processes that want io context sharing) can
set this to enforce sharing of io context.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
revert 19ef930927.
Kevin Winchester reported a lockup during X startup an bisected
it to this commit.
Reported-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Forgot to remove this when removing the appldata binary sysctls.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (200 commits)
[SCSI] usbstorage: use last_sector_bug flag universally
[SCSI] libsas: abstract STP task status into a function
[SCSI] ultrastor: clean up inline asm warnings
[SCSI] aic7xxx: fix firmware build
[SCSI] aacraid: fib context lock for management ioctls
[SCSI] ch: remove forward declarations
[SCSI] ch: fix device minor number management bug
[SCSI] ch: handle class_device_create failure properly
[SCSI] NCR5380: fix section mismatch
[SCSI] sg: fix /proc/scsi/sg/devices when no SCSI devices
[SCSI] IB/iSER: add logical unit reset support
[SCSI] don't use __GFP_DMA for sense buffers if not required
[SCSI] use dynamically allocated sense buffer
[SCSI] scsi.h: add macro for enclosure bit of inquiry data
[SCSI] sd: add fix for devices with last sector access problems
[SCSI] fix pcmcia compile problem
[SCSI] aacraid: add Voodoo Lite class of cards.
[SCSI] aacraid: add new driver features flags
[SCSI] qla2xxx: Update version number to 8.02.00-k7.
[SCSI] qla2xxx: Issue correct MBC_INITIALIZE_FIRMWARE command.
...
Right now, the linux kernel (with scheduler statistics enabled) keeps track
of the maximum time a process is waiting to be scheduled. While the maximum
is a very useful metric, tracking average and total is equally useful
(at least for latencytop) to figure out the accumulated effect of scheduler
delays. The accumulated effect is important to judge the performance impact
of scheduler tuning/behavior.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
print_cfs_stats is callable from interrupt context (sysrq), hence it should
not take mutexes. Change it to use RCU since the task group data is RCU
freed anyway.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The attached patch is something really simple that can sometimes help
in getting more info out of a hung system.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
printk timestamps: use ktime_get().
Some platforms have a functioning clocksource function only after
they are done with early bootup, so delay this until out of
SYSTEM_BOOTING state.
it's also inherently safe now, as any bugs in this area will be
caught by the printk recursion checks.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LatencyTOP kernel infrastructure; it measures latencies in the
scheduler and tracks it system wide and per process.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
looking at it one more time:
(1) it looks to me that there is no need to call
sched_rt_ratio_exceeded() from pick_next_rt_entity()
- [ for CONFIG_FAIR_GROUP_SCHED ] queues with rt_rq->rt_throttled are
not within this 'tree-like hierarchy' (or whatever we should call it
:-)
- there is also no need to re-check 'rt_rq->rt_time > ratio' at this
point as 'rt_rq->rt_time' couldn't have been increased since the last
call to update_curr_rt() (which obviously calls
sched_rt_ratio_esceeded())
well, it might be that 'ratio' for this rt_rq has been re-configured
(and the period over which this rt_rq was active has not yet been
finished)... but I don't think we should really take this into
account.
(2) now pick_next_rt_entity() must never return NULL, so let's change
pick_next_task_rt() accordingly.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched: fix rq->clock warps on frequency changes
Fix 2bacec8c31
(sched: touch softlockup watchdog after idling) that reintroduced warps
on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock
that checks rq->clock for warps, so call it after adjusting rq->clock.
Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ensure that the kernel threads are created with the usual nice level
and affinity even if kthreadd's properties were changed from the
default by root.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Based on a suggestion from Andi:
In various cases, the unload of a module may leave some bad state around
that causes a kernel crash AFTER a module is unloaded; and it's then hard
to find which module caused that.
This patch tracks the last unloaded module, and prints this as part of the
module list in the oops trace.
Right now, only the last 1 module is tracked; I expect that this is enough
for the vast majority of cases where this information matters; if it turns
out that tracking more is important, we can always extend it to that.
[ mingo@elte.hu: build fix ]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's rather common that an oops/WARN_ON/BUG happens during the load or
unload of a module. Unfortunatly, it's not always easy to see directly
which module is being loaded/unloaded from the oops itself. Worse,
it's not even always possible to ask the bug reporter, since there
are so many components (udev etc) that auto-load modules that there's
a good chance that even the reporter doesn't know which module this is.
This patch extends the existing "show if it's tainting" print code,
which is used as part of printing the modules in the oops/BUG/WARN_ON
to include a "+" for "being loaded" and a "-" for "being unloaded".
As a result this extension, the "taint_flags()" function gets renamed to
"module_flags()" (and takes a module struct as argument, not a taint
flags int).
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the curious logic to set it_sched_expires in the future. It useless
because rt.timeout wouldn't be incremented anyway.
Explicity check for RLIM_INFINITY as a test programm that had a 1s soft limit
and a inf hard limit would SIGKILL at 1s. This is because RLIM_INFINITY+d-1
is d-2.
Signed-off-by: Peter Zijlsta <a.p.zijlstra@chello.nl>
CC: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Only reschedule if the new group has a higher prio task.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
hrtimer_wakeup creates a
base->lock
rq->lock
lock dependancy. Avoid this by switching to HRTIMER_CB_IRQSAFE_NO_SOFTIRQ
which doesn't hold base->lock.
This fully untangles hrtimer locks from the scheduler locks, and allows
hrtimer usage in the scheduler proper.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently all highres=off timers are run from softirq context, but
HRTIMER_CB_IRQSAFE_NO_SOFTIRQ timers expect to run from irq context.
Fix this up by splitting it similar to the highres=on case.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to more easily allow for the scheduler to use timers, clean up
the locking a bit.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We need to teach no_hz about the rt throttling because its tick driven.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Extend group scheduling to also cover the realtime classes. It uses the time
limiting introduced by the previous patch to allow multiple realtime groups.
The hard time limit is required to keep behaviour deterministic.
The algorithms used make the realtime scheduler O(tg), linear scaling wrt the
number of task groups. This is the worst case behaviour I can't seem to get out
of, the avg. case of the algorithms can be improved, I focused on correctness
and worst case.
[ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Very simple time limit on the realtime scheduling classes.
Allow the rq's realtime class to consume sched_rt_ratio of every
sched_rt_period slice. If the class exceeds this quota the fair class
will preempt the realtime class.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use HR-timers (when available) to deliver an accurate preemption tick.
The regular scheduler tick that runs at 1/HZ can be too coarse when nice
level are used. The fairness system will still keep the cpu utilisation 'fair'
by then delaying the task that got an excessive amount of CPU time but try to
minimize this by delivering preemption points spot-on.
The average frequency of this extra interrupt is sched_latency / nr_latency.
Which need not be higher than 1/HZ, its just that the distribution within the
sched_latency period is important.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Why do we even have cond_resched when real preemption
is on? It seems to be a waste of space and time.
remove cond_resched with CONFIG_PREEMPT on.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a new rlimit that allows the user to set a runtime timeout on
real-time tasks their slice. Once this limit is exceeded the task will receive
SIGXCPU.
So it measures runtime since the last sleep.
Input and ideas by Thomas Gleixner and Lennart Poettering.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Lennart Poettering <mzxreary@0pointer.de>
CC: Michael Kerrisk <mtk.manpages@googlemail.com>
CC: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the task_struct members specific to rt scheduling together.
A future optimization could be to put sched_entity and sched_rt_entity
into a union.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are already 4 error paths in alloc_uid() that do incremental rollbacks.
I think it's time to merge them. This costs us 8 lines of code :)
Maybe it would be better to merge this patch with the previous one, but I
remember that some time ago I sent a similar patch (fixing the error path and
cleaning it), but I was told to make two patches in such cases.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The baseline code statically builds the span maps when the domain is formed.
Previous attempts at dynamically updating the maps caused a suspend-to-ram
regression, which should now be fixed.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch allows preemptible RCU to tolerate CPU-hotplug operations.
It accomplishes this by maintaining a local copy of a map of online
CPUs, which it accesses under its own lock.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements a new version of RCU which allows its read-side
critical sections to be preempted. It uses a set of counter pairs
to keep track of the read-side critical sections and flips them
when all tasks exit read-side critical section. The details
of this implementation can be found in this paper -
http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf
and the article-
http://lwn.net/Articles/253651/
This patch was developed as a part of the -rt kernel development and
meant to provide better latencies when read-side critical sections of
RCU don't disable preemption. As a consequence of keeping track of RCU
readers, the readers have a slight overhead (optimizations in the paper).
This implementation co-exists with the "classic" RCU implementations
and can be switched to at compiler.
Also includes RCU tracing summarized in debugfs.
[ akpm@linux-foundation.org: build fixes on non-preempt architectures ]
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix rcu_barrier() to work properly in preemptive kernel environment.
Also, the ordering of callback must be preserved while moving
callbacks to another CPU during CPU hotplug.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch re-organizes the RCU code to enable multiple implementations
of RCU. Users of RCU continues to include rcupdate.h and the
RCU interfaces remain the same. This is in preparation for
subsequently merging the preemptible RCU implementation.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch makes RCU use softirq instead of tasklets.
It also adds a memory barrier after raising the softirq
inorder to ensure that the cpu sees the most recently updated
value of rcu->cur while processing callbacks.
The discussion of the related theoretical race pointed out
by James Huang can be found here --> http://lkml.org/lkml/2007/11/20/603
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We had support for overlapping cpuset based rto logic in early
prototypes that is no longer used, so remove it.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The overload set/clears were originally idempotent when this logic was first
implemented. But that is no longer true due to the addition of the atomic
counter and this logic was never updated to work properly with that change.
So only adjust the overload state if it is actually changing to avoid
getting out of sync.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dmitry Adamushko found that the current implementation of the RT
balancing code left out changes to the sched_setscheduler and
rt_mutex_setprio.
This patch addresses this issue by adding methods to the schedule classes
to handle being switched out of (switched_from) and being switched into
(switched_to) a sched_class. Also a method for changing of priorities
is also added (prio_changed).
This patch also removes some duplicate logic between rt_mutex_setprio and
sched_setscheduler.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To make the main sched.c code more agnostic to the schedule classes.
Instead of having specific hooks in the schedule code for the RT class
balancing. They are replaced with a pre_schedule, post_schedule
and task_wake_up methods. These methods may be used by any of the classes
but currently, only the sched_rt class implements them.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yanmin Zhang noticed a nice optimization:
p = l * nr / nl, nl = l/g -> p = g * nr
which eliminates a do_div() from __sched_period().
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean-up try_to_wake_up().
Get rid of the 'new_cpu' variable in try_to_wake_up() [ that's, one
#ifdef section less ]. Also remove a few redundant blank lines.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
No need to do a check for 'affine wakeup and passive balancing possibilities'
in select_task_rq_fair() when task_cpu(p) == this_cpu.
I guess, this part got missed upon introduction of per-sched_class
select_task_rq() in try_to_wake_up().
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
style cleanup of various changes that were done recently.
no code changed:
text data bss dec hex filename
26399 2578 48 29025 7161 sched.o.before
26399 2578 48 29025 7161 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We move the rt-overload data as the first global to per-domain
reclassification. This limits the scope of overload related cache-line
bouncing to stay with a specified partition instead of affecting all
cpus in the system.
Finally, we limit the scope of find_lowest_cpu searches to the domain
instead of the entire system. Note that we would always respect domain
boundaries even without this patch, but we first would scan potentially
all cpus before whittling the list down. Now we can avoid looking at
RQs that are out of scope, again reducing cache-line hits.
Note: In some cases, task->cpus_allowed will effectively reduce our search
to within our domain. However, I believe there are cases where the
cpus_allowed mask may be all ones and therefore we err on the side of
caution. If it can be optimized later, so be it.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We add the notion of a root-domain which will be used later to rescope
global variables to per-domain variables. Each exclusive cpuset
essentially defines an island domain by fully partitioning the member cpus
from any other cpuset. However, we currently still maintain some
policy/state as global variables which transcend all cpusets. Consider,
for instance, rt-overload state.
Whenever a new exclusive cpuset is created, we also create a new
root-domain object and move each cpu member to the root-domain's span.
By default the system creates a single root-domain with all cpus as
members (mimicking the global state we have today).
We add some plumbing for storing class specific data in our root-domain.
Whenever a RQ is switching root-domains (because of repartitioning) we
give each sched_class the opportunity to remove any state from its old
domain and add state to the new one. This logic doesn't have any clients
yet but it will later in the series.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Christoph Lameter <clameter@sgi.com>
CC: Paul Jackson <pj@sgi.com>
CC: Simon Derr <simon.derr@bull.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch removes several cpumask operations by keeping track
of the first of the CPUS that is of the lowest priority. When
the search for the lowest priority runqueue is completed, all
the bits up to the first CPU with the lowest priority runqueue
is cleared.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We can cheaply track the number of bits set in the cpumask for the lowest
priority CPUs. Therefore, compute the mask's weight and use it to skip
the optimal domain search logic when there is only one CPU available.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We don't need to bother searching if the task cannot be migrated
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch changes the searching for a run queue by a waking RT task
to try to pick another runqueue if the currently running task
is an RT task.
The reason is that RT tasks behave different than normal
tasks. Preempting a normal task to run a RT task to keep
its cache hot is fine, because the preempted non-RT task
may wait on that same runqueue to run again unless the
migration thread comes along and pulls it off.
RT tasks behave differently. If one is preempted, it makes
an active effort to continue to run. So by having a high
priority task preempt a lower priority RT task, that lower
RT task will then quickly try to run on another runqueue.
This will cause that lower RT task to replace its nice
hot cache (and TLB) with a completely cold one. This is
for the hope that the new high priority RT task will keep
its cache hot.
Remeber that this high priority RT task was just woken up.
So it may likely have been sleeping for several milliseconds,
and will end up with a cold cache anyway. RT tasks run till
they voluntarily stop, or are preempted by a higher priority
task. This means that it is unlikely that the woken RT task
will have a hot cache to wake up to. So pushing off a lower
RT task is just killing its cache for no good reason.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We have logic to detect whether the system has migratable tasks, but we are
not using it when deciding whether to push tasks away. So we add support
for considering this new information.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current code base assumes a relatively flat CPU/core topology and will
route RT tasks to any CPU fairly equally. In the real world, there are
various toplogies and affinities that govern where a task is best suited to
run with the smallest amount of overhead. NUMA and multi-core CPUs are
prime examples of topologies that can impact cache performance.
Fortunately, linux is already structured to represent these topologies via
the sched_domains interface. So we change our RT router to consult a
combination of topology and affinity policy to best place tasks during
migration.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the original patch series that Steven Rostedt and I worked on together,
we both took different approaches to low-priority wakeup path. I utilized
"pre-routing" (push the task away to a less important RQ before activating)
approach, while Steve utilized a "post-routing" approach. The advantage of
my approach is that you avoid the overhead of a wasted activate/deactivate
cycle and peripherally related burdens. The advantage of Steve's method is
that it neatly solves an issue preventing a "pull" optimization from being
deployed.
In the end, we ended up deploying Steve's idea. But it later dawned on me
that we could get the best of both worlds by deploying both ideas together,
albeit slightly modified.
The idea is simple: Use a "light-weight" lookup for pre-routing, since we
only need to approximate a good home for the task. And we also retain the
post-routing push logic to clean up any inaccuracies caused by a condition
of "priority mistargeting" caused by the lightweight lookup. Most of the
time, the pre-routing should work and yield lower overhead. In the cases
where it doesnt, the post-router will bat cleanup.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It doesn't hurt if we allow the current CPU to be included in the
search. We will just simply skip it later if the current CPU turns out
to be the lowest.
We will use this later in the series
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Isolate the search logic into a function so that it can be used later
in places other than find_locked_lowest_rq().
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current wake-up code path tries to determine if it can optimize the
wake-up to "this_cpu" by computing load calculations. The problem is that
these calculations are only relevant to SCHED_OTHER tasks where load is king.
For RT tasks, priority is king. So the load calculation is completely wasted
bandwidth.
Therefore, we create a new sched_class interface to help with
pre-wakeup routing decisions and move the load calculation as a function
of CFS task's class.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
"this_rq" is normally used to denote the RQ on the current cpu
(i.e. "cpu_rq(this_cpu)"). So clean up the usage of this_rq to be
more consistent with the rest of the code.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some RT tasks (particularly kthreads) are bound to one specific CPU.
It is fairly common for two or more bound tasks to get queued up at the
same time. Consider, for instance, softirq_timer and softirq_sched. A
timer goes off in an ISR which schedules softirq_thread to run at RT50.
Then the timer handler determines that it's time to smp-rebalance the
system so it schedules softirq_sched to run. So we are in a situation
where we have two RT50 tasks queued, and the system will go into
rt-overload condition to request other CPUs for help.
This causes two problems in the current code:
1) If a high-priority bound task and a low-priority unbounded task queue
up behind the running task, we will fail to ever relocate the unbounded
task because we terminate the search on the first unmovable task.
2) We spend precious futile cycles in the fast-path trying to pull
overloaded tasks over. It is therefore optimial to strive to avoid the
overhead all together if we can cheaply detect the condition before
overload even occurs.
This patch tries to achieve this optimization by utilizing the hamming
weight of the task->cpus_allowed mask. A weight of 1 indicates that
the task cannot be migrated. We will then utilize this information to
skip non-migratable tasks and to eliminate uncessary rebalance attempts.
We introduce a per-rq variable to count the number of migratable tasks
that are currently running. We only go into overload if we have more
than one rt task, AND at least one of them is migratable.
In addition, we introduce a per-task variable to cache the cpus_allowed
weight, since the hamming calculation is probably relatively expensive.
We only update the cached value when the mask is updated which should be
relatively infrequent, especially compared to scheduling frequency
in the fast path.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since we now take an active approach to load balancing, we don't need to
balance RT tasks via the normal task balancer. In fact, this code was
found to pull RT tasks away from CPUS that the active movement performed,
resulting in large latencies.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds pushing of overloaded RT tasks from a runqueue that is
having tasks (most likely RT tasks) added to the run queue.
TODO: We don't cover the case of waking of new RT tasks (yet).
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the algorithm to pull tasks from RT overloaded runqueues.
When a pull RT is initiated, all overloaded runqueues are examined for
a RT task that is higher in prio than the highest prio task queued on the
target runqueue. If another runqueue holds a RT task that is of higher
prio than the highest prio task on the target runqueue is found it is pulled
to the target runqueue.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds an RT overload accounting system. When a runqueue has
more than one RT task queued, it is marked as overloaded. That is that it
is a candidate to have RT tasks pulled from it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds an algorithm to push extra RT tasks off a run queue to
other CPU runqueues.
When more than one RT task is added to a run queue, this algorithm takes
an assertive approach to push the RT tasks that are not running onto other
run queues that have lower priority. The way this works is that the highest
RT task that is not running is looked at and we examine the runqueues on
the CPUS for that tasks affinity mask. We find the runqueue with the lowest
prio in the CPU affinity of the picked task, and if it is lower in prio than
the picked task, we push the task onto that CPU runqueue.
We continue pushing RT tasks off the current runqueue until we don't push any
more. The algorithm stops when the next highest RT task can't preempt any
other processes on other CPUS.
TODO: The algorithm may stop when there are still RT tasks that can be
migrated. Specifically, if the highest non running RT task CPU affinity
is restricted to CPUs that are running higher priority tasks, there may
be a lower priority task queued that has an affinity with a CPU that is
running a lower priority task that it could be migrated to. This
patch set does not address this issue.
Note: checkpatch reveals two over 80 character instances. I'm not sure
that breaking them up will help visually, so I left them as is.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds accounting to each runqueue to keep track of the
highest prio task queued on the run queue. We only care about
RT tasks, so if the run queue does not contain any active RT tasks
its priority will be considered MAX_RT_PRIO.
This information will be used for later patches.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds accounting to keep track of the number of RT tasks running
on a runqueue. This information will be used in later patches.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this patch extends the soft-lockup detector to automatically
detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are
printed the following way:
------------------>
INFO: task prctl:3042 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
prctl D fd5e3793 0 3042 2997
f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
Call Trace:
[<c04883a5>] schedule_timeout+0x6d/0x8b
[<c04883d8>] schedule_timeout_uninterruptible+0x15/0x17
[<c0133a76>] msleep+0x10/0x16
[<c0138974>] sys_prctl+0x30/0x1e2
[<c0104c52>] sysenter_past_esp+0x5f/0xa5
=======================
2 locks held by prctl/3042:
#0: (&sb->s_type->i_mutex_key#5){--..}, at: [<c0197d11>] do_fsync+0x38/0x7a
#1: (jbd_handle){--..}, at: [<c01ca3d2>] journal_start+0xc7/0xe9
<------------------
the current default timeout is 120 seconds. Such messages are printed
up to 10 times per bootup. If the system has crashed already then the
messages are not printed.
if lockdep is enabled then all held locks are printed as well.
this feature is a natural extension to the softlockup-detector (kernel
locked up without scheduling) and to the NMI watchdog (kernel locked up
with IRQs disabled).
[ Gautham R Shenoy <ego@in.ibm.com>: CPU hotplug fixes. ]
[ Andrew Morton <akpm@linux-foundation.org>: build warning fix. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
This patch converts the known per-subsystem mutexes to get_online_cpus
put_online_cpus. It also eliminates the CPU_LOCK_ACQUIRE and
CPU_LOCK_RELEASE hotplug notification events.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use
get_online_cpus and put_online_cpus instead as it highlights the
refcount semantics in these operations.
The new API guarantees protection against the cpu-hotplug operation, but
it doesn't guarantee serialized access to any of the local data
structures. Hence the changes needs to be reviewed.
In case of pseries_add_processor/pseries_remove_processor, use
cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the
cpu_present_map there.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements a Refcount + Waitqueue based model for
cpu-hotplug.
Now, a thread which wants to prevent cpu-hotplug, will bump up a global
refcount and the thread which wants to perform a cpu-hotplug operation
will block till the global refcount goes to zero.
The readers, if any, during an ongoing cpu-hotplug operation are blocked
until the cpu-hotplug operation is over.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com> [For !CONFIG_HOTPLUG_CPU ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current load balancing scheme isn't good enough for precise
group fairness.
For example: on a 8-cpu system, I created 3 groups as under:
a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)
a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.
This is what I get with the latest scheduler git tree:
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------
Explanation of o/p:
Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60
[I can share the test case that produces a similar o/p if reqd]
The deviation from desired group fairness is as below:
a = +24.47%
b = -9.13%
c = -15.33%
which is quite high.
After the patch below is applied, here are the results:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
--------------------------------------------------------------------------------
Deviation from desired group fairness is as below:
a = +0.67%
b = -0.83%
c = +0.17%
which is far better IMO. Most of other runs have yielded a deviation within
+-2% at the most, which is good.
Why do we see bad (group) fairness with current scheuler?
=========================================================
Currently cpu's weight is just the summation of individual task weights.
This can yield incorrect results. For ex: consider three groups as below
on a 2-cpu system:
CPU0 CPU1
---------------------------
A (10) B(5)
C(5)
---------------------------
Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all
of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
1024).
The current scheme would yield a cpu weight of 10240 (10*1024) for each cpu and
the load balancer will think both CPUs are perfectly balanced and won't
move around any tasks. This, however, would yield this bandwidth:
A = 50%
B = 25%
C = 25%
which is not the desired result.
What's changing in the patch?
=============================
- How cpu weights are calculated when CONFIF_FAIR_GROUP_SCHED is
defined (see below)
- API Change
- Two tunables introduced in sysfs (under SCHED_DEBUG) to
control the frequency at which the load balance monitor
thread runs.
The basic change made in this patch is how cpu weight (rq->load.weight) is
calculated. Its now calculated as the summation of group weights on a cpu,
rather than summation of task weights. Weight exerted by a group on a
cpu is dependent on the shares allocated to it and also the number of
tasks the group has on that cpu compared to the total number of
(runnable) tasks the group has in the system.
Let,
W(K,i) = Weight of group K on cpu i
T(K,i) = Task load present in group K's cfs_rq on cpu i
T(K) = Total task load of group K across various cpus
S(K) = Shares allocated to group K
NRCPUS = Number of online cpus in the scheduler domain to
which group K is assigned.
Then,
W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
A load balance monitor thread is created at bootup, which periodically
runs and adjusts group's weight on each cpu. To avoid its overhead, two
min/max tunables are introduced (under SCHED_DEBUG) to control the rate
at which it runs.
Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl>
- don't start the load_balance_monitor when there is only a single cpu.
- rename the kthread because its currently longer than TASK_COMM_LEN
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
doms_cur[] array represents various scheduling domains which are
mutually exclusive. Currently cpusets code can modify this array (by
calling partition_sched_domains()) as a result of user modifying
sched_load_balance flag for various cpusets.
This patch introduces a mutex and corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read
the doms_cur[] array w/o worrying abt concurrent modifications to the
array.
The fair group scheduler code (introduced in next patch of this series)
makes use of this mutex to walk thr' doms_cur[] array while rebalancing
shares of task groups across cpus.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now
a summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.
This version of patch has a minor impact on code size, but should have
no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Minor bug fixes for the group scheduler:
- Use a mutex to serialize add/remove of task groups and also when
changing shares of a task group. Use the same mutex when printing
cfs_rq debugging stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
walking task group list)
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Stefano Brivio reported weird printk timestamp behavior during
CPU frequency changes:
http://bugzilla.kernel.org/show_bug.cgi?id=9475
fix CONFIG_PRINT_TIME's reliance on sched_clock() and use cpu_clock()
instead.
Reported-and-bisected-by: Stefano Brivio <stefano.brivio@polimi.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
make printk more robust by allowing recursion only if there's a crash
going on. Also add recursion detection.
I've tested it with an artificially injected printk recursion - instead
of a lockup or spontaneous reboot or other crash, the output was a well
controlled:
[ 41.057335] SysRq : <2>BUG: recent printk recursion!
[ 41.057335] loglevel0-8 reBoot Crashdump show-all-locks(D) tErm Full kIll saK showMem Nice powerOff showPc show-all-timers(Q) unRaw Sync showTasks Unmount shoW-blocked-tasks
also do all this printk-debug logic with irqs disabled.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Nick Piggin <npiggin@suse.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6:
[AVR32] extint: Set initial irq type to low level
[AVR32] extint: change set_irq_type() handling
[AVR32] NMI debugging
[AVR32] constify function pointer tables
[AVR32] ATNGW100: Update defconfig
[AVR32] ATSTK1002: Update defconfig
[AVR32] Kconfig: Choose daughterboard instead of CPU
[AVR32] Add support for ATSTK1003 and ATSTK1004
[AVR32] Clean up external DAC setup code
[AVR32] ATSTK1000: Move gpio-leds setup to setup.c
[AVR32] Add support for AT32AP7001 and AT32AP7002
[AVR32] Provide more CPU information in /proc/cpuinfo and dmesg
[AVR32] Oprofile support
[AVR32] Include instrumentation menu
Disable VGA text console for AVR32 architecture
[AVR32] Enable debugging only when needed
ptrace: Call arch_ptrace_attach() when request=PTRACE_TRACEME
[AVR32] Remove redundant try_to_freeze() call from do_signal()
[AVR32] Drop GFP_COMP for DMA memory allocations
arch_ptrace_attach() is a hook that allows the architecture to do
book-keeping after a ptrace attach. This patch adds a call to this
hook when handling a PTRACE_TRACEME request as well.
Currently only one architecture, m32r, implements this hook. When
called, it initializes a number of debug trap slots in the ptraced
task's thread struct, and it looks to me like this is the right thing
to do after a PTRACE_TRACEME request as well, not only after
PTRACE_ATTACH. Please correct me if I'm wrong.
I want to use this hook on AVR32 to turn the debugging hardware on
when a process is actually being debugged and keep it off otherwise.
To be able to do this, I need to intercept PTRACE_TRACEME and
PTRACE_ATTACH, as well as PTRACE_DETACH and thread exit. The latter
two can be handled by existing hooks.
Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
All kobjects require a dynamically allocated name now. We no longer
need to keep track if the name is statically assigned, we can just
unconditionally free() all kobject names on cleanup.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There is no need for kobject_unregister() anymore, thanks to Kay's
kobject cleanup changes, so replace all instances of it with
kobject_put().
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now that kobjects properly clean up their name structures, no matter if
they have a release function or not, we can drop this empty module
kobject release function too (it was needed prior to this because of the
way we handled static kobject names, we based the fact that if a release
function was present, then we could safely free the name string, now we
are more smart about things and only free names we have previously set.)
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Sysfs symlinks now require fully registered kobjects as a target,
otherwise the call to create a symlink will fail. Here we register
the kobject before we request the symlink in the holders directory.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tejun Heo <teheo@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The module driver specific code should belong in the driver core, not in
the kernel/ directory. So move this code. This is done in preparation
for some struct device_driver rework that should be confined to the
driver core code only.
This also lets us keep from exporting these functions, as no external
code should ever be calling it.
Thanks to Andrew Morton for the !CONFIG_MODULES fix.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
/sys/power should not be a kset, that's overkill. This patch renames it
to power_kset and fixes up all usages of it in the tree.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
kernel_kset does not need to be a kset, but a much simpler kobject now
that we have kobj_attributes.
We also rename kernel_kset to kernel_kobj to catch all users of this
symbol with a build error instead of an easy-to-ignore build warning.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Clean up the use of ksets and kobjects. Kobjects are instances of
objects (like struct user_info), ksets are collections of objects of a
similar type (like the uids directory containing the user_info directories).
So, use kobjects for the user_info directories, and a kset for the "uids"
directory.
On object cleanup, the final kobject_put() was missing.
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Switch all dynamically created ksets, that export simple attributes,
to kobj_attribute from subsys_attribute. Struct subsys_attribute will
be removed.
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Mike Halcrow <mhalcrow@us.ibm.com>
Cc: Phillip Hellewell <phillip@hellewell.homeip.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Dynamically create the kset instead of declaring it statically. We also
rename power_subsys to power_kset to catch all users of the variable and
we properly export it so that people don't have to guess that it really
is present in the system.
The pseries code is wierd, why is it createing /sys/power if CONFIG_PM
is disabled? Oh well, stupid big boxes ignoring config options...
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Dynamically create the kset instead of declaring it statically. We also
rename module_subsys to module_kset to catch all users of the variable.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Dynamically create the kset instead of declaring it statically. We also
rename kernel_subsys to kernel_kset to catch all users of this symbol
with a build error instead of an easy-to-ignore build warning.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
kobject_create_and_add is the same as kobject_add_dir, so drop
kobject_add_dir.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We don't need a "default" ktype for a kset. We should set this
explicitly every time for each kset. This change is needed so that we
can make ksets dynamic, and cleans up one of the odd, undocumented
assumption that the kset/kobject/ktype model has.
This patch is based on a lot of help from Kay Sievers.
Nasty bug in the block code was found by Dave Young
<hidave.darkstar@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Michael Wu noticed in his lkml post at
http://marc.info/?l=linux-kernel&m=119396182726091&w=2
that certain wireless drivers ended up having their name in module
memory, which would then crash the kernel on module unload.
The patch he proposed was a bit clumsy in that it increased the size of
a lockdep entry significantly; the patch below tries another approach,
it checks, on module teardown, if the name of a class is in module space
and then zaps the class. This is very similar to what we already do
with keys that are in module space.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using 64k pages on 64-bit PowerPC systems makes life difficult for
emulators that are trying to emulate an ISA, such as x86, which use a
smaller page size, since the emulator can no longer use the MMU and
the normal system calls for controlling page protections. Of course,
the emulator can emulate the MMU by checking and possibly remapping
the address for each memory access in software, but that is pretty
slow.
This provides a facility for such programs to control the access
permissions on individual 4k sub-pages of 64k pages. The idea is
that the emulator supplies an array of protection masks to apply to a
specified range of virtual addresses. These masks are applied at the
level where hardware PTEs are inserted into the hardware page table
based on the Linux PTEs, so the Linux PTEs are not affected. Note
that this new mechanism does not allow any access that would otherwise
be prohibited; it can only prohibit accesses that would otherwise be
allowed. This new facility is only available on 64-bit PowerPC and
only when the kernel is configured for 64k pages.
The masks are supplied using a new subpage_prot system call, which
takes a starting virtual address and length, and a pointer to an array
of protection masks in memory. The array has a 32-bit word per 64k
page to be protected; each 32-bit word consists of 16 2-bit fields,
for which 0 allows any access (that is otherwise allowed), 1 prevents
write accesses, and 2 or 3 prevent any access.
Implicit in this is that the regions of the address space that are
protected are switched to use 4k hardware pages rather than 64k
hardware pages (on machines with hardware 64k page support). In fact
the whole process is switched to use 4k hardware pages when the
subpage_prot system call is used, but this could be improved in future
to switch only the affected segments.
The subpage protection bits are stored in a 3 level tree akin to the
page table tree. The top level of this tree is stored in a structure
that is appended to the top level of the page table tree, i.e., the
pgd array. Since it will often only be 32-bit addresses (below 4GB)
that are protected, the pointers to the first four bottom level pages
are also stored in this structure (each bottom level page contains the
protection bits for 1GB of address space), so the protection bits for
addresses below 4GB can be accessed with one fewer loads than those
for higher addresses.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch allows the various users of attribute_groups to selectively
allow the appearance of group attributes. The primary consumer of
this will be the transport classes in which we currently have
elaborate attribute selection algorithms to do this same thing.
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
rcu_online_cpu() should be __cpuinit instead of __devinit.
WARNING: vmlinux.o(.text+0x4b6d5): Section mismatch: reference to .init.text: (between 'rcu_cpu_notify' and 'wakeme_after_rcu')
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
setting cpu share to 1 causes hangs, as reported in:
http://bugzilla.kernel.org/show_bug.cgi?id=9779
as the default share is 1024, the values of 0 and 1 can indeed
cause problems. Limit it to 2 or higher values.
These values can only be set by the root user - but still it
makes sense to protect against nonsensical values.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The caller is __cpuinit.
Also, this code block and its caller are inside #ifdef CONFIG_HOTPLUG_CPU
blocks, so this code should reflect that config symbol's usage.
WARNING: vmlinux.o(.text+0x4252f): Section mismatch: reference to .init.text: (between 'timer_cpu_notify' and 'msleep')
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@akpm@linux-foundation.org>
Fix section mismatch in hrtimer.c:
WARNING: vmlinux.o(.text+0x50c61): Section mismatch: reference to .init.text: (between 'hrtimer_cpu_notify' and 'down_read_trylock')
Noticed by Johannes Berg and confirmed by Sam Ravnborg.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@akpm@linux-foundation.org>
call_usermodehelper_exec() has an exit path that can leave the
helper_lock() call at the top of the routine unbalanced. The attached
patch fixes this issue.
Signed-off-by: Nigel Cunningham <nigel@tuxonice.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Young reported warnings from lockdep that the workqueue API
can sometimes try to register lockdep classes with the same key
but different names. This is not permitted in lockdep.
Unfortunately, I was unaware of that restriction when I wrote
the code to debug workqueue problems with lockdep and used the
workqueue name as the lockdep class name. This can obviously
lead to the problem if the workqueue name is dynamic.
This patch solves the problem by always using a constant name
for the workqueue's lockdep class, namely either the constant
name that was passed in or a string consisting of the variable
name.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Lockdep, during self-test (when it was simulating double unlocks) was
sometimes unconditionally unlocking a spinlock when it had not been
locked. This won't work for ticket locks.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Kyle McMartin reports sysrq_timer_list_show() can hit the module mutex
from hard interrupt context. These paths don't need to though, since we
long ago changed all the module list manipulation to occur via
stop_machine().
Disabling preemption is enough.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
pnpacpi: print resource shortage message only once
PM: ACPI and APM must not be enabled at the same time
ACPI: apply quirk_ich6_lpc_acpi to more ICH8 and ICH9
ACPICA: fix acpi_serialize hang regression
ACPI : Not register gsi for PCI IDE controller in legacy mode
ACPI: Reintroduce run time configurable max_cstate for !CPU_IDLE case
ACPI: Make sysfs interface in ACPI power optional.
ACPI: EC: Enable boot EC before bus_scan
increase PNP_MAX_PORT to 40 from 24
task_ppid_nr_ns is called in three places. One of these should never
have called it. In the other two, using it broke the existing
semantics. This was presumably accidental. If the function had not
been there, it would have been much more obvious to the eye that those
patches were changing the behavior. We don't need this function.
In task_state, the pid of the ptracer is not the ppid of the ptracer.
In do_task_stat, ppid is the tgid of the real_parent, not its pid.
I also moved the call outside of lock_task_sighand, since it doesn't
need it.
In sys_getppid, ppid is the tgid of the real_parent, not its pid.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ACPI and APM used "pm_active" to guarantee that
they would not be simultaneously active.
But pm_active was recently moved under CONFIG_PM_LEGACY,
so that without CONFIG_PM_LEGACY, pm_active became a NOP --
allowing ACPI and APM to both be simultaneously enabled.
This caused unpredictable results, including boot hangs.
Further, the code under CONFIG_PM_LEGACY is scheduled
for removal.
So replace pm_active with pm_flags.
pm_flags depends only on CONFIG_PM,
which is present for both CONFIG_APM and CONFIG_ACPI.
http://bugzilla.kernel.org/show_bug.cgi?id=9194
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The show_task function invoked by sysrq-t et al displays the
pid and parent's pid of each task. It seems more useful to
show the actual process hierarchy here than who is using
ptrace on each process.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roland Westrelin did a great analysis of a long standing thinko in the
return path of futex_lock_pi.
While we fixed the lock steal case long ago, which was easy to trigger,
we never had a test case which exposed this problem and stupidly never
thought about the reverse lock stealing scenario and the return to user
space with a stale state.
When a blocked tasks returns from rt_mutex_timed_locked without holding
the rt_mutex (due to a signal or timeout) and at the same time the task
holding the futex is releasing the futex and assigning the ownership of
the futex to the returning task, then it might happen that a third task
acquires the rt_mutex before the final rt_mutex_trylock() of the
returning task happens under the futex hash bucket lock. The returning
task returns to user space with ETIMEOUT or EINTR, but the user space
futex value is assigned to this task. The task which acquired the
rt_mutex fixes the user space futex value right after the hash bucket
lock has been released by the returning task, but for a short period of
time the user space value is wrong.
Detailed description is available at:
https://bugzilla.redhat.com/show_bug.cgi?id=400541
The fix for this is the same as we do when the rt_mutex was acquired by
a higher priority task via lock stealing from the designated new owner.
In that case we already fix the user space value and the internal
pi_state up before we return. This mechanism can be used to fixup the
above corner case as well. When the returning task, which failed to
acquire the rt_mutex, notices that it is the designated owner of the
futex, then it fixes up the stale user space value and the pi_state,
before returning to user space. This happens with the futex hash bucket
lock held, so the task which acquired the rt_mutex is guaranteed to be
blocked on the hash bucket lock. We can access the rt_mutex owner, which
gives us the pid of the new owner, safely here as the owner is not able
to modify (release) it while waiting on the hash bucket lock.
Rename the "curr" argument of fixup_pi_state_owner() to "newowner" to
avoid confusion with current and add the check for the stale state into
the failure path of rt_mutex_trylock() in the return path of
unlock_futex_pi(). If the situation is detected use
fixup_pi_state_owner() to assign everything to the owner of the
rt_mutex.
Pointed-out-and-tested-by: Roland Westrelin <roland.westrelin@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds the array length of "free_area.free_list" to the vmcoreinfo
data so that makedumpfile (dump filtering command) can exclude all free pages
in linux-2.6.24.
makedumpfile creates a small dumpfile by excluding unnecessary pages for the
analysis. To distinguish unnecessary pages, makedumpfile gets the vmcoreinfo
data which has the minimum debugging information only for dump filtering.
In 2.6.24-rc1 or later, the free_area.free_list is an array which has one list
for each migrate types instead of a single list. makedumpfile needs the array
length of "free_area.free_list" and the vmcoreinfo data should contain it.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Tested-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Acked-by: Simon Horman <horms@verge.net.au>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The ac_ppid field reported in process accounting records
should match what getppid() would have returned to that
process, regardless of whether a debugger is attached.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous commit missed one use of "may_attach()" that had been
renamed to __ptrace_may_attach(). Tssk, tssk, Al.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Contents of /proc/*/maps is sensitive and may become sensitive after
open() (e.g. if target originally shares our ->mm and later does exec
on suid-root binary).
Check at read() (actually, ->start() of iterator) time that mm_struct
we'd grabbed and locked is
- still the ->mm of target
- equal to reader's ->mm or the target is ptracable by reader.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Meelis Roos reported these warnings on sparc64:
CC kernel/sched.o
In file included from kernel/sched.c:879:
kernel/sched_debug.c: In function 'nsec_high':
kernel/sched_debug.c:38: warning: comparison of distinct pointer types lacks a cast
the debug check in do_div() is over-eager here, because the long long
is always positive in these places. Mark this by casting them to
unsigned long long.
no change in code output:
text data bss dec hex filename
51471 6582 376 58429 e43d sched.o.before
51471 6582 376 58429 e43d sched.o.after
md5:
7f7729c111f185bf3ccea4d542abc049 sched.o.before.asm
7f7729c111f185bf3ccea4d542abc049 sched.o.after.asm
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We're exporting an __init function, oops :-)
The core issue here is that add_preferred_console() is marked
as __init, this makes it impossible to invoke this thing from
a driver probe routine which is what the Sparc serial drivers
need to do.
There is no harm in dropping the __init marker. This code will
actually work properly when invoked from a modular driver,
except that init will probably not pick up the console change
without some other support code.
Then we can drop the __init from sunserial_console_match()
and we're no longer exporting an __init function to modules.
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the change in kobject name handling, the module kobject needs to
have a null release function to ensure that the name it previously set
will be properly cleaned up.
All of this wierdness goes away in 2.6.25 with the rework of the kobject
name and cleanup logic, but this is required for 2.6.24.
Thanks to Alexey Dobriyan for finding the problem, and to Kay Sievers
for pointing out the simple way to fix it after I tried many complex
ways.
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Right now it's nearly impossible for parsers that collect kernel crashes
from logs or emails (such as www.kerneloops.org) to detect the
end-of-oops condition. In addition, it's not currently possible to
detect whether or not 2 oopses that look alike are actually the same
oops reported twice, or are truly two unique oopses.
This patch adds an end-of-oops marker, and makes the end marker include
a very simple 64-bit random ID to be able to detect duplicate reports.
Normally, this ID is calculated as a late_initcall() (in the hope that
at that time there is enough entropy to get a unique enough ID); however
for early oopses the oops_exit() function needs to generate the ID on
the fly.
We do this all at the _end_ of an oops printout, so this does not impact
our ability to get the most important portions of a crash out to the
console first.
[ Sidenote: the already existing oopses-since-bootup counter we print
during crashes serves as the differentiator between multiple oopses
that trigger during the same bootup. ]
Tested on 32-bit and 64-bit x86. Artificially injected very early
crashes as well, as expected they result in this constant ID after
multiple bootups:
---[ end trace ca143223eefdc828 ]---
---[ end trace ca143223eefdc828 ]---
because the random pools are still all zero. But it all still works
fine and causes no additional problems (which is the main goal of
instrumentation code).
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Realtime tasks would not account their runtime during ticks. Which would lead
to:
struct sched_param param = { .sched_priority = 10 };
pthread_setschedparam(pthread_self(), SCHED_FIFO, ¶m);
while (1) ;
Not showing up in top.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In commit 76d2160147 lazy irq disabling
was implemented, and the simple irq handler had a masking set to it.
Remy Bohmer discovered that some devices in the ARM architecture
would trigger the mask, but never unmask it. His patch to do the
unmasking was questioned by Russell King about masking simple irqs
to begin with. Looking further, it was discovered that the problems
Remy was seeing was due to improper use of the simple handler by
devices, and he later submitted patches to fix those. But the issue
that was uncovered was that the simple handler should never mask.
This patch reverts the masking in the simple handler.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
This patch fixes the following section mismatches with CONFIG_HOTPLUG=n,
CONFIG_HOTPLUG_CPU=y:
...
WARNING: vmlinux.o(.text+0x41cd3): Section mismatch: reference to .init.data:tvec_base_done.22610 (between 'timer_cpu_notify' and 'run_timer_softirq')
WARNING: vmlinux.o(.text+0x41d67): Section mismatch: reference to .init.data:tvec_base_done.22610 (between 'timer_cpu_notify' and 'run_timer_softirq')
...
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Resolve the following regression of a choppy, almost unusable laptop:
http://lkml.org/lkml/2007/12/7/299http://bugzilla.kernel.org/show_bug.cgi?id=9525
A previous version of the code did the reprogramming of the broadcast
device in the return from idle code. This was removed, but the logic in
tick_handle_oneshot_broadcast() was kept the same.
When a broadcast interrupt happens we signal the expiry to all CPUs
which have an expired event. If none of the CPUs has an expired event,
which can happen in dyntick mode, then we reprogram the broadcast
device. We do not reprogram otherwise, but this is only correct if all
CPUs, which are in the idle broadcast state have been woken up.
The code ignores, that there might be pending not yet expired events on
other CPUs, which are in the idle broadcast state. So the delivery of
those events can be delayed for quite a time.
Change the tick_handle_oneshot_broadcast() function to check for CPUs,
which are in broadcast state and are not woken up by the current event,
and enforce the rearming of the broadcast device for those CPUs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
measurements by Yanmin Zhang have shown that SCHED_BATCH tasks benefit
if they run the same place_entity() logic as SCHED_OTHER tasks - so
uniformize behavior in this area.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
min_sched_granularity_ns, max_sched_granularity_ns,
min_wakeup_granularity_ns and max_wakeup_granularity_ns are declared
"unsigned long".
This is incorrect since proc_dointvec_minmax() expects plain "int" guard
values.
This bug only triggers on big endian 64 bit arches.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This following commit
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=fdf8cb0909b531f9ae8f9b9d7e4eb35ba3505f07
un-inlined a low-level rwsem function, but did not mark it as __sched.
The result is that it now shows up as thread wchan (which also affects
/proc/profile stats). The following simple patch fixes this by properly
marking rwsem_down_failed_common() as a __sched function.
Also in this patch, which is up for discussion, marks down_read() and
down_write() proper as __sched. For profiling, it is pretty much
useless to know that a semaphore is beig help - it is necessary to know
_which_ one. By going up another frame on the stack, the information
becomes much more useful.
In summary, the below change to lib/rwsem.c should be applied; the
changes to kernel/rwsem.c could be applied if other kernel hackers agree
with my proposal that down_read()/down_write() in the profile is not
enough.
[ akpm@linux-foundation.org: build fix ]
Signed-off-by: Livio Soares <livio@eecg.toronto.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and
sched_move_task()) must handle a given task differently in case it's the
'rq->curr' task on its run-queue. The task_running() interface is not
suitable for determining such tasks for platforms with one of the
following options:
#define __ARCH_WANT_UNLOCKED_CTXSW
#define __ARCH_WANT_INTERRUPTS_ON_CTXSW
Due to the fact that it makes use of 'p->oncpu == 1' as a criterion but
such a task is not necessarily 'rq->curr'.
The detailed explanation is available here:
https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.html
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This reverts commit 54f9f80d65 ("hugetlb:
Add hugetlb_dynamic_pool sysctl")
Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
sysctl is not needed, as its semantics can be expressed by 0 in the
overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
(pool enabled).
(Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hugetlb: introduce nr_overcommit_hugepages sysctl
While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:
1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.
2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.
To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition
nr_overcommit_hugepages > 0
indicates the same administrative setting as
hugetlb_dynamic_pool == 1
Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
A few caveats:
1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has allocated. Another process could then
try grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.
2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls are increased
sufficiently, or the surplus huge pages go out of use and are freed.
Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
ACPI: move timer broadcast before busmaster disable
clockevents: warn once when program_event() is called with negative expiry
hrtimers: avoid overflow for large relative timeouts
The hrtimer problem with large relative timeouts resulting in a
negative expiry time went unnoticed as there is no check in the
clockevents_program_event() code. Put a check there with a WARN_ONCE
to avoid such problems in the future.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Relative hrtimers with a large timeout value might end up as negative
timer values, when the current time is added in hrtimer_start().
This in turn is causing the clockevents_set_next() function to set an
huge timeout and sleep for quite a long time when we have a clock
source which is capable of long sleeps like HPET. With PIT this almost
goes unnoticed as the maximum delta is ~27ms. The non-hrt/nohz code
sorts this out in the next timer interrupt, so we never noticed that
problem which has been there since the first day of hrtimers.
This bug became more apparent in 2.6.24 which activates HPET on more
hardware.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
some platforms have sched_clock() implementations that cannot be called
very early during wakeup. If it's called it might hang or crash in hard
to debug ways. So only call update_rq_clock() [which calls sched_clock()]
if sched_init() has already been called. (rq->idle is NULL before the
scheduler is initialized.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
make cli/sti annotation warnings easier to interpret.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Similar to mutex_lock_interruptible, it can be interrupted by a fatal
signal only.
Signed-off-by: Liam R. Howlett <howlett@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Set TASK_WAKEKILL for TASK_STOPPED and TASK_TRACED, add TASK_KILLABLE and
use TASK_WAKEKILL in signal_wake_up()
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Make the Kconfig.instrumentation file a bit easier on the eyes, and use
the new ARCH_SUPPORTS_OPROFILE for x86[-64].
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The cleanup 09cadedbdc broke the oprofile
configuration for MIPS by allowing oprofile support to be built for
kernel models where oprofile doesn't have a chance in hell to work.
Just a dependecy list on a number of architectures is - surprise - broken
and should as per past discussions probably in most considered to be
broken in most cases. So I introduce a dependency for the oprofile
configuration on ARCH_SUPPORTS_OPROFILE.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
register_sysctl_table() can return NULL sometimes, e.g. when kmalloc()
returns NULL or when sysctl check fails.
I've also noticed, that many (most?) code in the kernel doesn't check for
the return value from register_sysctl_table() and later simply calls the
unregister_sysctl_table() with potentially NULL argument.
This is unlikely on a common kernel configuration, but in case we're
dealing with modules and/or fault-injection support, there's a slight
possibility of an OOPS.
Changing all the users to check for return code from the registering does
not look like a good solution - there are too many code doing this and
failure in sysctl tables registration is not a good reason to abort module
loading (in most of the cases).
So I think, that we can just have this check in unregister_sysctl_table
just to avoid accidental OOPS-es (actually, the unregister_sysctl_table()
did exactly this, before the start_unregistering() appeared).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we are complicating the code in copy_process, the clone ABI, and
if we fix the bugs sys_setsid itself, with an unnecessary open coded
version of sys_setsid.
So just simplify everything and don't special case the session and pgrp of
the initial process in a pid namespace.
Having this special case actually presents to user space the classic linux
startup conditions with session == pgrp == 0 for /sbin/init.
We already handle sending signals to processes in a child pid namespace.
We need to handle sending signals to processes in a parent pid namespace
for cases like SIGCHILD and SIGIO.
This makes nothing extra visible inside a pid namespace. So this extra
special case appears to have no redeeming merits.
Further removing this special case increases the flexibility of how we can
use pid namespaces, by not requiring the initial process in a pid namespace
to be a daemon.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Torsten Kaiser wrote:
| static inline int in_range(const void *start, const void *addr, const void *end)
| {
| return addr >= start && addr <= end;
| }
| This will return true, if addr is in the range of start (including)
| to end (including).
|
| But debug_check_no_locks_freed() seems does:
| const void *mem_to = mem_from + mem_len
| -> mem_to is the last byte of the freed range, that fits in_range
| lock_from = (void *)hlock->instance;
| -> first byte of the lock
| lock_to = (void *)(hlock->instance + 1);
| -> first byte of the next lock, not last byte of the lock that is being checked!
|
| The test is:
| if (!in_range(mem_from, lock_from, mem_to) &&
| !in_range(mem_from, lock_to, mem_to))
| continue;
| So it tests, if the first byte of the lock is in the range that is freed ->OK
| And if the first byte of the *next* lock is in the range that is freed
| -> Not OK.
We can also simplify in_range checks, we need only 2 comparisons, not 4.
If the lock is not in memory range, it should be either at the left of range
or at the right.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
fix the oops that can be seen in:
http://bugzilla.kernel.org/attachment.cgi?id=13828&action=view
it is not safe to print the locks of running tasks.
(even with this fix we have a small race - but this is a debug
function after all.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
style cleanup of various changes that were done recently.
no code changed:
text data bss dec hex filename
23680 2542 28 26250 668a sched.o.before
23680 2542 28 26250 668a sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
David Holmes found a bug in the -rt tree with respect to
pthread_cond_timedwait. After trying his test program on the latest git
from mainline, I found the bug was there too. The bug he was seeing
that his test program showed, was that if one were to do a "Ctrl-Z" on a
process that was in the pthread_cond_timedwait, and then did a "bg" on
that process, it would return with a "-ETIMEDOUT" but early. That is,
the timer would go off early.
Looking into this, I found the source of the problem. And it is a rather
nasty bug at that.
Here's the relevant code from kernel/futex.c: (not in order in the file)
[...]
smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
struct timespec __user *utime, u32 __user *uaddr2,
u32 val3)
{
struct timespec ts;
ktime_t t, *tp = NULL;
u32 val2 = 0;
int cmd = op & FUTEX_CMD_MASK;
if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
return -EFAULT;
if (!timespec_valid(&ts))
return -EINVAL;
t = timespec_to_ktime(ts);
if (cmd == FUTEX_WAIT)
t = ktime_add(ktime_get(), t);
tp = &t;
}
[...]
return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
}
[...]
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
int ret;
int cmd = op & FUTEX_CMD_MASK;
struct rw_semaphore *fshared = NULL;
if (!(op & FUTEX_PRIVATE_FLAG))
fshared = ¤t->mm->mmap_sem;
switch (cmd) {
case FUTEX_WAIT:
ret = futex_wait(uaddr, fshared, val, timeout);
[...]
static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
u32 val, ktime_t *abs_time)
{
[...]
struct restart_block *restart;
restart = ¤t_thread_info()->restart_block;
restart->fn = futex_wait_restart;
restart->arg0 = (unsigned long)uaddr;
restart->arg1 = (unsigned long)val;
restart->arg2 = (unsigned long)abs_time;
restart->arg3 = 0;
if (fshared)
restart->arg3 |= ARG3_SHARED;
return -ERESTART_RESTARTBLOCK;
[...]
static long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = (u32 __user *)restart->arg0;
u32 val = (u32)restart->arg1;
ktime_t *abs_time = (ktime_t *)restart->arg2;
struct rw_semaphore *fshared = NULL;
restart->fn = do_no_restart_syscall;
if (restart->arg3 & ARG3_SHARED)
fshared = ¤t->mm->mmap_sem;
return (long)futex_wait(uaddr, fshared, val, abs_time);
}
So when the futex_wait is interrupt by a signal we break out of the
hrtimer code and set up or return from signal. This code does not return
back to userspace, so we set up a RESTARTBLOCK. The bug here is that we
save the "abs_time" which is a pointer to the stack variable "ktime_t t"
from sys_futex.
This returns and unwinds the stack before we get to call our signal. On
return from the signal we go to futex_wait_restart, where we update all
the parameters for futex_wait and call it. But here we have a problem
where abs_time is no longer valid.
I verified this with print statements, and sure enough, what abs_time
was set to ends up being garbage when we get to futex_wait_restart.
The solution I did to solve this (with input from Linus Torvalds)
was to add unions to the restart_block to allow system calls to
use the restart with specific parameters. This way the futex code now
saves the time in a 64bit value in the restart block instead of storing
it on the stack.
Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
Not sure what that is there for. If this turns out to be a problem, I've
tested this with using "unsigned int" for u32 and "unsigned long long" for
u64 and it worked just the same. I'm using u32 and u64 just to be
consistent with what the futex code uses.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>