linux

Author	SHA1	Message	Date
Ingo Molnar	87c8a64475	printk: export console_drivers this symbol is needed by drivers/video/xen-fbfront.ko. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-16 09:15:58 +02:00
Abhishek Sagar	a4500b84c5	ftrace: fix "notrace" filtering priority This is a fix to give notrace filter rules priority over "set_ftrace_filter" rules. This fix ensures that functions which are set to be filtered and are concurrently marked as "notrace" don't get recorded. As of now, if a record is marked as FTRACE_FL_FILTER and is enabled, then the notrace flag is not checked. Tested on x86-32. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-14 08:32:29 +02:00
Vegard Nossum	a5a242dcee	stacktrace: print_stack_trace() cleanup - shorter code and better atomicity with regards to printk(). (It's been tested with the backtrace self-test code on i386 and x86_64.) Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-13 13:17:30 +02:00
Huang, Ying	2429e4ee78	lockdep: output lock_class key instead of address for forward dependency output The key instead of address of lock_class should be output in /proc/lockdep when forward dependency is output, because key is output for lock_class itself as identifier too. This patch is based on x86/auto-latest branch of git-x86 tree, and has been tested on x86_64 platform. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-13 08:55:01 +02:00
Masami Hiramatsu	67dddaad5d	kprobes: fix error checking of batch registration Fix error checking routine to catch an error which occurs in first __register_*probe(). Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-06-12 18:05:40 -07:00
Rafael J. Wysocki	d8f3de0d24	Suspend-related patches for 2.6.27 ACPI PM: Add possibility to change suspend sequence There are some systems out there that don't work correctly with our current suspend/hibernation code ordering. Provide a workaround for these systems allowing them to pass 'acpi_sleep=old_ordering' in the kernel command line so that it will use the pre-ACPI 2.0 ("old") suspend code ordering. Unfortunately, this requires us to add a platform hook to the resuming of devices for recovering the platform in case one of the device drivers' .suspend() routines returns error code. Namely, ACPI 1.0 specifies that _PTS should be called before suspending devices, but _WAK still should be called before resuming them in order to undo the changes made by _PTS. However, if there is an error during suspending devices, they are automatically resumed without returning control to the PM core, so the _WAK has to be called from within device_resume() in that cases. The patch also reorders and refactors the ACPI suspend/hibernation code to avoid duplication as far as reasonably possible. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2008-06-12 14:25:09 -07:00
Linus Torvalds	1b3cba8e60	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: 64-bit: fix arithmetics overflow sched: fair group: fix overflow(was: fix divide by zero) sched: fix TASK_WAKEKILL vs SIGKILL race	2008-06-12 12:55:18 -07:00
Lai Jiangshan	7a232e0350	sched: 64-bit: fix arithmetics overflow (overflow means weight >= 2^32 here, because inv_weigh = 2^32/weight) A weight of a cfs_rq is the sum of weights of which entities are queued on this cfs_rq, so it will overflow when there are too many entities. Although, overflow occurs very rarely, but it break fairness when it occurs. 64-bits systems have more memory than 32-bit systems and 64-bit systems can create more process usually, so overflow may occur more frequently. This patch guarantees fairness when overflow happens on 64-bit systems. Thanks to the optimization of compiler, it changes nothing on 32-bit. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-12 14:29:54 +02:00
Lai Jiangshan	2e084786f6	sched: fair group: fix overflow(was: fix divide by zero) I found a bug which can be reproduced by this way:(linux-2.6.26-rc5, x86-64) (use 2^32, 2^33, ...., 2^63 as shares value) # mkdir /dev/cpuctl # mount -t cgroup -o cpu cpuctl /dev/cpuctl # cd /dev/cpuctl # mkdir sub # echo 0x8000000000000000 > sub/cpu.shares # echo $$ > sub/tasks oops here! divide by zero. This is because do_div() expects the 2th parameter to be 32 bits, but unsigned long is 64 bits in x86_64. Peter Zijstra pointed it out that the sane thing to do is limit the shares value to something smaller instead of using an even more expensive divide. Also, I found another bug about "the shares value is too large": pid1 and pid2 are set affinity to cpu#0 pid1 is attached to cg1 and pid2 is attached to cg2 if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000 then pid2 got 100% usage of cpu, and pid1 0% if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000 then pid2 got 0% usage of cpu, and pid1 100% And a weight of a cfs_rq is the sum of weights of which entities are queued on this cfs_rq, so the shares value should be limited to a smaller value. I think that (1UL << 18) is a good limited value: 1) it's not too large, we can create a lot of group before overflow 2) it's several times the weight value for nice=-19 (not too small) Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-12 14:23:55 +02:00
Jiri Slaby	20764ff1ef	ftrace: fix printout Do not print loglevel before "entries of %ld bytes". Move it to the previous pr_info. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-12 11:51:03 +02:00
Rafael J. Wysocki	1eede070a5	Introduce new top level suspend and hibernation callbacks Introduce 'struct pm_ops' and 'struct pm_ext_ops' ('ext' meaning 'extended') representing suspend and hibernation operations for bus types, device classes, device types and device drivers. Modify the PM core to use 'struct pm_ops' and 'struct pm_ext_ops' objects, if defined, instead of the ->suspend(), ->resume(), ->suspend_late(), and ->resume_early() callbacks (the old callbacks will be considered as legacy and gradually phased out). The main purpose of doing this is to separate suspend (aka S2RAM and standby) callbacks from hibernation callbacks in such a way that the new callbacks won't take arguments and the semantics of each of them will be clearly specified. This has been requested for multiple times by many people, including Linus himself, and the reason is that within the current scheme if ->resume() is called, for example, it's difficult to say why it's been called (ie. is it a resume from RAM or from hibernation or a suspend/hibernation failure etc.?). The second purpose is to make the suspend/hibernation callbacks more flexible so that device drivers can handle more than they can within the current scheme. For example, some drivers may need to prevent new children of the device from being registered before their ->suspend() callbacks are executed or they may want to carry out some operations requiring the availability of some other devices, not directly bound via the parent-child relationship, in order to prepare for the execution of ->suspend(), etc. Ultimately, we'd like to stop using the freezing of tasks for suspend and therefore the drivers' suspend/hibernation code will have to take care of the handling of the user space during suspend/hibernation. That, in turn, would be difficult within the current scheme, without the new ->prepare() and ->complete() callbacks. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2008-06-10 10:59:50 -07:00
Ankita Garg	2b1bce1787	ftrace: disable tracing when current_tracer is set to "none" Found that inspite of setting the current_tracer to "none", trace from the previous trace type continued to be collected. The patch below fixes this and causes the trace to be disabled when the "none" type is selected. Compile and boot tested the patch for functionality. Signed-off-by: Ankita Garg <ankita@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 14:52:30 +02:00
Ingo Molnar	040ec23d07	sched: sched_clock() lockdep fix Sitsofe Wheeler bisected the following commit to cause a lockdep to warn about itself and turn itself off: > commit `c6531cce6e` > Author: Ingo Molnar <mingo@elte.hu> > Date: Mon May 12 21:21:14 2008 +0200 > > sched: do not trace sched_clock do not use raw irq flags in cpu_clock() as it causes lockdep to lose track of the true state of the IRQ flag. Reported-and-bisected-by: Sitsofe Wheeler <sitsofe@yahoo.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-06-10 14:52:14 +02:00
Paul Mundt	e9886ca3a9	sched: kill off dead cfs_rq_set_shares() Building with CONFIG_FAIR_GROUP_SCHED=y on UP results in an unused cfs_rq_set_shares() reference. As nothing is using this dummy function in the first place, just kill it off. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 14:51:04 +02:00
Mike Galbraith	6492c7f83e	sched: trivial sched_features cleanup Remove unused debug/tuning features. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 12:38:17 +02:00
David Rientjes	9985b0bab3	sched: prevent bound kthreads from changing cpus_allowed Kthreads that have called kthread_bind() are bound to specific cpus, so other tasks should not be able to change their cpus_allowed from under them. Otherwise, it is possible to move kthreads, such as the migration or software watchdog threads, so they are not allowed access to the cpu they work on. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 12:26:16 +02:00
Peter Zijlstra	7def2be1dc	sched: fix hotplug cpus on ia64 Cliff Wickman wrote: > I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1) > and get a very predictable hotplug cpu problem. > billberry1:/tmp/cpw # ./dis > disabled cpu 17 > enabled cpu 17 > billberry1:/tmp/cpw # ./dis > disabled cpu 17 > enabled cpu 17 > billberry1:/tmp/cpw # ./dis > > The script that disables the cpu always hangs (unkillable) > on the 3rd attempt. > > And a bit further: > The kstopmachine thread always sits on the run queue (real time) for about > 30 minutes before running. this fix solves some (but not all) issues between CPU hotplug and RT bandwidth throttling. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 12:17:28 +02:00
Abhishek Sagar	34078a5e44	ftrace: prevent freeing of all failed updates Steven Rostedt wrote: > If we unload a module and reload it, will it ever get converted again? The intent was always to filter core kernel functions to prevent their freeing. Here's a fix which should allow re-recording of module call-sites. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 11:59:05 +02:00
Abhishek Sagar	eb9a7bf091	ftrace: add debugfs entry 'failures' Identify functions which had their mcount call-site updates failed. This can help us track functions which ftrace shouldn't fiddle with, and are thus not being traced. If there is no race with any external agent which is modifying the mcount call-site, then this file displays no entries (normal case). Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 11:58:17 +02:00
Abhishek Sagar	1d74f2a0f6	ftrace: remove ftrace_ip_converted() Remove the unneeded function ftrace_ip_converted(). Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 11:57:49 +02:00
Abhishek Sagar	0eb967012e	ftrace: prevent freeing of all failed updates Prevent freeing of records which cause problems and correspond to function from core kernel text. A new flag, FTRACE_FL_CONVERTED is used to mark a record as "converted". All other records are patched lazily to NOPs. Failed records now also remain on frace_hash table. Each invocation of ftrace_record_ip now checks whether the traced function has ever been recorded (including past failures) and doesn't re-record it again. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 11:56:57 +02:00
Oleg Nesterov	6ad36762d7	__mutex_lock_common: use signal_pending_state() Change __mutex_lock_common() to use signal_pending_state() for the sake of the code re-use. This adds 7 bytes to kernel/mutex.o, but afaics only because gcc isn't smart enough. (btw, uninlining of __mutex_lock_common() shrinks .text from 2722 to 1542, perhaps it is worth doing). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 11:45:09 +02:00
Oleg Nesterov	16882c1e96	sched: fix TASK_WAKEKILL vs SIGKILL race schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case, this allows us to do current->state = TASK_INTERRUPTIBLE; schedule(); without fear to sleep with pending signal. However, the code like current->state = TASK_KILLABLE; schedule(); is not right, schedule() doesn't take TASK_WAKEKILL into account. This means that mutex_lock_killable(), wait_for_completion_killable(), down_killable(), schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has no effect). Introduce the new helper, signal_pending_state(), and change schedule() to use it. Hopefully it will have more users, that is why the task's state is passed separately. Note this "__TASK_STOPPED \| __TASK_TRACED" check in signal_pending_state(). This is needed to preserve the current behaviour (ptrace_notify). I hope this check will be removed soon, but this (afaics good) change needs the separate discussion. The fast path is "(state & (INTERRUPTIBLE \| WAKEKILL)) + signal_pending(p)", basically the same that schedule() does now. However, this patch of course bloats schedule(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-10 11:37:25 +02:00
Linus Torvalds	156a9ea43a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/chrisw/lsm-2.6: capabilities: remain source compatible with 32-bit raw legacy capability support. LSM: remove stale web site from MAINTAINERS	2008-06-06 11:31:55 -07:00
Lai Jiangshan	37340746a6	cpusets: fix bug when adding nonexistent cpu or mem Adding a nonexistent cpu to a cpuset will be omitted quietly. It should return -EINVAL. Example: (real_nr_cpus <= 4 < NR_CPUS or cpu#4 was just offline) # cat cpus 0-1 # /bin/echo 4 > cpus # /bin/echo $? 0 # cat cpus # The same occurs when add a nonexistent mem. This patch will fix this bug. And when *buf == "", the check is unneeded. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-06-06 11:29:11 -07:00
Thomas Gleixner	e958b36004	sched: move weighted_cpuload into #ifdef CONFIG_SMP section weighted_cpuload is only used on SMP. move it into the CONFIG_SMP section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:25:02 +02:00
Max Krasnyansky	68f4f1ec08	sched: Move cpu masks from kernel/sched.c into kernel/cpu.c kernel/cpu.c seems a more logical place for those maps since they do not really have much to do with the scheduler these days. kernel/cpu.c is now built for the UP kernel too, but it does not affect the size the kernel sections. $ size vmlinux before text data bss dec hex filename 3313797 307060 310352 3931209 3bfc49 vmlinux after text data bss dec hex filename 3313797 307060 310352 3931209 3bfc49 vmlinux Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: pj@sgi.com Cc: menage@google.com Cc: rostedt@goodmis.org Cc: mingo@elte.hu Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:25:01 +02:00
Max Krasnyansky	5c8e1ed1d2	sched: CPU hotplug events must not destroy scheduler domains created by the cpusets First issue is not related to the cpusets. We're simply leaking doms_cur. It's allocated in arch_init_sched_domains() which is called for every hotplug event. So we just keep reallocation doms_cur without freeing it. I introduced free_sched_domains() function that cleans things up. Second issue is that sched domains created by the cpusets are completely destroyed by the CPU hotplug events. For all CPU hotplug events scheduler attaches all CPUs to the NULL domain and then puts them all into the single domain thereby destroying domains created by the cpusets (partition_sched_domains). The solution is simple, when cpusets are enabled scheduler should not create default domain and instead let cpusets do that. Which is exactly what the patch does. Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: pj@sgi.com Cc: menage@google.com Cc: rostedt@goodmis.org Cc: mingo@elte.hu Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:25:00 +02:00
Ingo Molnar	1100ac91b6	sched: fix cpuprio build bug this patch was not built on !SMP: kernel/sched_rt.c: In function 'inc_rt_tasks': kernel/sched_rt.c:404: error: 'struct rq' has no member named 'online' Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-06 15:19:46 +02:00
Thomas Gleixner	e539d8fcd1	sched: fix the cpuprio count really Peter pointed out that the last version of the "fix" was still one off under certain circumstances. Use BITS_TO_LONG instead to get an accurate result. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:44 +02:00
Gregory Haskins	709d4b0c60	sched: fix cpupri priocount A rounding error was pointed out by Peter Zijlstra which would result in the structure holding priorities to be off by one. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:43 +02:00
Gregory Haskins	1f11eb6a8b	sched: fix cpupri hotplug support The RT folks over at RedHat found an issue w.r.t. hotplug support which was traced to problems with the cpupri infrastructure in the scheduler: https://bugzilla.redhat.com/show_bug.cgi?id=449676 This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel. This patch applies to 25.4-rt4, though it should trivially apply to most cpupri enabled kernels mentioned above. It turned out that the issue was that offline cpus could get inadvertently registered with cpupri so that they were erroneously selected during migration decisions. The end result would be an OOPS as the offline cpu had tasks routed to it. This patch generalizes the old join/leave domain interface into an online/offline interface, and adjusts the root-domain/hotplug code to utilize it. I was able to easily reproduce the issue prior to this patch, and am no longer able to reproduce it after this patch. I can offline cpus indefinately and everything seems to be in working order. Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point me in the right direction. Also thank you to Peter for reviewing the early iterations of this patch. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:42 +02:00
Gautham R Shenoy	099f98c8a1	sched: print the sd->level in sched_domain_debug code While printing out the visual representation of the sched-domains, print the level (MC, SMT, CPU, NODE, ... ) of each of the sched_domains. Credit: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-06 15:19:41 +02:00
Dhaval Giani	6d6bc0ad86	sched: add comments for ifdefs in sched.c make sched.c easier to read. Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:38 +02:00
Arjan van de Ven	e21f5b153b	sched: print module list in the "scheduling while atomic" warning For the normal WARN_ON() etc we added a print-the-modules-list already, which is very useful to figure out candidates for certain types of bugs. This patch adds the same print to the "scheduling while atomic" BUG warning, for the same reason: when we get here it's very useful to see which modules are loaded, to narrow down the candidate code list. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: mingo@elte.hu Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:37 +02:00
Rabin Vincent	81d41d7ece	sched: fix defined-but-unused warning Fix this warning, which appears with !CONFIG_SMP: kernel/sched.c:1216: warning: `init_hrtick' defined but not used Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:34 +02:00
Thomas Gleixner	f7dcd80bbc	namespacecheck: fixes in kernel/sched.c Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:32 +02:00
Dmitry Adamushko	d07355f5de	sched: check for SD_SERIALIZE atomically in rebalance_domains() Nothing really serious here, mainly just a matter of nit-picking :-/ From: Dmitry Adamushko <dmitry.adamushko@gmail.com> For CONFIG_SCHED_DEBUG && CONFIG_SYSCT configs, sd->flags can be altered while being manipulated in rebalance_domains(). Let's do an atomic check. We rely here on the atomicity of read/write accesses for aligned words. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:30 +02:00
Gregory Haskins	6d299f1b53	sched: fix SCHED_OTHER balance iterator to include all tasks The currently logic inadvertently skips the last task on the run-queue, resulting in missed balance opportunities. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: David Bahi <dbahi@novell.com> CC: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:29 +02:00
Gregory Haskins	6e0534f278	sched: use a 2-d bitmap for searching lowest-pri CPU The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:28 +02:00
Mike Galbraith	f333fdc909	sched: make !hrtick faster it is safe to ignore timers and flags when the feature is disabled. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:27 +02:00
Gregory Haskins	45c01e8249	sched: prioritize non-migratable tasks over migratable ones Dmitry Adamushko pointed out a known flaw in the rt-balancing algorithm that could allow suboptimal balancing if a non-migratable task gets queued behind a running migratable one. It is discussed in this thread: http://lkml.org/lkml/2008/4/22/296 This issue has been further exacerbated by a recent checkin to sched-devel (git-id 5eee63a5ebc19a870ac40055c0be49457f3a89a3). >From a pure priority standpoint, the run-queue is doing the "right" thing. Using Dmitry's nomenclature, if T0 is on cpu1 first, and T1 wakes up at equal or lower priority (affined only to cpu1) later, it should wait for T0 to finish. However, in reality that is likely suboptimal from a system perspective if there are other cores that could allow T0 and T1 to run concurrently. Since T1 can not migrate, the only choice for higher concurrency is to try to move T0. This is not something we addessed in the recent rt-balancing re-work. This patch tries to enhance the balancing algorithm by accomodating this scenario. It accomplishes this by incorporating the migratability of a task into its priority calculation. Within a numerical tsk->prio, a non-migratable task is logically higher than a migratable one. We maintain this by introducing a new per-priority queue (xqueue, or exclusive-queue) for holding non-migratable tasks. The scheduler will draw from the xqueue over the standard shared-queue (squeue) when available. There are several details for utilizing this properly. 1) During task-wake-up, we not only need to check if the priority preempts the current task, but we also need to check for this non-migratable condition. Therefore, if a non-migratable task wakes up and sees an equal priority migratable task already running, it will attempt to preempt it if there is a likelyhood that the current task will find an immediate home. 2) Tasks only get this non-migratable "priority boost" on wake-up. Any requeuing will result in the non-migratable task being queued to the end of the shared queue. This is an attempt to prevent the system from being completely unfair to migratable tasks during things like SCHED_RR timeslicing. I am sure this patch introduces potentially "odd" behavior if you concoct a scenario where a bunch of non-migratable threads could starve migratable ones given the right pattern. I am not yet convinced that this is a problem since we are talking about tasks of equal RT priority anyway, and there never is much in the way of guarantees against starvation under that scenario anyway. (e.g. you could come up with a similar scenario with a specific timing environment verses an affinity environment). I can be convinced otherwise, but for now I think this is "ok". Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-06 15:19:25 +02:00
Max Krasnyansky	1840475676	genirq: Expose default irq affinity mask (take 3) Current IRQ affinity interface does not provide a way to set affinity for the IRQs that will be allocated/activated in the future. This patch creates /proc/irq/default_smp_affinity that lets users set default affinity mask for the newly allocated IRQs. Changing the default does not affect affinity masks for the currently active IRQs, they have to be changed explicitly. Updated based on Paul J's comments and added some more documentation. Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: pj@sgi.com Cc: a.p.zijlstra@chello.nl Cc: tglx@linutronix.de Cc: rdunlap@xenotime.net Cc: mingo@elte.hu Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-06-05 15:18:30 +02:00
Linus Torvalds	3b5b60b821	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdbts: Use HW breakpoints with CONFIG_DEBUG_RODATA kgdb: use common ascii helpers and put_unaligned_be32 helper	2008-06-04 08:08:27 -07:00
Steven Rostedt	ad90c0e3ce	ftrace: user update and disable dynamic ftrace daemon In dynamic ftrace, the mcount function starts off pointing to a stub function that just returns. On start up, the call to the stub is modified to point to a "record_ip" function. The job of the record_ip function is to add the function to a pre-allocated hash list. If the function is already there, it simply is ignored, otherwise it is added to the list. Later, a ftraced daemon wakes up and calls kstop_machine if any functions have been recorded, and changes the calls to the recorded functions to a simple nop. If no functions were recorded, the daemon goes back to sleep. The daemon wakes up once a second to see if it needs to update any newly recorded functions into nops. Usually it does not, but if a lot of code has been executed for the first time in the kernel, the ftraced daemon will call kstop_machine to update those into nops. The problem currently is that there's no way to stop the daemon from doing this, and it can cause unneeded latencies (800us which for some is bothersome). This patch adds a new file /debugfs/tracing/ftraced_enabled. If the daemon is active, reading this will return "enabled\n" and "disabled\n" when the daemon is not running. To disable the daemon, the user can echo "0" or "disable" into this file, and "1" or "enable" to re-enable the daemon. Since the daemon is used to convert the functions into nops to increase the performance of the system, I also added that anytime something is written into the ftraced_enabled file, kstop_machine will run if there are new functions that have been detected that need to be converted. This way the user can disable the daemon but still be able to control the conversion of the mcount calls to nops by simply, "echo 0 > /debugfs/tracing/ftraced_enabled" when they need to do more conversions. To see the number of converted functions: "cat /debugfs/tracing/dyn_ftrace_total_info" Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-02 12:50:04 +02:00
Abhishek Sagar	76094a2cf4	ftrace: distinguish kretprobe'd functions in trace logs Tracing functions via ftrace which have a kretprobe installed on them, can produce misleading output in their trace logs. E.g, consider the correct trace of the following sequence: do_IRQ() { ~ irq_enter(); ~ } Trace log (sample): <idle>-0 [00] 4154504455.781616: irq_enter <- do_IRQ But if irq_enter() has a kretprobe installed on it, the return value stored on the stack at each invocation is modified to divert the return to a kprobe trampoline function called kretprobe_trampoline(). So with this the trace would (currently) look like: <idle>-0 [00] 4154504455.781616: irq_enter <- kretprobe_trampoline Now this is quite misleading to the end user, as it suggests something that didn't actually happen. So just to avoid such misinterpretations, the inlined patch aims to output such a log as: <idle>-0 [00] 4154504455.781616: irq_enter <- [unknown/kretprobe'd] Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-02 12:41:19 +02:00
Jason Wessel	8c2238eaaf	softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression The touch_nmi_watchdog() routine on x86 ultimately calls touch_softlockup_watchdog(). The problem is that to touch the softlockup watchdog, the cpu_clock code has to be called which could involve multiple cpu locks and can lead to a hard hang if one of the locks is held by a processor that is not going to return anytime soon (such as could be the case with kgdb or perhaps even with some other kind of exception). This patch causes the public version of the touch_softlockup_watchdog() to defer the cpu clock access to a later point. The test case for this problem is to use the following kernel config options: CONFIG_KGDB_TESTS=y CONFIG_KGDB_TESTS_ON_BOOT=y CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000" It should be noted that kgdb test suite and these options were not available until 2.6.26-rc2, so it was necessary to patch the kgdb test suite during the bisection. I would consider this patch a regression fix because the problem first appeared in commit `27ec440779` when some logic was added to try to periodically sync the clocks. It was possible to work around this particular problem by simply not performing the sync anytime the system was in a critical context. This was ok until commit `3e51f33fcc`, which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some multi-cpu locks to sync the clocks. It became clear that accessing this code from an nmi was the source of the lockups. Avoiding the access to the low level clock code from an code inside the NMI processing also fixed the problem with the 27ec44... commit. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-06-02 12:38:27 +02:00
Andrew G. Morgan	ca05a99a54	capabilities: remain source compatible with 32-bit raw legacy capability support. Source code out there hard-codes a notion of what the _LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the raw capability system calls capget() and capset(). Its unfortunate, but true. Since the confusing header file has been in a released kernel, there is software that is erroneously using 64-bit capabilities with the semantics of 32-bit compatibilities. These recently compiled programs may suffer corruption of their memory when sys_getcap() overwrites more memory than they are coded to expect, and the raising of added capabilities when using sys_capset(). As such, this patch does a number of things to clean up the situation for all. It 1. forces the _LINUX_CAPABILITY_VERSION define to always retain its legacy value. 2. adopts a new #define strategy for the kernel's internal implementation of the preferred magic. 3. deprecates v2 capability magic in favor of a new (v3) magic number. The functionality of v3 is entirely equivalent to v2, the only difference being that the v2 magic causes the kernel to log a "deprecated" warning so the admin can find applications that may be using v2 inappropriately. [User space code continues to be encouraged to use the libcap API which protects the application from details like this. libcap-2.10 is the first to support v3 capabilities.] Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518. Thanks to Bojan Smojver for the report. [akpm@linux-foundation.org: s/depreciate/deprecate/g] [akpm@linux-foundation.org: be robust about put_user size] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Bojan Smojver <bojan@rexursive.com> Cc: stable@kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>	2008-05-31 16:36:16 -07:00
Ingo Molnar	7a14ce1d8c	nohz: reduce jiffies polling overhead Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-30 14:16:10 +02:00
Ingo Molnar	02ff375590	softlockup: fix false positives on nohz if CPU is 100% idle for more than 60 seconds Fix (probably theoretical only) rq->clock update bug: in tick_nohz_update_jiffies() [which is called on all irq entry on all cpus where the irq entry hits an idle cpu] we call touch_softlockup_watchdog() before we update jiffies. That works fine most of the time when idle timeouts are within 60 seconds. But when an idle timeout is beyond 60 seconds, jiffies is updated with a jump of more than 60 seconds, which causes a jump in cpu-clock of more than 60 seconds, triggering a false positive. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-30 14:15:02 +02:00
Linus Torvalds	a7f75d3bed	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: re-tune NUMA topologies sched: stop wake_affine from causing serious imbalance sched: fix sched_clock_cpu() revert ("sched: fair-group: SMP-nice for group scheduling") sched: cleanup show_schedstat(): fix memleak sched: unite unlikely pairs in rt_policy() and schedule_debug() revert ("sched: fair: weight calculations")	2008-05-29 09:26:17 -07:00
Ingo Molnar	6715930654	Merge commit 'linus/master' into sched-fixes-for-linus	2008-05-29 16:05:05 +02:00
Mike Galbraith	b3137bc8e7	sched: stop wake_affine from causing serious imbalance Prevent short-running wakers of short-running threads from overloading a single cpu via wakeup affinity, and wire up disconnected debug option. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-29 11:29:20 +02:00
Peter Zijlstra	a381759d6a	sched: fix sched_clock_cpu() Make sched_clock_cpu() return 0 before it has been initialized and avoid corrupting its state due to doing so. This fixes the weird printk timestamp jump reported. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-05-29 11:29:19 +02:00
Ingo Molnar	6363ca57c7	revert ("sched: fair-group: SMP-nice for group scheduling") Yanmin Zhang reported: Comparing with 2.6.25, volanoMark has big regression with kernel 2.6.26-rc1. It's about 50% on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. With bisect, I located the following patch: \| `18d95a2832` is first bad commit \| commit `18d95a2832` \| Author: Peter Zijlstra <a.p.zijlstra@chello.nl> \| Date: Sat Apr 19 19:45:00 2008 +0200 \| \| sched: fair-group: SMP-nice for group scheduling Revert it so that we get v2.6.25 behavior. Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-29 11:28:57 +02:00
Ingo Molnar	4285f594f8	sched: cleanup Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-29 11:25:15 +02:00
Adrian Bunk	c6fba5451a	show_schedstat(): fix memleak The Coverity checker spotted a memleak introduced by commit `39106dcf85` (cpumask: use new cpus_scnprintf function). It seems the kfree() got lost between v2 and v3 of this patch... Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-29 11:25:15 +02:00
Roel Kluin	3f33a7ce95	sched: unite unlikely pairs in rt_policy() and schedule_debug() Removes obfuscation and may improve assembly. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-29 11:25:14 +02:00
Ingo Molnar	f9305d4a09	revert ("sched: fair: weight calculations") Yanmin Zhang reported: Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has many regressions with 2.6.26-rc1: 1) 8-core stoakley: 28%; 2) 16-core tigerton: 20%; 3) Itanium Montvale: 50%. Bisect located this patch: \| `8f1bc385cf` is first bad commit \| commit `8f1bc385cf` \| Author: Peter Zijlstra <a.p.zijlstra@chello.nl> \| Date: Sat Apr 19 19:45:00 2008 +0200 \| \| sched: fair: weight calculations Revert it to the 2.6.25 state. Bisected-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-29 11:24:01 +02:00
Harvey Harrison	827e609b45	kgdb: use common ascii helpers and put_unaligned_be32 helper Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2008-05-28 12:49:56 -05:00
Tom Zanussi	a82c53a0e3	splice: fix sendfile() issue with relay Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-05-28 14:49:27 +02:00
Markus Armbruster	9e124fe16f	xen: Enable console tty by default in domU if it's not a dummy Without console= arguments on the kernel command line, the first console to register becomes enabled and the preferred console (the one behind /dev/console). This is normally tty (assuming CONFIG_VT_CONSOLE is enabled, which it commonly is). This is okay as long tty is a useful console. But unless we have the PV framebuffer, and it is enabled for this domain, tty0 in domU is merely a dummy. In that case, we want the preferred console to be the Xen console hvc0, and we want it without having to fiddle with the kernel command line. Commit `b8c2d3dfbc` did that for us. Since we now have the PV framebuffer, we want to enable and prefer tty again, but only when PVFB is enabled. But even then we still want to enable the Xen console as well. Problem: when tty registers, we can't yet know whether the PVFB is enabled. By the time we can know (xenstore is up), the console setup game is over. Solution: enable console tty by default, but keep hvc as the preferred console. Change the preferred console to tty when PVFB probes successfully, unless we've been given console kernel parameters. Signed-off-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-27 10:11:36 +02:00
Carlos R. Mafra	900cfa4619	hrtimer: Remove unused variables in ktime_divns() The variables dns and inc are not used, remove them. Signed-off-by: Carlos R. Mafra <crmafra@gmail.com> Cc: tglx@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 23:26:55 +02:00
Jeremy Fitzhardinge	d031476408	hrtimer: remove warning in hres_timers_resume hres_timers_resume() warns if there appears to be more than one cpu online. This warning makes sense when the suspend/resume mechanism offlines all cpus but one during the suspend/resume process. However, Xen suspend does not need to offline the other cpus; it merely keeps them tied up in stop_machine() while the virtual machine is suspended. The warning hres_timers_resume issues is therefore spurious. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 23:26:32 +02:00
Abhishek Sagar	492a7ea5bc	ftrace: fix updating of ftrace_update_cnt Hi Ingo/Steven, Ftrace currently maintains an update count which includes false updates, i.e, updates which failed. If anything, such failures should be tracked by some separate variable, but this patch provides a minimal fix. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Cc: rostedt@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 22:52:10 +02:00
Abhishek Sagar	ffdaa3582b	ftrace: safe traversal of ftrace_hash hlist Hi Steven, I noticed that concurrent instances of ftrace_record_ip() have a race between ftrace_hash list traversal during ftrace_ip_in_hash() (before acquiring ftrace_shutdown_lock) and ftrace_add_hash(). If it's so then this should fix it. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Cc: rostedt@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 22:52:04 +02:00
Steven Rostedt	41bc8144d0	ftrace: fix up cmdline recording The new work with converting the trace hooks over to markers broke the command line recording of ftrace. This patch fixes it again. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 22:51:49 +02:00
Steven Rostedt	41c52c0db9	ftrace: set_ftrace_notrace feature While debugging latencies in the RT kernel, I found that it would be nice to be able to filter away functions from the trace than just to filter on functions. I added a new interface to the debugfs tracing directory called set_ftrace_notrace When dynamic frace is enabled, this lets you filter away functions that will not be recorded in the trace. It is similar to adding 'notrace' to those functions but by doing it without recompiling the kernel. Here's how set_ftrace_filter and set_ftrace_notrace interact. Remember, if set_ftrace_filter is set, it removes all functions from the trace execpt for those listed in the set_ftrace_filter. set_ftrace_notrace will prevent those functions from being traced. If you were to set one function in both set_ftrace_filter and set_ftrace_notrace and that function was the same, then you would end up with an empty trace. the set of functions to trace is: set_ftrace_filter == empty then all functions not in set_ftrace_notrace else set of the set_ftrace_filter and not in set of set_ftrace_notrace. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 22:51:37 +02:00
Steven Rostedt	da89a7a253	ftrace: remove printks from irqsoff trace Printing out new max latencies was fine for the old RT tracer. But for mainline it is a bit messy. We also need to test if the run queue is locked before we can do the print. This means that we may not be printing out latencies if the run queue is locked on another CPU. This produces inconsistencies in the output. This patch simply removes the print altogether. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: pq@iki.fi Cc: proski@gnu.org Cc: sandmann@redhat.com Cc: a.p.zijlstra@chello.nl Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 22:51:27 +02:00
Steven Rostedt	7e18d8e701	ftrace: add function tracing to wake up tracing This patch adds function tracing to the functions that are called on the CPU of the task being traced. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: pq@iki.fi Cc: proski@gnu.org Cc: sandmann@redhat.com Cc: a.p.zijlstra@chello.nl Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 22:51:22 +02:00
Steven Rostedt	4902f8849d	ftrace: move ftrace_special to trace.c Move the ftrace_special out of sched_switch to trace.c. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: pq@iki.fi Cc: proski@gnu.org Cc: sandmann@redhat.com Cc: a.p.zijlstra@chello.nl Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 22:51:09 +02:00
Steven Rostedt	19384c0314	ftrace: limit use of check pages The check_pages function is called often enough that it can cause problems with trace outputs or even bringing the system to a halt. This patch limits the check_pages to the places that are most likely to have problems. The check is made at the flip between the global array and the max save array, as well as when the size of the buffers changes and the self tests. This patch also removes the BUG_ON from check_pages and replaces it with a WARN_ON and disabling of the tracer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: pq@iki.fi Cc: proski@gnu.org Cc: sandmann@redhat.com Cc: a.p.zijlstra@chello.nl Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-26 22:39:45 +02:00
Oleg Nesterov	cbaffba12c	posix timers: discard SI_TIMER signals on exec Based on Roland's patch. This approach was suggested by Austin Clements from the very beginning, and then by Linus. As Austin pointed out, the execing task can be killed by SI_TIMER signal because exec flushes the signal handlers, but doesn't discard the pending signals generated by posix timers. Perhaps not a bug, but people find this surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-26 10:37:07 -07:00
Oleg Nesterov	c8e85b4f4b	posix timers: sigqueue_free: don't free sigqueue if it is queued Currently sigqueue_free() removes sigqueue from list, but doesn't cancel the pending signal. This is not consistent, the task should either receive the "full" signal along with siginfo_t, or it shouldn't receive the signal at all. Change sigqueue_free() to clear SIGQUEUE_PREALLOC but leave sigqueue on list if it is queued. This is a user-visible change. If the signal is blocked, it stays queued after sys_timer_delete() until unblocked with the "stale" si_code/si_value, and of course it is still counted wrt RLIMIT_SIGPENDING which also limits the number of posix timers. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Austin Clements <amdragon+kernelbugzilla@mit.edu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-26 10:37:06 -07:00
Ingo Molnar	c6531cce6e	sched: do not trace sched_clock The tracer uses sched_clock, so do not trace it. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-25 16:02:23 +02:00
Carlos R. Mafra	962cf36c5b	Remove argument from open_softirq which is always NULL As git-grep shows, open_softirq() is always called with the last argument being NULL block/blk-core.c: open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL); kernel/hrtimer.c: open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL); kernel/rcuclassic.c: open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL); kernel/rcupreempt.c: open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL); kernel/sched.c: open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL); kernel/softirq.c: open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL); kernel/softirq.c: open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL); kernel/timer.c: open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL); net/core/dev.c: open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL); net/core/dev.c: open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL); This observation has already been made by Matthew Wilcox in June 2002 (http://www.cs.helsinki.fi/linux/linux-kernel/2002-25/0687.html) "I notice that none of the current softirq routines use the data element passed to them." and the situation hasn't changed since them. So it appears we can safely remove that extra argument to save 128 (54) bytes of kernel data (text). Signed-off-by: Carlos R. Mafra <crmafra@ift.unesp.br> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-25 07:43:15 +02:00
Ingo Molnar	1c4cd6dd1d	softlockup: fix softlockup_thresh fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-25 06:35:19 +02:00
Dimitri Sivanich	9383d96790	softlockup: fix softlockup_thresh unaligned access and disable detection at runtime Fix unaligned access errors when setting softlockup_thresh on 64 bit platforms. Allow softlockup detection to be disabled by setting softlockup_thresh <= 0. Detect that boot time softlockup detection has been disabled earlier in softlockup_tick. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-25 06:35:03 +02:00
Ingo Molnar	9c44bc03ff	softlockup: allow panic on lockup allow users to configure the softlockup detector to generate a panic instead of a warning message. high-availability systems might opt for this strict method (combined with panic_timeout= boot option/sysctl), instead of generating softlockup warnings ad infinitum. also, automated tests work better if the system reboots reliably (into a safe kernel) in case of a lockup. The full spectrum of configurability is supported: boot option, sysctl option and Kconfig option. it's default-disabled. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-25 06:34:44 +02:00
Nick Andrew	0915930805	printk: remember the message level for multi-line output printk(KERN_ALERT "Danger Will Robinson!\nAlien Approaching!\n"); At present this will result in one message at ALERT level and one at the current default message loglevel (e.g. WARNING). This is non-intuitive. Modify vprintk() to remember the message loglevel each time it is specified and use it for subsequent lines of output which do not specify one, within the same call to printk. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 23:15:19 +02:00
Nick Andrew	ac60ad7413	printk: refactor processing of line severity tokens Restructure the logic of vprintk() so the processing of the leading 3 characters of each input line is in one place, regardless whether printk_time is enabled. This makes the code smaller and easier to understand. size reduction in kernel/printk.o: text data bss dec hex filename 6157 397 1049804 1056358 101e66 printk.o.before 6117 397 1049804 1056318 101e3e printk.o.after and some style uncleanlinesses removed as well as a side-effect: Before: total: 19 errors, 22 warnings, 1340 lines checked After: total: 17 errors, 22 warnings, 1333 lines checked Signed-off-by: Nick Andrew <nick@nick-andrew.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 23:15:12 +02:00
Jan Kiszka	cd3a1b8562	printk: don't prefer unsuited consoles on registration console election: If some console happens to be registered first which does not provide a tty binding (!console->device), it prevents that more suited consoles which are registered later on can enter the candidate pool for console_device(). This is observable with KGDB's console which may already be registered (and exploited!) during early debugger connections, that is before any regular console registration. This patch fixes the issue by postponing the final, automated preferred_console selection until someone with a non-NULL device handler comes around. Signed-off-by: Jan Kiszka <jan.kiszka@web.de> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Gerd Hoffmann <kraxel@suse.de> Cc: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 23:15:07 +02:00
Tejun Heo	3b8945e8d4	printk: clean up recursion check related static variables Make printk_recursion_bug_msg static and drop printk prefix from recursion variables. Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 23:15:02 +02:00
Thomas Gleixner	42fdfa238a	namespacecheck: more kernel/printk.c fixes [ Stephen Rothwell <sfr@canb.auug.org.au>: build fix ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 23:14:51 +02:00
Thomas Gleixner	7f6f3a39d2	namespacecheck: fix kernel printk.c Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 23:12:18 +02:00
Cedric Le Goater	5c02b57578	cgroups: remove node_ prefix_from ns subsystem This is a slight change in the namespace cgroup subsystem api. The change is that previously when cgroup_clone() was called (currently only from the unshare path in ns_proxy cgroup, you'd get a new group named "node_$pid" whereas now you'll get a group named after just your pid.) The only users who would notice it are those who are using the ns_proxy cgroup subsystem to auto-create cgroups when namespaces are unshared - something of an experimental feature, which I think really needs more complete container/namespace support in order to be useful. I suspect the only users are Cedric and Serge, or maybe a few others on containers@lists.linux-foundation.org. And in fact it would only be noticed by the users who make the assumption about how the name is generated, rather than getting it from the /proc/<pid>/cgroups file for the process in question. Whether the change is actually needed or not I'm fairly agnostic on, but I guess it is more elegant to just use the pid as the new group name rather than adding a fairly arbitrary "node_" prefix on the front. [menage@google.com: provided changelog] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: "Paul Menage" <menage@google.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:14 -07:00
Shi Weihua	7b26655f62	sys_prctl(): fix return of uninitialized value If none of the switch cases match, the PR_SET_PDEATHSIG and PR_SET_DUMPABLE cases of the switch statement will never write to local variable `error'. Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com> Cc: Andrew G. Morgan <morgan@kernel.org> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:13 -07:00
Oleg Nesterov	da7978b034	signals: fix sigqueue_free() vs __exit_signal() race __exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock. This can race with another thread doing sigqueue_free(), we can free the same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list. Note that even sys_exit_group() can trigger this race, not only sys_timer_delete(). Move the callsite of flush_sigqueue(tsk->pending) under ->siglock. This patch doesn't touch flush_sigqueue(->shared_pending) below, it is called when there are no other threads which can play with signals, and sigqueue_free() can't be used outside of our thread group. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-24 09:56:10 -07:00
Hiroshi Shimamoto	81d50bb254	posix-timers: print RT watchdog message It's useful to detect which process is killed by RT watchdog. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 18:49:22 +02:00
Thomas Gleixner	4d2df795f0	sysprof: make it depend on X86 Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 15:00:46 +02:00
Pekka Paalanen	dee310d0ad	x86 mmiotrace: use resource_size_t for phys addresses Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-24 11:27:36 +02:00
Pekka Paalanen	e0fd5c2fa1	mmiotrace: do not print bogus pid for maps either Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-24 11:27:15 +02:00
Pekka Paalanen	2039238b79	mmiotrace: print overrun counts Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-24 11:27:08 +02:00
Pekka Paalanen	d0a7e8ca5b	mmiotrace: print header using the read hook. Now the header is printed only for `trace_pipe' file. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-24 11:27:03 +02:00
Pekka Paalanen	736ca61fa8	x86 mmiotrace: Do not print bogus pid Non-zero pid indicates the MMIO access originated in user space. We do not catch that kind of accesses yet, so always print zero for now. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 11:26:00 +02:00
Ingo Molnar	801a175bf6	mmiotrace: ftrace fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 11:25:23 +02:00
Pekka Paalanen	138295373c	ftrace: mmiotrace update, #2 another weekend, another patch. This should apply on top of my previous patch from March 23rd. Summary of changes: - Print PCI device list in output header - work around recursive probe hits on SMP - refactor dis/arm_kmmio_fault_page() and add check for page levels - remove un/reference_kmmio(), the die notifier hook is registered permanently into the list - explicitly check for single stepping in die notifier callback I have tested this version on my UP Athlon64 desktop with Nouveau, and SMP Core 2 Duo laptop with the proprietary nvidia driver. Both systems are 64-bit. One previously unknown bug crept into daylight: the ftrace framework's output routines print the first entry last after buffer has wrapped around. The most important regressions compared to non-ftrace mmiotrace at this time are: - failure of trace_pipe file - illegal lines in output file - unaware of losing data due to buffer full Personally I'd like to see these three solved before submitting to mainline. Other issues may come up once we know when we lose events. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 11:25:16 +02:00
Pekka Paalanen	bd8ac686c7	ftrace: mmiotrace, updates here is a patch that makes mmiotrace work almost well within the tracing framework. The patch applies on top of my previous patch. I have my own output formatting in place now. Summary of changes: - fix the NULL dereference that was due to not calling tracing_reset() - add print_line() callback into struct tracer - implement print_line() for mmiotrace, producing up-to-spec text - add my output header, but that is not really called in the right place - rewrote the main structs in mmiotrace - added two new trace entry types: TRACE_MMIO_RW and TRACE_MMIO_MAP - made some functions in trace.c non-static - check current==NULL in tracing_generic_entry_update() - fix(?) comparison in trace_seq_printf() Things seem to work fine except a few issues. Markers (text lines injected into mmiotrace log) are missing, I did not feel hacking them in before we have variable length entries. My output header is printed only for 'trace' file, but not 'trace_pipe'. For some reason, despite my quick fix, iter->trace is NULL in print_trace_line() when called from 'trace_pipe' file, which means I don't get proper output formatting. I only tried by loading nouveau.ko, which just detects the card, and that is traced fine. I didn't try further. Map, two reads and unmap. Works perfectly. I am missing the information about overflows, I'd prefer to have a counter for lost events. I didn't try, but I guess currently there is no way of knowning when it overflows? So, not too far from being fully operational, it seems :-) And looking at the diffstat, there also is some 700-900 lines of user space code that just became obsolete. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 11:24:53 +02:00
Pekka Paalanen	f984b51e07	ftrace: add mmiotrace plugin On Sat, 22 Mar 2008 13:07:47 +0100 Ingo Molnar <mingo@elte.hu> wrote: > > > i'd suggest the following: pull x86.git and sched-devel.git into a > > > single tree [the two will combine without rejects]. Then try to add a > > > kernel/tracing/trace_mmiotrace.c ftrace plugin. The trace_sysprof.c > > > plugin might be a good example. > > > > I did this and now I have mmiotrace enabled/disabled via the tracing > > framework (what do we call this, since ftrace is one of the tracers?). > > cool! could you send the patches for that? (even if they are not fully > functional yet) Patch attached in the end. Nice to see how much code disappeared. I tried to mark all the features I had to break with XXX-comments. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 11:22:43 +02:00
Johannes Berg	bfeeeeb991	stacktrace: don't crash on invalid stack trace structs This patch makes the stacktrace printout code \warn when the entries pointer is unset rather than crashing when trying to access it in an attempt to make it a bit more robust. I was saving a stacktrace into an skb and forgot to copy it across skb copies... I have since fixed the code, but it would have been easier had the kernel not crashed in an interrupt. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-24 01:16:38 +02:00
Soeren Sandmann	cf3271a73b	ftrace/sysprof: don't trace the user stack if we are a kernel thread. Check that current->mm is non-NULL before attempting to trace the user stack. Also take depth of the kernel stack into account when comparing against sample_max_depth. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:59:12 +02:00
Ingo Molnar	8a9e94c1fb	sysprof: update copyrights Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:59:00 +02:00
Soeren Sandmann Pedersen	cd2134b1dd	sysprof: kernel trace add kernel backtracing to the sysprof tracer. change the format of the data, so that type=0 means beginning of stack trace, 1 means kernel address, 2 means user address, and 3 means end of trace. EIP addresses are no longer distinguished from return addresses, mostly because sysprof userspace doesn't make use of it. It may be worthwhile adding this back in though, just in case it becomes interesting. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:58:50 +02:00
Thomas Gleixner	5fc4511c75	ftrace: make it more available in the Kconfig Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:58:21 +02:00
Thomas Gleixner	9caee613d3	ftrace: fix __trace_special() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:55:54 +02:00
Thomas Gleixner	ada6b83506	ftrace: remove notrace Remove the notrace annotations. The build logic takes care of that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:50:41 +02:00
Ingo Molnar	d618b3e6e5	ftrace: sysprof updates make the sample period configurable. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:40:22 +02:00
Ingo Molnar	9f6b4e3f4a	ftrace: sysprof fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:40:13 +02:00
Ingo Molnar	ef4ab15ff3	ftrace: make sysprof dependent on x86 for now that's the only tested platform for now. If there's interest we can make it generic easily. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:40:01 +02:00
Ingo Molnar	842af315e8	ftrace: sysprof plugin improvement add sample maximum depth. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:39:47 +02:00
Ingo Molnar	a6dd24f8d0	ftrace: sysprof-plugin, add self-tests Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:39:38 +02:00
Ingo Molnar	56a08bdcff	ftrace: extend sysprof plugin some more Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:39:26 +02:00
Ingo Molnar	0075fa8030	ftrace: extend sysprof plugin add per CPU hrtimers. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:39:13 +02:00
Ingo Molnar	f06c38103e	ftrace: add sysprof plugin very first baby version. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 23:39:00 +02:00
Steven Rostedt	677aa9f77e	ftrace: add have dynamic ftrace config for archs Now that ftrace is being ported to other architectures, it has become apparent that DYNAMIC_FTRACE is dependent on whether or not that architecture implements dynamic ftrace. FTRACE itself may be ported to an architecture without porting dynamic ftrace. This patch adds HAVE_DYNAMIC_FTRACE to allow architectures to port ftrace without having to also port the dynamic aspect as well. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:49:18 +02:00
Steven Rostedt	6ec562328f	ftrace: use the new kbuild CFLAGS_REMOVE for kernel directory This patch removes the Makefile turd and uses the nice CFLAGS_REMOVE macro in the kernel directory. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:46:09 +02:00
Steven Rostedt	4e491d14f2	ftrace: support for PowerPC This patch adds full support for ftrace for PowerPC (both 64 and 32 bit). This includes dynamic tracing and function filtering. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:43:11 +02:00
Ingo Molnar	2d8b820b2e	ftrace: cleanups factor out code and clean it up. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:36:37 +02:00
Ingo Molnar	37135677e6	ftrace: fix mcount export bug David S. Miller noticed the following bug: the -pg instrumentation function callback is named differently on each platform. On x86 it is mcount, on sparc it is _mcount. So the export does not make sense in kernel/trace/ftrace.c - move it to x86. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:36:24 +02:00
David Miller	aa5e5ceaf5	ftrace: remove packed attribute on ftrace_page. It causes unaligned access traps on platforms like sparc (ftrace_page may be marked packed, but once we return a dyn_ftrace sub-object from this array to another piece of code, the "packed" part of the typing information doesn't propagate). But also, it didn't serve any purpose either. Even if packed, on 64-bit or 32-bit, it didn't give us any more dyn_ftrace entries per-page. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:35:57 +02:00
Ingo Molnar	74f4e369fc	ftrace: stacktrace fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:34:56 +02:00
Mathieu Desnoyers	5b82a1b08a	Port ftrace to markers Porting ftrace to the marker infrastructure. Don't need to chain to the wakeup tracer from the sched tracer, because markers support multiple probes connected. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:29:25 +02:00
Mathieu Desnoyers	dc102a8fae	Markers - remove extra format argument Denys Vlasenko <vda.linux@googlemail.com> : > Not in this patch, but I noticed: > > #define __trace_mark(name, call_private, format, args...) \ > do { \ > static const char __mstrtab_##name[] \ > __attribute__((section("__markers_strings"))) \ > = #name "\0" format; \ > static struct marker __mark_##name \ > __attribute__((section("__markers"), aligned(8))) = \ > { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)], \ > 0, 0, marker_probe_cb, \ > { __mark_empty_function, NULL}, NULL }; \ > __mark_check_format(format, ## args); \ > if (unlikely(__mark_##name.state)) { \ > (__mark_##name.call) \ > (&__mark_##name, call_private, \ > format, ## args); \ > } \ > } while (0) > > In this call: > > (__mark_##name.call) \ > (&__mark_##name, call_private, \ > format, ## args); \ > > you make gcc allocate duplicate format string. You can use > &__mstrtab_##name[sizeof(#name)] instead since it holds the same string, > or drop ", format," above and "const char fmt" from here: > > void (call)(const struct marker mdata, / Probe wrapper / > void call_private, const char *fmt, ...); > > since mdata->format is the same and all callees which need it can take it there. Very good point. I actually thought about dropping it, since it would remove an unnecessary argument from the stack. And actually, since I now have the marker_probe_cb sitting between the marker site and the callbacks, there is no API change required. Thanks :) Mathieu Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> CC: Denys Vlasenko <vda.linux@googlemail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:25:27 +02:00
Steven Rostedt	3eefae994d	ftrace: limit trace entries Currently there is no protection from the root user to use up all of memory for trace buffers. If the root user allocates too many entries, the OOM killer might start kill off all tasks. This patch adds an algorith to check the following condition: pages_requested > (freeable_memory + current_trace_buffer_pages) / 4 If the above is met then the allocation fails. The above prevents more than 1/4th of freeable memory from being used by trace buffers. To determine the freeable_memory, I made determine_dirtyable_memory in mm/page-writeback.c global. Special thanks goes to Peter Zijlstra for suggesting the above calculation. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:05:14 +02:00
Pekka Paalanen	6c6c27969a	ftrace: add readpos to struct trace_seq; add trace_seq_to_user() Refactor code from tracing_read_pipe() and create trace_seq_to_user(). Moved trace_seq_reset() call before iter->trace->read() call so that when all leftover data is returned, trace_seq is reset automatically. Signed-off-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:02:13 +02:00
Steven Rostedt	2bb6f8d638	ftrace: use raw_smp_processor_id for mcount functions Due to debug hooks in the kernel that can change the way smp_processor_id works, use raw_smp_processor_id in mcount called functions (namely ftrace_record_ip). Currently we annotate most debug functions from calling mcount, but we should not rely on that to prevent kernel lockups. This patch uses the raw_smp_processor_id to prevent a recusive crash that can happen if a debug hook in smp_processor_id calls mcount. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:01:34 +02:00
Ingo Molnar	a4feb8348b	ftrace: special stacktrace Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 22:01:13 +02:00
Ingo Molnar	9fe068e92f	ftrace: trace faster Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:57:15 +02:00
Steven Rostedt	4823ed7ead	ftrace: fix setting of pos in read_pipe In resetting the iterator in read_pipe, the reset of pos was postitioned in the wrong location with respect to the memset operation. The current code sets pos, incorrectly, to zero. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:57:01 +02:00
Steven Rostedt	107bad8bef	ftrace: add trace pipe header pluggin This patch adds a method for open_pipe and open_read to the pluggins so that they can add a header to the trace pipe call. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:56:29 +02:00
Steven Rostedt	53d0aa7730	ftrace: add logic to record overruns This patch sets up the infrastructure to record overruns of the tracing buffer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:56:02 +02:00
Steven Rostedt	25b0b44a1c	ftrace: fix comm on function trace output In cleaning up of the sched_switch code, the function trace recording of task comms was removed. This patch adds back the recording of comms for function trace. The output of ftrace now has the task comm instead of <...>. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:52:12 +02:00
Steven Rostedt	4fcdae83ce	ftrace: comment code This is first installment of adding documentation to the ftrace. Expect many more patches of this kind in the near future. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:52:01 +02:00
Steven Rostedt	ab46428c69	ftrace: modulize the number of CPU buffers Currently ftrace allocates a trace buffer for every possible CPU. Work is being done to change it to only online CPUs and add hooks to hotplug CPUS. This patch lays out the infrastructure for such a change. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:51:46 +02:00
Steven Rostedt	c6caeeb142	ftrace: replace simple_strtoul with strict_strtoul Andrew Morton suggested using strict_strtoul over simple_strtoul. This patch replaces them in ftrace. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:51:21 +02:00
Steven Rostedt	cffae437cd	ftrace: simple clean ups Andrew Morton mentioned some clean ups that should be done to ftrace. This patch does some of the simple clean ups. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:51:03 +02:00
Ingo Molnar	afc2abc0ae	ftrace: cleanups no code changed. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:50:50 +02:00
Thomas Gleixner	93dcc6ea09	ftrace: simplify hexprint simplify hex to ascii conversion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-23 21:50:38 +02:00
Steven Rostedt	bb065afb8e	lockdep: update lockdep_recursion on graph_lock With the introduction of ftrace, it is possible to recurse into the lockdep functions via the mcount call. To prevent possible lockups, updating the lockdep_recursion counter on grabbing the internal lockdep_lock should prevent deadlocks. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:50:21 +02:00
Steven Rostedt	a98a3c3fde	ftrace: trace_entries to dynamically change trace buffer size This patch adds /debug/tracing/trace_entries that allows users to see as well as modify the number of trace entries the buffers hold. The number of entries only increments in ENTRIES_PER_PAGE which is calculated by the size of an entry with the number of entries that can fit in a page. The user does not need to use an exact size, but the entries will be rounded to one of the increments. Trying to set the entries to 0 will return with -EINVAL. To avoid race conditions, the modification of the buffer size can only be done when tracing is completely disabled (current_tracer == none). A info message will be printed if a user tries to modify the buffer size when not set to none. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:49:54 +02:00
Steven Rostedt	05bd68c514	ftrace: user proper API for setting RT prios in selftest The wakeup selftest used an internal API for setting the test task priority. This patch fixes it to use the proper API for performing such a task. Thanks goes to Randy Dunlap for pointing out this build failure. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:49:42 +02:00
Steven Rostedt	2dc8f09571	ftrace: trace_pipe implement NONBLOCK This patch implements "NONBLOCK" for trace_pipe. If the trace_pipe is opened with O_NONBLOCK, then the trace_pipe read will not block when buffer is empty. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:47:07 +02:00
Steven Rostedt	845279972f	ftrace: return EOF in trace_pipe on change of tracer Break out of while loop with EOF when the current_trace changes. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:47:01 +02:00
Steven Rostedt	b5685aede3	ftrace: restore iterator trace in pipe read The trace iterator is reset in the read. We still need to restore the tracer that the trace_pipe was opened with. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:46:53 +02:00
Steven Rostedt	8487c23765	ftrace: allow trace_pipe to block on all reads We expect things like "cat" to block on reads to trace_pipe. That's what trace_pipe is for. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:46:39 +02:00
Ankita Garg	d17d969160	ftrace: fix conversion of task state to char in latency tracer The conversion of task states to a character in the sched_switch tracer (part of latency tracer infrastructure), seems to be incorrect. We currently do it by indexing into the state_to_char array using the state value. The state values do not map directly into the array index and are thus incorrect. The following patch addresses this issue. This is also what is being done even in the show_task() routine in kernel/sched.c The patch has been compile and run tested. Signed-off-by: Ankita Garg <ankita@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:46:30 +02:00
Thomas Gleixner	72829bc3d6	ftrace: move enums to ftrace.h and make helper function global picked from the mmiotracer patches to distangle the patch queues. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:37:28 +02:00
Steven Rostedt	30afdcb1de	ftrace: selftest protect againt max flip There is a slight race condition in the selftest where the max update of the wakeup and irqs/preemption off tests can be doing a max update as the buffers are being tested. If this happens the system can crash with a GPF. This patch adds the max update spinlock around the checking of the buffers to prevent such a race. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:15:59 +02:00
Steven Rostedt	d15f57f23e	ftrace: fix mutex unlock in trace output If the trace output changes on reading the trace files, there is a chance that the start function will return NULL. If the start function of a sequence returns NULL the stop equivalent is not called. In this case, all locks that are taken must be released even if they are released in the stop function. This patch fixes a case that a mutex was not released on return of NULL in the start sequence function. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:15:52 +02:00
Steven Rostedt	07a267cdd2	ftrace: add UNINTERRUPTIBLE state for kftraced on disable When dynamic ftrace fails and sets itself disabled, the ftraced daemon will go back to sleep everytime it wakes up. The setting of the ftraced state to UNINTERRUPTIBLE is skipped in this process, and the daemon takes up 100% of the CPU. This patch makes sure the ftraced daemon sets itself to UNINTERRUPTIBLE in that loop. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:15:42 +02:00
Ingo Molnar	c1d2327b36	ftrace: restrict tracing to HAVE_FTRACE architectures Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:15:29 +02:00
Steven Rostedt	1d09daa55d	ftrace: use Makefile to remove tracing from lockdep This patch removes the "notrace" annotation from lockdep and adds the debugging files in the kernel director to those that should not be compiled with "-pg" mcount tracing. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:15:14 +02:00
Steven Rostedt	92205c2343	ftrace: user raw_spin_lock in tracing Lock debugging enabled cause huge performance problems for tracing. Having the lock verification happening for every function that is called because mcount calls spin_lock can cripple the system. This patch converts the spin_locks used by ftrace into raw_spin_locks. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:14:11 +02:00
Steven Rostedt	c5f888cae4	ftrace: irqsoff use raw_smp_processor_id This patch changes the use of __get_cpu_var to explicitly calling raw_smp_processor_id and using the per_cpu() macro. On some debug configurations, the use of __get_cpu_var may cause ftrace to trigger and this can cause problems with the irqsoff tracing. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:13:41 +02:00
Ingo Molnar	4d9493c90f	ftrace: remove add-hoc code Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:13:32 +02:00
Steven Rostedt	d05cdb25d8	ftrace: fix dynamic ftrace selftest With the adding of the configuration changes in the Makefile to prevent tracing of functions in the ftrace code, all tracing of all the ftrace code has been removed. Unfortunately, one of the selftests, relied on a function to be traced. With the new change, the function was no longer traced and the test failed. This patch separates out the test function into its own file so that we can add the "-pg" flag to the compilation of that function and the adding of the mcount call to that function. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:13:23 +02:00
Steven Rostedt	06fa75ab56	ftrace: add TRACE_STACK and TRACE_SPECIAL to selftest validation The selftest validation code checks for valid entries in the trace buffer. TRACE_STACK and TRACE_SPECIAL have been added to the code but not to the validator. This patch adds the two to prevent them from flagging a failure in the selftest. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:13:12 +02:00
Steven Rostedt	4fe8c3048c	ftrace: printk and trace irqsoff and wakeups printk called from wakeup critical timings and irqs off can cause deadlocks since printk might do a wakeup itself. If the call to printk happens with the runqueue lock held, it can deadlock. This patch protects the printk from being called in trace irqs off with a test to see if the runqueue for the current CPU is locked. If it is locked, the printk is skipped. The wakeup always holds the runqueue lock, so the printk is simply removed. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:13:02 +02:00
Steven Rostedt	8f96da02c1	ftrace: remove wakeup from function trace trace_function is called by mcount and calling wake_up from that can have unpredictable results. This patch removes the wakeup from trace_function. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:12:48 +02:00
Ingo Molnar	694379e9ed	ftrace: make it more available in the Kconfig Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:12:26 +02:00
Peter Zijlstra	5429db2d26	ftrace: fix wakeup callback Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:09:02 +02:00
Peter Zijlstra	bac524d3f3	ftrace: trace next state Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:08:54 +02:00
Ingo Molnar	88a4216c3e	ftrace: sched special Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:08:47 +02:00
Ingo Molnar	36fc25a9f4	ftrace: sched tree fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:08:28 +02:00
Ingo Molnar	f29c73fe34	ftrace: include cpu in stacktrace Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:08:20 +02:00
Ingo Molnar	442e544ce5	ftrace: iter ctrl fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:08:12 +02:00
Ingo Molnar	d9af56fbd8	ftrace: fix cmdline tracing Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:07:57 +02:00
Ingo Molnar	36dfe9252b	ftrace: make use of tracing_cpumask Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:07:49 +02:00
Ingo Molnar	c7078de1aa	ftrace: add tracing_cpumask Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:07:41 +02:00
Ingo Molnar	4ac3ba41d3	ftrace: trace scheduler rbtree Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:07:31 +02:00
Ingo Molnar	1a3c303433	ftrace: fix __trace_special() Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:07:20 +02:00
Ingo Molnar	017730c112	ftrace: fix wakeups Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:05:02 +02:00
Ingo Molnar	24cd5d111e	ftrace: trace curr/next tasks Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:04:51 +02:00
Ingo Molnar	4e65551905	ftrace: sched tracer, trace full rbtree Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:04:44 +02:00
Ingo Molnar	4c1f4d4f01	ftrace: make nostacktrace the default Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:04:36 +02:00
Ingo Molnar	8ac0fca4cc	ftrace: sched tracer fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:04:28 +02:00
Ingo Molnar	86387f7ee5	ftrace: add stack tracing Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:04:20 +02:00
Ingo Molnar	57422797dc	ftrace: add wakeup events to sched tracer Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 21:04:06 +02:00
Ingo Molnar	e309b41dd6	ftrace: remove notrace now that we have a kbuild method for notrace, no need to pollute the C code with the annotations. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:58:28 +02:00
Ingo Molnar	b53dde9d34	ftrace: disable -pg for the tracer itself Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:56:53 +02:00
Steven Rostedt	caf8cdebfb	ftrace: remove address of function names PowerPC is very fragile when it comes to use of function names and function addresses. ftrace needs to either use all function addresses or function names (i.e. my_func as suppose to &my_func). This patch chooses to use the names and not the addresses, and makes ftrace consistent. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:56:31 +02:00
Ingo Molnar	9ff9cdb2d3	ftrace: cleanups clean up recent code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:56:05 +02:00
Steven Rostedt	6fb44b717c	ftrace: add trace_function api for other tracers to use A new check was added in the ftrace function that wont trace if the CPU trace buffer is disabled. Unfortunately, other tracers used ftrace() to write to the buffer after they disabled it. The new disable check makes these calls into a nop. This patch changes the __ftrace that is called without the check into a new api for the other tracers to use, called "trace_function". The other tracers use this interface instead when the trace CPU buffer is already disabled. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:55:55 +02:00
Soeren Sandmann Pedersen	2a2cc8f7c4	ftrace: allow the event pipe to be polled Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:55:43 +02:00
Ingo Molnar	2577046740	ftrace: build fix no need to backmerge, only affects ftrace-enabled kernels. (which is not the default) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:55:22 +02:00
Ingo Molnar	5e3ca0ec76	ftrace: introduce the "hex" output method Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:54:55 +02:00
Ingo Molnar	2e0f576185	ftrace: build fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:54:47 +02:00
Ingo Molnar	0fd9e0dac9	ftrace: use cpu clock again Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:54:37 +02:00
Steven Rostedt	26994ead1f	ftrace: enabled tracing by default This patch is the correct way to have tracing enabled by default. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:54:26 +02:00
Steven Rostedt	4eebcc81a3	ftrace: disable tracing on failure Since ftrace touches practically every function. If we detect any anomaly, we want to fully disable ftrace. This patch adds code to try shutdown ftrace as much as possible without doing any more harm is something is detected not quite correct. This only kills ftrace, this patch does have checks for other parts of the tracer (irqsoff, wakeup, etc.). Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:54:16 +02:00
Steven Rostedt	37ad508419	ftrace - fix dynamic ftrace memory leak The ftrace dynamic function update allocates a record to store the instruction pointers that are being modified. If the modified instruction pointer fails to update, then the record is marked as failed and nothing more is done. Worse, if the modification fails, but the record ip function is still called, it will allocate a new record and try again. In just a matter of time, will this cause a serious memory leak and crash the system. This patch plugs this memory leak. When a record fails, it is included back into the pool of records to be used. Now a record may fail over and over again, but the number of allocated records will not increase. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:54:04 +02:00
Steven Rostedt	088b1e427d	ftrace: pipe fixes Some fixes for better output with the trace pipe. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:53:19 +02:00
Ingo Molnar	dcb6308f2b	ftrace, locking fix should be an irq-safe lock ... Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:53:09 +02:00
Ingo Molnar	f0a920d575	ftrace: add trace_special() for ad-hoc tracing. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:52:33 +02:00
Ingo Molnar	cb0f12aae8	ftrace: bin-output Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:43:23 +02:00
Ingo Molnar	f9896bf309	ftrace: add raw output Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:43:14 +02:00
Ingo Molnar	8c523a9d82	ftrace: clean-up-pipe-iteration Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:43:05 +02:00
Ingo Molnar	cdd31cd2d7	ftrace: remove-idx-sync remove idx syncing - it's expensive on SMP. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:42:55 +02:00
Ingo Molnar	53c37c17aa	ftrace: fast, scalable, synchronized timestamps implement globally synchronized, fast and scalable time source for tracing. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:42:41 +02:00
Ingo Molnar	750ed1a407	ftrace: timestamp syncing, prepare rename and uninline now() to ftrace_now(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:42:31 +02:00
Ingo Molnar	4bf39a9411	ftrace: cleanups no code changed. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:42:20 +02:00
Ingo Molnar	d4c5a2f587	ftrace: fix locking we can hold all cpu trace buffer locks at once - put each into a separate lock class. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:42:12 +02:00
Steven Rostedt	b3806b4316	ftrace: user run time file reading This patch creates a file called trace_pipe in the tracing debug directory. This file is a consumer of the trace buffers. This means that reads of this file consumes the entries from the trace buffers so that they will not be read a second time, as contrast to the static buffers latency_trace and trace. Reading from the trace_pipe will remove the entries from trace and latency_trace too. The advantage that trace_pipe has is that it can record live traces. It will block when there is nothing in the buffer, and read the entries as they are entered. An EOF happens when tracing is disabled (tracing_enabled = 0). Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:42:01 +02:00
Steven Rostedt	214023c3d1	ftrace: add a buffer for output Later patches will need to print the same things as the seq output does. But those outputs will not use the seq utility. This patch adds a buffer to the iterator, that can be used by either the seq utility or other output. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:41:46 +02:00
Steven Rostedt	93a588f459	ftrace: change buffers to producer consumer This patch changes the way the CPU trace buffers are handled. Instead of always starting from the trace page head, the logic is changed to a producer consumer logic. This allows for the buffers to be drained while they are alive. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:41:35 +02:00
Steven Rostedt	1d4db00a5e	ftrace: reset selftests The tests may leave stuff in the buffers. This resets the buffers after each test is run. If a test fails, it does not reset the buffer to avoid touching a buffer that is corrupted. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:41:29 +02:00
Steven Rostedt	08bafa0efc	ftrace: disable all tracers on corrupted buffer If the trace buffer is detected to be corrupted, then we disable all tracers. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:41:22 +02:00
Ingo Molnar	4e3c3333f3	ftrace: fix time offset fix time offset calculations and ordering, plus make code more consistent. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:41:13 +02:00
Steven Rostedt	77a2b37d22	ftrace: startup tester on dynamic tracing. This patch adds a startup self test on dynamic code modification and filters. The test filters on a specific function, makes sure that no other function is traced, exectutes the function, then makes sure that the function is traced. This patch also fixes a slight bug with the ftrace selftest, where tracer_enabled was not being set. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:41:06 +02:00
Ingo Molnar	7bd2f24c2f	ftrace: add README make it easier for newbies to find their way around. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:40:56 +02:00
Ingo Molnar	c7aafc5497	ftrace: cleanups factor out code and clean it up. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:40:46 +02:00
Steven Rostedt	60a11774b3	ftrace: add self-tests Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:40:36 +02:00
Steven Rostedt	e1c08bdd9f	ftrace: force recording Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:40:29 +02:00
Steven Rostedt	57f50be14d	ftrace: fix max latency Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:40:22 +02:00
Steven Rostedt	89b2f97819	ftrace: fix updates to max trace This patch fixes some bugs to the updating of the max trace that was caused by implementing the new buffering. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:40:15 +02:00
Steven Rostedt	18cef379d3	ftrace: don't use raw_local_irq_save/restore Using raw_local_irq_save/restore confuses lockdep. It's fine to use the normal ones. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:40:05 +02:00
Steven Rostedt	0764d23cf0	ftrace: lockdep notrace annotations Add notrace annotations to lockdep to keep ftrace from causing recursive problems with lock tracing and debugging. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:39:40 +02:00
Steven Rostedt	361943ad0b	ftrace: irqs off smp_processor_id() fix The irqsoff function tracer did a __get_cpu_var to determine if it should trace the function or not. The problem is that __get_cpu_var can preempt between getting the CPU and reading the cpu variable. This means that the cpu variable that is being read is not from the cpu being run on. At worst, this can give a false positive, where we trace the function when we should not. It will never give a false negative since we only want to trace when interrupts are disabled and we never preempt when they are. This fix adds a check after reading the irq flags to only trace if the interrupts are actually disabled. It also changes the reading of the cpu variable to use a raw_smp_processor_id since we now don't care if we preempt. We still catch that fact. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:39:30 +02:00
Steven Rostedt	4c11d7aed3	ftrace: convert single large buffer into single pages. Allocating large buffers for the tracer may fail easily. This patch converts the buffer from a large ordered allocation to single pages. It uses the struct page LRU field to link the pages together. Later patches may also implement dynamic increasing and decreasing of the trace buffers. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:38:51 +02:00
Steven Rostedt	5072c59fd4	ftrace: add filter select functions to trace This patch adds two files to the debugfs system: /debugfs/tracing/available_filter_functions and /debugfs/tracing/set_ftrace_filter The available_filter_functions lists all functions that has been recorded by the ftraced that has called the ftrace_record_ip function. This is to allow users to see what functions have been converted to nops and can be enabled for tracing. To enable functions, simply echo the names (whitespace delimited) into set_ftrace_filter. Simple wildcards are also allowed. echo 'scheduler' > /debugfs/tracing/set_ftrace_filter Will have only the scheduler be activated when tracing is enabled. echo 'sched_' > /debugfs/tracing/set_ftrace_filter Will have only the functions starting with 'sched_' be activated. echo 'lock' > /debugfs/tracing/set_ftrace_filter Will have only functions ending with 'lock' be activated. echo 'lock' > /debugfs/tracing/set_ftrace_filter Will have only functions with 'lock' in its name be activated. Note: 'schedlock' will not work. The only wildcards that are allowed is an asterisk and the beginning and or end of the string passed in. Multiple names can be passed in with whitespace delimited: echo 'scheduler lock acpi' > /debugfs/tracing/set_ftrace_filter is also the same as: echo 'scheduler' > /debugfs/tracing/set_ftrace_filter echo 'lock' >> /debugfs/tracing/set_ftrace_filter echo 'acpi*' >> /debugfs/tracing/set_ftrace_filter Appending does just that. It appends to the list. To disable all filters simply echo an empty line in: echo > /debugfs/tracing/set_ftrace_filter Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:38:41 +02:00
Steven Rostedt	d61f82d066	ftrace: use dynamic patching for updating mcount calls This patch replaces the indirect call to the mcount function pointer with a direct call that will be patched by the dynamic ftrace routines. On boot up, the mcount function calls the ftace_stub function. When the dynamic ftrace code is initialized, the ftrace_stub is replaced with a call to the ftrace_record_ip, which records the instruction pointers of the locations that call it. Later, the ftraced daemon will call kstop_machine and patch all the locations to nops. When a ftrace is enabled, the original calls to mcount will now be set top call ftrace_caller, which will do a direct call to the registered ftrace function. This direct call is also patched when the function that should be called is updated. All patching is performed by a kstop_machine routine to prevent any type of race conditions that is associated with modifying code on the fly. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:33:47 +02:00
Steven Rostedt	3c1720f00b	ftrace: move memory management out of arch code This patch moves the memory management of the ftrace records out of the arch code and into the generic code making the arch code simpler. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:33:35 +02:00
Steven Rostedt	b0fc494fae	ftrace: add ftrace_enabled sysctl to disable mcount function This patch adds back the sysctl ftrace_enabled. This time it is defaulted to on, if DYNAMIC_FTRACE is configured. When ftrace_enabled is disabled, the ftrace function is set to the stub return. If DYNAMIC_FTRACE is also configured, on ftrace_enabled = 0, the registered ftrace functions will all be set to jmps, but no more new calls to ftrace recording (used to find the ftrace calling sites) will be called. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:33:19 +02:00
Steven Rostedt	3d0833953e	ftrace: dynamic enabling/disabling of function calls This patch adds a feature to dynamically replace the ftrace code with the jmps to allow a kernel with ftrace configured to run as fast as it can without it configured. The way this works, is on bootup (if ftrace is enabled), a ftrace function is registered to record the instruction pointer of all places that call the function. Later, if there's still any code to patch, a kthread is awoken (rate limited to at most once a second) that performs a stop_machine, and replaces all the code that was called with a jmp over the call to ftrace. It only replaces what was found the previous time. Typically the system reaches equilibrium quickly after bootup and there's no code patching needed at all. e.g. call ftrace /* 5 bytes / is replaced with jmp 3f / jmp is 2 bytes and we jump 3 forward */ 3: When we want to enable ftrace for function tracing, the IP recording is removed, and stop_machine is called again to replace all the locations of that were recorded back to the call of ftrace. When it is disabled, we replace the code back to the jmp. Allocation is done by the kthread. If the ftrace recording function is called, and we don't have any record slots available, then we simply skip that call. Once a second a new page (if needed) is allocated for recording new ftrace function calls. A large batch is allocated at boot up to get most of the calls there. Because we do this via stop_machine, we don't have to worry about another CPU executing a ftrace call as we modify it. But we do need to worry about NMI's so all functions that might be called via nmi must be annotated with notrace_nmi. When this code is configured in, the NMI code will not call notrace. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:33:09 +02:00
Steven Rostedt	6cd8a4bb2f	ftrace: trace preempt off critical timings Add preempt off timings. A lot of kernel core code is taken from the RT patch latency trace that was written by Ingo Molnar. This adds "preemptoff" and "preemptirqsoff" to /debugfs/tracing/available_tracers Now instead of just tracing irqs off, preemption off can be selected to be recorded. When this is selected, it shares the same files as irqs off timings. One can either trace preemption off, irqs off, or one or the other off. By echoing "preemptoff" into /debugfs/tracing/current_tracer, recording of preempt off only is performed. "irqsoff" will only record the time irqs are disabled, but "preemptirqsoff" will take the total time irqs or preemption are disabled. Runtime switching of these options is now supported by simpling echoing in the appropriate trace name into /debugfs/tracing/current_tracer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:32:54 +02:00
Steven Rostedt	81d68a96a3	ftrace: trace irq disabled critical timings This patch adds latency tracing for critical timings (how long interrupts are disabled for). "irqsoff" is added to /debugfs/tracing/available_tracers Note: tracing_max_latency also holds the max latency for irqsoff (in usecs). (default to large number so one must start latency tracing) tracing_thresh threshold (in usecs) to always print out if irqs off is detected to be longer than stated here. If irq_thresh is non-zero, then max_irq_latency is ignored. Here's an example of a trace with ftrace_enabled = 0 ======= preemption latency trace v1.1.5 on 2.6.24-rc7 Signed-off-by: Ingo Molnar <mingo@elte.hu> -------------------------------------------------------------------- latency: 100 us, #3/3, CPU#1 \| (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) ----------------- \| task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) ----------------- => started at: _spin_lock_irqsave+0x2a/0xb7 => ended at: _spin_unlock_irqrestore+0x32/0x5f _------=> CPU# / _-----=> irqs-off \| / _----=> need-resched \|\| / _---=> hardirq/softirq \|\|\| / _--=> preempt-depth \|\|\|\| / \|\|\|\|\| delay cmd pid \|\|\|\|\| time \| caller \ / \|\|\|\|\| \ \| / swapper-0 1d.s3 0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000]) swapper-0 1d.s3 100us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1d.s3 100us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f) vim:ft=help ======= And this is a trace with ftrace_enabled == 1 ======= preemption latency trace v1.1.5 on 2.6.24-rc7 -------------------------------------------------------------------- latency: 102 us, #12/12, CPU#1 \| (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) ----------------- \| task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) ----------------- => started at: _spin_lock_irqsave+0x2a/0xb7 => ended at: _spin_unlock_irqrestore+0x32/0x5f _------=> CPU# / _-----=> irqs-off \| / _----=> need-resched \|\| / _---=> hardirq/softirq \|\|\| / _--=> preempt-depth \|\|\|\| / \|\|\|\|\| delay cmd pid \|\|\|\|\| time \| caller \ / \|\|\|\|\| \ \| / swapper-0 1dNs3 0us+: _spin_lock_irqsave+0x2a/0xb7 (e1000_update_stats+0x47/0x64c [e1000]) swapper-0 1dNs3 46us : e1000_read_phy_reg+0x16/0x225 [e1000] (e1000_update_stats+0x5e2/0x64c [e1000]) swapper-0 1dNs3 46us : e1000_swfw_sync_acquire+0x10/0x99 [e1000] (e1000_read_phy_reg+0x49/0x225 [e1000]) swapper-0 1dNs3 46us : e1000_get_hw_eeprom_semaphore+0x12/0xa6 [e1000] (e1000_swfw_sync_acquire+0x36/0x99 [e1000]) swapper-0 1dNs3 47us : __const_udelay+0x9/0x47 (e1000_read_phy_reg+0x116/0x225 [e1000]) swapper-0 1dNs3 47us+: __delay+0x9/0x50 (__const_udelay+0x45/0x47) swapper-0 1dNs3 97us : preempt_schedule+0xc/0x84 (__delay+0x4e/0x50) swapper-0 1dNs3 98us : e1000_swfw_sync_release+0xc/0x55 [e1000] (e1000_read_phy_reg+0x211/0x225 [e1000]) swapper-0 1dNs3 99us+: e1000_put_hw_eeprom_semaphore+0x9/0x35 [e1000] (e1000_swfw_sync_release+0x50/0x55 [e1000]) swapper-0 1dNs3 101us : _spin_unlock_irqrestore+0xe/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1dNs3 102us : _spin_unlock_irqrestore+0x32/0x5f (e1000_update_stats+0x641/0x64c [e1000]) swapper-0 1dNs3 102us : trace_hardirqs_on_caller+0x75/0x89 (_spin_unlock_irqrestore+0x32/0x5f) vim:ft=help ======= Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:32:46 +02:00
Steven Rostedt	352ad25aa4	ftrace: tracer for scheduler wakeup latency This patch adds the tracer that tracks the wakeup latency of the highest priority waking task. "wakeup" is added to /debugfs/tracing/available_tracers Also added to /debugfs/tracing tracing_max_latency holds the current max latency for the wakeup wakeup_thresh if set to other than zero, a log will be recorded for every wakeup that takes longer than the number entered in here (usecs for all counters) (deletes previous trace) Examples: (with ftrace_enabled = 0) ============ preemption latency trace v1.1.5 on 2.6.24-rc8 Signed-off-by: Ingo Molnar <mingo@elte.hu> -------------------------------------------------------------------- latency: 26 us, #2/2, CPU#1 \| (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) ----------------- \| task: migration/0-3 (uid:0 nice:-5 policy:1 rt_prio:99) ----------------- _------=> CPU# / _-----=> irqs-off \| / _----=> need-resched \|\| / _---=> hardirq/softirq \|\|\| / _--=> preempt-depth \|\|\|\| / \|\|\|\|\| delay cmd pid \|\|\|\|\| time \| caller \ / \|\|\|\|\| \ \| / quilt-8551 0d..3 0us+: wake_up_process+0x15/0x17 <ffffffff80233e80> (sched_exec+0xc9/0x100 <ffffffff80235343>) quilt-8551 0d..4 26us : sched_switch_callback+0x73/0x81 <ffffffff80338d2f> (schedule+0x483/0x6d5 <ffffffff8048b3ee>) vim:ft=help ============ (with ftrace_enabled = 1) ============ preemption latency trace v1.1.5 on 2.6.24-rc8 -------------------------------------------------------------------- latency: 36 us, #45/45, CPU#0 \| (M:rt VP:0, KP:0, SP:0 HP:0 #P:2) ----------------- \| task: migration/1-5 (uid:0 nice:-5 policy:1 rt_prio:99) ----------------- _------=> CPU# / _-----=> irqs-off \| / _----=> need-resched \|\| / _---=> hardirq/softirq \|\|\| / _--=> preempt-depth \|\|\|\| / \|\|\|\|\| delay cmd pid \|\|\|\|\| time \| caller \ / \|\|\|\|\| \ \| / bash-10653 1d..3 0us : wake_up_process+0x15/0x17 <ffffffff80233e80> (sched_exec+0xc9/0x100 <ffffffff80235343>) bash-10653 1d..3 1us : try_to_wake_up+0x271/0x2e7 <ffffffff80233dcf> (sub_preempt_count+0xc/0x7a <ffffffff8023309e>) bash-10653 1d..2 2us : try_to_wake_up+0x296/0x2e7 <ffffffff80233df4> (update_rq_clock+0x9/0x20 <ffffffff802303f3>) bash-10653 1d..2 2us : update_rq_clock+0x1e/0x20 <ffffffff80230408> (__update_rq_clock+0xc/0x90 <ffffffff80230366>) bash-10653 1d..2 3us : __update_rq_clock+0x1b/0x90 <ffffffff80230375> (sched_clock+0x9/0x29 <ffffffff80214529>) bash-10653 1d..2 4us : try_to_wake_up+0x2a6/0x2e7 <ffffffff80233e04> (activate_task+0xc/0x3f <ffffffff8022ffca>) bash-10653 1d..2 4us : activate_task+0x2d/0x3f <ffffffff8022ffeb> (enqueue_task+0xe/0x66 <ffffffff8022ff66>) bash-10653 1d..2 5us : enqueue_task+0x5b/0x66 <ffffffff8022ffb3> (enqueue_task_rt+0x9/0x3c <ffffffff80233351>) bash-10653 1d..2 6us : try_to_wake_up+0x2ba/0x2e7 <ffffffff80233e18> (check_preempt_wakeup+0x12/0x99 <ffffffff80234f84>) [...] bash-10653 1d..5 33us : tracing_record_cmdline+0xcf/0xd4 <ffffffff80338aad> (_spin_unlock+0x9/0x33 <ffffffff8048d3ec>) bash-10653 1d..5 34us : _spin_unlock+0x19/0x33 <ffffffff8048d3fc> (sub_preempt_count+0xc/0x7a <ffffffff8023309e>) bash-10653 1d..4 35us : wakeup_sched_switch+0x65/0x2ff <ffffffff80339f66> (_spin_lock_irqsave+0xc/0xa9 <ffffffff8048d08b>) bash-10653 1d..4 35us : _spin_lock_irqsave+0x19/0xa9 <ffffffff8048d098> (add_preempt_count+0xe/0x77 <ffffffff8023311a>) bash-10653 1d..4 36us : sched_switch_callback+0x73/0x81 <ffffffff80338d2f> (schedule+0x483/0x6d5 <ffffffff8048b3ee>) vim:ft=help ============ The [...] was added here to not waste your email box space. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:32:36 +02:00
Steven Rostedt	35e8e302e5	ftrace: add tracing of context switches This patch adds context switch tracing, of the format of: _------=> CPU# / _-----=> irqs-off \| / _----=> need-resched \|\| / _---=> hardirq/softirq \|\|\| / _--=> preempt-depth \|\|\|\| / \|\|\|\|\| delay cmd pid \|\|\|\|\| time \| pid:prio:state \ / \|\|\|\|\| \ \| / swapper-0 1d..3 137us+: 0:140:R --> 2912:120 sshd-2912 1d..3 216us+: 2912:120:S --> 0:140 swapper-0 1d..3 261us+: 0:140:R --> 2912:120 bash-2920 0d..3 267us+: 2920:120:S --> 0:140 sshd-2912 1d..3 330us!: 2912:120:S --> 0:140 swapper-0 1d..3 2389us+: 0:140:R --> 2847:120 yum-upda-2847 1d..3 2411us!: 2847:120:S --> 0:140 swapper-0 0d..3 11089us+: 0:140:R --> 3139:120 gdm-bina-3139 0d..3 11113us!: 3139:120:S --> 0:140 swapper-0 1d..3 102328us+: 0:140:R --> 2847:120 yum-upda-2847 1d..3 102348us!: 2847:120:S --> 0:140 "sched_switch" is added to /debugfs/tracing/available_tracers [ Eugene Teo <eugeneteo@kernel.sg: remove unused tracing_sched_switch_enabled ] Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:32:27 +02:00
Steven Rostedt	1b29b01887	ftrace: function tracer This is a simple trace that uses the ftrace infrastructure. It is designed to be fast and small, and easy to use. It is useful to record things that happen over a very short period of time, and not to analyze the system in general. Updates: available_tracers "function" is added to this file. current_tracer To enable the function tracer: echo function > /debugfs/tracing/current_tracer To disable the tracer: echo disable > /debugfs/tracing/current_tracer The output of the function_trace file is as follows "echo noverbose > /debugfs/tracing/iter_ctrl" preemption latency trace v1.1.5 on 2.6.24-rc7-tst Signed-off-by: Ingo Molnar <mingo@elte.hu> -------------------------------------------------------------------- latency: 0 us, #419428/4361791, CPU#1 \| (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4) ----------------- \| task: -0 (uid:0 nice:0 policy:0 rt_prio:0) ----------------- _------=> CPU# / _-----=> irqs-off \| / _----=> need-resched \|\| / _---=> hardirq/softirq \|\|\| / _--=> preempt-depth \|\|\|\| / \|\|\|\|\| delay cmd pid \|\|\|\|\| time \| caller \ / \|\|\|\|\| \ \| / swapper-0 0d.h. 1595128us+: set_normalized_timespec+0x8/0x2d <c043841d> (ktime_get_ts+0x4a/0x4e <c04499d4>) swapper-0 0d.h. 1595131us+: _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>) Or with verbose turned on: "echo verbose > /debugfs/tracing/iter_ctrl" preemption latency trace v1.1.5 on 2.6.24-rc7-tst -------------------------------------------------------------------- latency: 0 us, #419428/4361791, CPU#1 \| (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4) ----------------- \| task: -0 (uid:0 nice:0 policy:0 rt_prio:0) ----------------- swapper 0 0 9 00000000 00000000 [f3675f41] 1595.128ms (+0.003ms): set_normalized_timespec+0x8/0x2d <c043841d> (ktime_get_ts+0x4a/0x4e <c04499d4>) swapper 0 0 9 00000000 00000001 [f3675f45] 1595.131ms (+0.003ms): _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>) swapper 0 0 9 00000000 00000002 [f3675f48] 1595.135ms (+0.003ms): _spin_lock+0x8/0x18 <c0630690> (hrtimer_interrupt+0x6e/0x1b0 <c0449c56>) The "trace" file is not affected by the verbose mode, but is by the symonly. echo "nosymonly" > /debugfs/tracing/iter_ctrl tracer: [ 81.479967] CPU 0: bash:3154 register_ftrace_function+0x5f/0x66 <ffffffff80337a4d> <-- _spin_unlock_irqrestore+0xe/0x5a <ffffffff8048cc8f> [ 81.479967] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <ffffffff8048ccbf> <-- sub_preempt_count+0xc/0x7a <ffffffff80233d7b> [ 81.479968] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <ffffffff80233d9f> <-- in_lock_functions+0x9/0x24 <ffffffff8025a75d> [ 81.479968] CPU 0: bash:3154 vfs_write+0x11d/0x155 <ffffffff8029a043> <-- dnotify_parent+0x12/0x78 <ffffffff802d54fb> [ 81.479968] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <ffffffff802d5516> <-- _spin_lock+0xe/0x70 <ffffffff8048c910> [ 81.479969] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <ffffffff8048c91d> <-- add_preempt_count+0xe/0x77 <ffffffff80233df7> [ 81.479969] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <ffffffff80233e27> <-- in_lock_functions+0x9/0x24 <ffffffff8025a75d> echo "symonly" > /debugfs/tracing/iter_ctrl tracer: [ 81.479913] CPU 0: bash:3154 register_ftrace_function+0x5f/0x66 <-- _spin_unlock_irqrestore+0xe/0x5a [ 81.479913] CPU 0: bash:3154 _spin_unlock_irqrestore+0x3e/0x5a <-- sub_preempt_count+0xc/0x7a [ 81.479913] CPU 0: bash:3154 sub_preempt_count+0x30/0x7a <-- in_lock_functions+0x9/0x24 [ 81.479914] CPU 0: bash:3154 vfs_write+0x11d/0x155 <-- dnotify_parent+0x12/0x78 [ 81.479914] CPU 0: bash:3154 dnotify_parent+0x2d/0x78 <-- _spin_lock+0xe/0x70 [ 81.479914] CPU 0: bash:3154 _spin_lock+0x1b/0x70 <-- add_preempt_count+0xe/0x77 [ 81.479914] CPU 0: bash:3154 add_preempt_count+0x3e/0x77 <-- in_lock_functions+0x9/0x24 Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:32:13 +02:00
Steven Rostedt	bc0c38d139	ftrace: latency tracer infrastructure This patch adds the latency tracer infrastructure. This patch does not add anything that will select and turn it on, but will be used by later patches. If it were to be compiled, it would add the following files to the debugfs: The root tracing directory: /debugfs/tracing/ This patch also adds the following files: available_tracers list of available tracers. Currently no tracers are available. Looking into this file only shows "none" which is used to unregister all tracers. current_tracer The trace that is currently active. Empty on start up. To switch to a tracer simply echo one of the tracers that are listed in available_tracers: example: (used with later patches) echo function > /debugfs/tracing/current_tracer To disable the tracer: echo disable > /debugfs/tracing/current_tracer tracing_enabled echoing "1" into this file starts the ftrace function tracing (if sysctl kernel.ftrace_enabled=1) echoing "0" turns it off. latency_trace This file is readonly and holds the result of the trace. trace This file outputs a easier to read version of the trace. iter_ctrl Controls the way the output of traces look. So far there's two controls: echoing in "symonly" will only show the kallsyms variables without the addresses (if kallsyms was configured) echoing in "verbose" will change the output to show a lot more data, but not very easy to understand by humans. echoing in "nosymonly" turns off symonly. echoing in "noverbose" turns off verbose. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:32:06 +02:00
Arnaldo Carvalho de Melo	16444a8a40	ftrace: add basic support for gcc profiler instrumentation If CONFIG_FTRACE is selected and /proc/sys/kernel/ftrace_enabled is set to a non-zero value the ftrace routine will be called everytime we enter a kernel function that is not marked with the "notrace" attribute. The ftrace routine will then call a registered function if a function happens to be registered. [ This code has been highly hacked by Steven Rostedt and Ingo Molnar, so don't blame Arnaldo for all of this ;-) ] Update: It is now possible to register more than one ftrace function. If only one ftrace function is registered, that will be the function that ftrace calls directly. If more than one function is registered, then ftrace will call a function that will loop through the functions to call. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:31:58 +02:00
Steven Rostedt	7c731e0a49	ftrace: make the task state char-string visible to all The tracer wants to be able to convert the state number into a user visible character. This patch pulls that conversion string out the scheduler into the header. This way if it were to ever change, other parts of the kernel will know. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:31:05 +02:00
Ingo Molnar	bd3bff9e20	sched: add latency tracer callbacks to the scheduler add 3 lightweight callbacks to the tracer backend. zero impact if tracing is turned off. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 20:30:55 +02:00
Mike Travis	cad0e458d1	clocksource/events: use performance variant for_each_cpu_mask_nr Change references from for_each_cpu_mask to for_each_cpu_mask_nr where appropriate Reviewed-by: Paul Jackson <pj@sgi.com> Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 18:39:06 +02:00
Mike Travis	363ab6f142	core: use performance variant for_each_cpu_mask_nr Change references from for_each_cpu_mask to for_each_cpu_mask_nr where appropriate Reviewed-by: Paul Jackson <pj@sgi.com> Reviewed-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-23 18:35:12 +02:00
Mike Travis	a953e4597a	sched: replace MAX_NUMNODES with nr_node_ids in kernel/sched.c * Replace usages of MAX_NUMNODES with nr_node_ids in kernel/sched.c, where appropriate. This saves some allocated space as well as many wasted cycles going through node entries that are non-existent. For inclusion into sched-devel/latest tree. Based on: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git + sched-devel/latest .../mingo/linux-2.6-sched-devel.git Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-23 18:22:17 +02:00
Christian Borntraeger	3401a61e16	stop_machine: make stop_machine_run more virtualization friendly On kvm I have seen some rare hangs in stop_machine when I used more guest cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the hang quite often. I could also reproduce the problem on a 4 way z/VM host with a 64 way guest. It turned out that the guest was consuming all available cpus mostly for spinning on scheduler locks like rq->lock. This is expected as the threads are calling yield all the time. The problem is now, that the host scheduling decisings together with the guest scheduling decisions and spinlocks not being fair managed to create an interesting scenario similar to a live lock. (Sometimes the hang resolved itself after some minutes) Changing stop_machine to yield the cpu to the hypervisor when yielding inside the guest fixed the problem for me. While I am not completely happy with this patch, I think it causes no harm and it really improves the situation for me. I used cpu_relax for yielding to the hypervisor, does that work on all architectures? p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use stop_machine_run and both triggered the problem after some retries. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-23 13:09:34 +10:00
Denis V. Lunev	34e4e2fef4	modules: proper cleanup of kobject without CONFIG_SYSFS kobject: '<NULL>' (ffffffffa0104050): is not initialized, yet kobject_put() is being called. ------------[ cut here ]------------ WARNING: at /home/den/src/linux-netns26/lib/kobject.c:583 kobject_put+0x53/0x55() Modules linked in: ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ide_cd_mod cdrom button [last unloaded: pktgen] comm: rmmod Tainted: G W 2.6.26-rc3 #585 Call Trace: [<ffffffff802359ab>] warn_on_slowpath+0x58/0x7a [<ffffffff80236aca>] ? printk+0x67/0x69 [<ffffffff80236aca>] ? printk+0x67/0x69 [<ffffffff80324289>] kobject_put+0x53/0x55 [<ffffffff8025e2ee>] free_module+0x87/0xfa [<ffffffff8025fee5>] sys_delete_module+0x178/0x1e1 [<ffffffff804b1e70>] ? lockdep_sys_exit_thunk+0x35/0x67 [<ffffffff804b1dff>] ? trace_hardirqs_on_thunk+0x35/0x3a [<ffffffff8020c0bb>] system_call_after_swapgs+0x7b/0x80 ---[ end trace 8f5aafa7f6406cf8 ]--- mod->mkobj.kobj is not initialized without CONFIG_SYSFS. Do not call kobject_put in this case. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-23 13:09:33 +10:00
Cyrill Gorcunov	c4ea6fcf5a	module loading ELF handling: use SELFMAG instead of numeric constant Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-23 13:09:32 +10:00
Linus Torvalds	16ae527bfa	Merge branch 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b51' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] list_for_each_rcu must die: audit [patch 1/1] audit_send_reply(): fix error-path memory leak [PATCH] open sessionid permissions	2008-05-19 16:38:10 -07:00
Huang Weiyi	247ab1a805	rcu: remove duplicated include in kernel/rcupreempt.c Removed duplicated include file <linux/rcupdate.h> in kernel/rcupreempt.c. Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-19 10:03:39 +02:00
Huang Weiyi	e19a98967f	rcu: remove duplicated include in kernel/rcupreempt_trace.c Removed duplicated include file <linux/rcupdate.h> in kernel/rcupreempt_trace.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-19 10:03:39 +02:00
Paul E. McKenney	d7c0651390	rcu: fix rcu_try_flip_waitack_needed() to prevent grace-period stall The comment was correct -- need to make the code match the comment. Without this patch, if a CPU goes dynticks idle (and stays there forever) in just the right phase of preemptible-RCU grace-period processing, grace periods stall. The offending sequence of events (courtesy of Promela/spin, at least after I got the liveness criterion coded correctly...) is as follows: o CPU 0 is in dynticks-idle mode. Its dynticks_progress_counter is (say) 10. o CPU 0 takes an interrupt, so rcu_irq_enter() increments CPU 0's dynticks_progress_counter to 11. o CPU 1 is doing RCU grace-period processing in rcu_try_flip_idle(), sees rcu_pending(), so invokes dyntick_save_progress_counter(), which in turn takes a snapshot of CPU 0's dynticks_progress_counter into CPU 0's rcu_dyntick_snapshot -- now set to 11. CPU 1 then updates the RCU grace-period state to rcu_try_flip_waitack(). o CPU 0 returns from its interrupt, so rcu_irq_exit() increments CPU 0's dynticks_progress_counter to 12. o CPU 1 later invokes rcu_try_flip_waitack(), which notices that CPU 0 has not yet responded, and hence in turn invokes rcu_try_flip_waitack_needed(). This function examines the state of CPU 0's dynticks_progress_counter and rcu_dyntick_snapshot variables, which it copies to curr (== 12) and snap (== 11), respectively. Because curr!=snap, the first condition fails. Because curr-snap is only 1 and snap is odd, the second condition fails. rcu_try_flip_waitack_needed() therefore incorrectly concludes that it must wait for CPU 0 to explicitly acknowledge the counter flip. o CPU 0 remains forever in dynticks-idle mode, never taking any more hardware interrupts or any NMIs, and never running any more tasks. (Of course, -something- will usually eventually happen, which might be why we haven't seen this one in the wild. Still should be fixed!) Therefore the grace period never ends. Fix is to make the code match the comment, as shown below. With this fix, the above scenario would be satisfied with curr being even, and allow the grace period to proceed. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Josh Triplett <josh@kernel.org> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-19 10:03:38 +02:00
Franck Bui-Huu	82524746c2	rcu: split list.h and move rcu-protected lists into rculist.h Move rcu-protected lists from list.h into a new header file rculist.h. This is done because list are a very used primitive structure all over the kernel and it's currently impossible to include other header files in this list.h without creating some circular dependencies. For example, list.h implements rcu-protected list and uses rcu_dereference() without including rcupdate.h. It actually compiles because users of rcu_dereference() are macros. Others RCU functions could be used too but aren't probably because of this. Therefore this patch creates rculist.h which includes rcupdates without to many changes/troubles. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-19 10:01:37 +02:00
Paul E. McKenney	2326974df2	rcu: add call_rcu_sched() and friends to rcutorture Add entry to rcu_torture_ops allowing the correct barrier function to be used upon exit from rcutorture. Also add torture options for the new call_rcu_sched() API. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-19 10:01:37 +02:00
Paul E. McKenney	70f12f848d	rcu: add rcu_barrier_sched() and rcu_barrier_bh() Add rcu_barrier_sched() and rcu_barrier_bh(). With these in place, rcutorture no longer gives the occasional oops when repeatedly starting and stopping torturing rcu_bh. Also adds the API needed to flush out pre-existing call_rcu_sched() callbacks. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-19 10:01:36 +02:00
Paul E. McKenney	8db559b830	rcu: add memory barriers and comments to rcu_check_callbacks() Add comments to the logic that infers quiescent states when interrupting from either user mode or the idle loop. Also add a memory barrier: it appears that James Huang was in fact onto something, as the scheduler is much less synchronization happy than it once was, so we can no longer rely on its memory barriers in all cases. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: James Huang <jamesclhuang@yahoo.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-19 10:01:36 +02:00
Paul E. McKenney	4446a36ff8	rcu: add call_rcu_sched() Fourth cut of patch to provide the call_rcu_sched(). This is again to synchronize_sched() as call_rcu() is to synchronize_rcu(). Should be fine for experimental and -rt use, but not ready for inclusion. With some luck, I will be able to tell Andrew to come out of hiding on the next round. Passes multi-day rcutorture sessions with concurrent CPU hotplugging. Fixes since the first version include a bug that could result in indefinite blocking (spotted by Gautham Shenoy), better resiliency against CPU-hotplug operations, and other minor fixes. Fixes since the second version include reworking grace-period detection to avoid deadlocks that could happen when running concurrently with CPU hotplug, adding Mathieu's fix to avoid the softlockup messages, as well as Mathieu's fix to allow use earlier in boot. Fixes since the third version include a wrong-CPU bug spotted by Andrew, getting rid of the obsolete synchronize_kernel API that somehow snuck back in, merging spin_unlock() and local_irq_restore() in a few places, commenting the code that checks for quiescent states based on interrupting from user-mode execution or the idle loop, removing some inline attributes, and some code-style changes. Known/suspected shortcomings: o I still do not entirely trust the sleep/wakeup logic. Next step will be to use a private snapshot of the CPU online mask in rcu_sched_grace_period() -- if the CPU wasn't there at the start of the grace period, we don't need to hear from it. And the bit about accounting for changes in online CPUs inside of rcu_sched_grace_period() is ugly anyway. o It might be good for rcu_sched_grace_period() to invoke resched_cpu() when a given CPU wasn't responding quickly, but resched_cpu() is declared static... This patch also fixes a long-standing bug in the earlier preemptable-RCU implementation of synchronize_rcu() that could result in loss of concurrent external changes to a task's CPU affinity mask. I still cannot remember who reported this... Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-19 10:01:36 +02:00
Steven Rostedt	8b09dee67f	rcupreempt: remove duplicate prototypes rcu_batches_completed and rcu_patches_completed_bh are both declared in rcuclassic.h and rcupreempt.h. This patch removes the extra prototypes for them from rcupdate.h. rcu_batches_completed_bh is defined as a static inline in the rcupreempt.h header file. Trying to export this as EXPORT_SYMBOL_GPL causes linking problems with the powerpc linker. There's no need to export a static inlined function. Modules must be compiled with the same type of RCU implementation as the kernel they are for. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-19 10:01:35 +02:00
Paul E. McKenney	6793a051fb	[PATCH] list_for_each_rcu must die: audit All uses of list_for_each_rcu() can be profitably replaced by the easier-to-use list_for_each_entry_rcu(). This patch makes this change for the Audit system, in preparation for removing the list_for_each_rcu() API entirely. This time with well-formed SOB. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-05-17 03:30:23 -04:00
Andrew Morton	fcaf1eb868	[patch 1/1] audit_send_reply(): fix error-path memory leak Addresses http://bugzilla.kernel.org/show_bug.cgi?id=10663 Reporter: Daniel Marjamki <danielm77@spray.se> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-05-17 03:30:22 -04:00
Al Viro	eceea0b3df	[PATCH] avoid multiplication overflows and signedness issues for max_fds Limit sysctl_nr_open - we don't want ->max_fds to exceed MAX_INT and we don't want size calculation for ->fd[] to overflow. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-05-16 17:22:52 -04:00
Al Viro	02afc6267f	[PATCH] dup_fd() fixes, part 1 Move the sucker to fs/file.c in preparation to the rest Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-05-16 17:22:26 -04:00
Jeremy Kerr	493d35863d	mutex-debug: check mutex magic before owner Currently, the mutex debug code checks the lock->owner before lock->magic, so a corrupt mutex will most likely result in failing the owner check, rather than the magic check. This change to debug_mutex_unlock does the magic check first, so we have a better idea of what breaks. Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-16 16:53:35 +02:00
Harvey Harrison	3fc957721d	lib: create common ascii hex array Add a common hex array in hexdump.c so everyone can use it. Add a common hi/lo helper to avoid the shifting masking that is done to get the upper and lower nibbles of a byte value. Pull the pack_hex_byte helper from kgdb as it is opencoded many places in the tree that will be consolidated. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-14 19:11:14 -07:00
Mirco Tischler	0c70814c31	cgroups: fix compile warning Return type of cpu_rt_runtime_write() should be int instead of ssize_t. Signed-off-by: Mirco Tischler <mt-ml@gmx.de> Acked-by: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-14 19:11:14 -07:00
Russell King	ee9c578527	dyntick: Remove last reminants of dyntick support Remove the last reminants of dyntick support from the generic kernel. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2008-05-12 17:39:14 +01:00
Linus Torvalds	c3921ab715	Add new 'cond_resched_bkl()' helper function It acts exactly like a regular 'cond_resched()', but will not get optimized away when CONFIG_PREEMPT is set. Normal kernel code is already preemptable in the presense of CONFIG_PREEMPT, so cond_resched() is optimized away (see commit `02b67cc3ba` "sched: do not do cond_resched() when CONFIG_PREEMPT"). But when wanting to conditionally reschedule while holding a lock, you need to use "cond_sched_lock(lock)", and the new function is the BKL equivalent of that. Also make fs/locks.c use it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-11 16:04:48 -07:00
Linus Torvalds	8e3e076c5a	BKL: revert back to the old spinlock implementation The generic semaphore rewrite had a huge performance regression on AIM7 (and potentially other BKL-heavy benchmarks) because the generic semaphores had been rewritten to be simple to understand and fair. The latter, in particular, turns a semaphore-based BKL implementation into a mess of scheduling. The attempt to fix the performance regression failed miserably (see the previous commit `00b41ec261` 'Revert "semaphore: fix"'), and so for now the simple and sane approach is to instead just go back to the old spinlock-based BKL implementation that never had any issues like this. This patch also has the advantage of being reported to fix the regression completely according to Yanmin Zhang, unlike the semaphore hack which still left a couple percentage point regression. As a spinlock, the BKL obviously has the potential to be a latency issue, but it's not really any different from any other spinlock in that respect. We do want to get rid of the BKL asap, but that has been the plan for several years. These days, the biggest users are in the tty layer (open/release in particular) and Alan holds out some hope: "tty release is probably a few months away from getting cured - I'm afraid it will almost certainly be the very last user of the BKL in tty to get fixed as it depends on everything else being sanely locked." so while we're not there yet, we do have a plan of action. Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andi Kleen <andi@firstfloor.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Alexander Viro <viro@ftp.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-10 20:58:02 -07:00
Linus Torvalds	00b41ec261	Revert "semaphore: fix" This reverts commit `bf726eab37`, as it has been reported to cause a regression with processes stuck in __down(), apparently because some missing wakeup. Quoth Sven Wegener: "I'm currently investigating a regression that has showed up with my last git pull yesterday. Bisecting the commits showed bf726e "semaphore: fix" to be the culprit, reverting it fixed the issue. Symptoms: During heavy filesystem usage (e.g. a kernel compile) I get several compiler processes in uninterruptible sleep, blocking all i/o on the filesystem. System is an Intel Core 2 Quad running a 64bit kernel and userspace. Filesystem is xfs on top of lvm. See below for the output of sysrq-w." See http://lkml.org/lkml/2008/5/10/45 for full report. In the meantime, we can just fix the BKL performance regression by reverting back to the good old BKL spinlock implementation instead, since any sleeping lock will generally perform badly, especially if it tries to be fair. Reported-by: Sven Wegener <sven.wegener@stealer.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-10 20:43:22 -07:00
Rusty Russell	91e37a793b	module: don't ignore vermagic string if module doesn't have modversions Linus found a logic bug: we ignore the version number in a module's vermagic string if we have CONFIG_MODVERSIONS set, but modversions also lets through a module with no __versions section for modprobe --force (with tainting, but still). We should only ignore the start of the vermagic string if the module actually has crcs to check. Rather than (say) having an entertaining hissy fit and creating a config option to work around the buggy code. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-09 07:45:18 -07:00
Rusty Russell	a5dd697074	module: be more picky about allowing missing module versions We allow missing __versions sections, because modprobe --force strips it. It makes less sense to allow sections where there's no version for a specific symbol the module uses, so disallow that. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-09 07:45:18 -07:00
Linus Torvalds	c4f51b4662	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes: sched: fix weight calculations semaphore: fix	2008-05-08 11:31:07 -07:00
Linus Torvalds	7a34912d90	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: Revert "relay: fix splice problem" docbook: fix bio missing parameter block: use unitialized_var() in bio_alloc_bioset() block: avoid duplicate calls to get_part() in disk stat code cfq-iosched: make io priorities inherit CPU scheduling class as well as nice block: optimize generic_unplug_device() block: get rid of likely/unlikely predictions in merge logic vfs: splice remove_suid() cleanup cfq-iosched: fix RCU race in the cfq io_context destructor handling block: adjust tagging function queue bit locking block: sysfs store function needs to grab queue_lock and use queue_flag_*()	2008-05-08 10:48:36 -07:00
Paul Menage	5be7a4792a	Fix cpuset sched_relax_domain_level control file Due to a merge conflict, the sched_relax_domain_level control file was marked as being handled by cpuset_read/write_u64, but the code to handle it was actually in cpuset_common_file_read/write. Since the value being written/read is in fact a signed integer, it should be treated as such; this patch adds cpuset_read/write_s64 functions, and uses them to handle the sched_relax_domain_level file. With this patch, the sched_relax_domain_level can be read and written, and the correct contents seen/updated. Signed-off-by: Paul Menage <menage@google.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-08 10:46:56 -07:00
Mike Galbraith	46151122e0	sched: fix weight calculations The conversion between virtual and real time is as follows: dvt = rw/w * dt <=> dt = w/rw * dvt Since we want the fair sleeper granularity to be in real time, we actually need to do: dvt = - rw/w * l This bug could be related to the regression reported by Yanmin Zhang: \| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots \| of regressions with 2.6.26-rc1: \| \| 1) 8-core stoakley: 28%; \| 2) 16-core tigerton: 20%; \| 3) Itanium Montvale: 50%. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-08 17:00:42 +02:00
Ingo Molnar	bf726eab37	semaphore: fix Yanmin Zhang reported: \| Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th \| regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, \| and Itanium Montecito. Bisect located the patch below: \| \| `64ac24e738` is first bad commit \| commit `64ac24e738` \| Author: Matthew Wilcox <matthew@wil.cx> \| Date: Fri Mar 7 21:55:58 2008 -0500 \| \| Generic semaphore implementation \| \| After I manually reverted the patch against 2.6.26-rc1 while fixing \| lots of conflicts/errors, aim7 regression became less than 2%. i reproduced the AIM7 workload and can confirm Yanmin's findings that -.26-rc1 regresses over .25 - by over 67% here. Looking at the workload i found and fixed what i believe to be the real bug causing the AIM7 regression: it was inefficient wakeup / scheduling / locking behavior of the new generic semaphore code, causing suboptimal performance. The problem comes from the following code. The new semaphore code does this on down(): spin_lock_irqsave(&sem->lock, flags); if (likely(sem->count > 0)) sem->count--; else __down(sem); spin_unlock_irqrestore(&sem->lock, flags); and this on up(): spin_lock_irqsave(&sem->lock, flags); if (likely(list_empty(&sem->wait_list))) sem->count++; else __up(sem); spin_unlock_irqrestore(&sem->lock, flags); where __up() does: list_del(&waiter->list); waiter->up = 1; wake_up_process(waiter->task); and where __down() does this in essence: list_add_tail(&waiter.list, &sem->wait_list); waiter.task = task; waiter.up = 0; for (;;) { [...] spin_unlock_irq(&sem->lock); timeout = schedule_timeout(timeout); spin_lock_irq(&sem->lock); if (waiter.up) return 0; } the fastpath looks good and obvious, but note the following property of the contended path: if there's a task on the ->wait_list, the up() of the current owner will "pass over" ownership to that waiting task, in a wake-one manner, via the waiter->up flag and by removing the waiter from the wait list. That is all and fine in principle, but as implemented in kernel/semaphore.c it also creates a nasty, hidden source of contention! The contention comes from the following property of the new semaphore code: the new owner owns the semaphore exclusively, even if it is not running yet. So if the old owner, even if just a few instructions later, does a down() [lock_kernel()] again, it will be blocked and will have to wait on the new owner to eventually be scheduled (possibly on another CPU)! Or if another task gets to lock_kernel() sooner than the "new owner" scheduled, it will be blocked unnecessarily and for a very long time when there are 2000 tasks running. I.e. the implementation of the new semaphores code does wake-one and lock ownership in a very restrictive way - it does not allow opportunistic re-locking of the lock at all and keeps the scheduler from picking task order intelligently. This kind of scheduling, with 2000 AIM7 processes running, creates awful cross-scheduling between those 2000 tasks, causes reduced parallelism, a throttled runqueue length and a lot of idle time. With increasing number of CPUs it causes an exponentially worse behavior in AIM7, as the chance for a newly woken new-owner task to actually run anytime soon is less and less likely. Note that it takes just a tiny bit of contention for the 'new-semaphore catastrophy' to happen: the wakeup latencies get added to whatever small contention there is, and quickly snowball out of control! I believe Yanmin's findings and numbers support this analysis too. The best fix for this problem is to use the same scheduling logic that the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and wanted because we do not want to over-schedule), but also allow opportunistic locking of the lock even if a wakee is already "in flight". The patch below implements this new logic. With this patch applied the AIM7 regression is largely fixed on my quad testbox: # v2.6.25 vanilla: .................. Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 56096.4 91 207.5 789.7 0.4675 2000 55894.4 94 208.2 792.7 0.4658 # v2.6.26-rc1-166-gc0a1811 vanilla: ................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 33230.6 83 350.3 784.5 0.2769 2000 31778.1 86 366.3 783.6 0.2648 # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup: ............................................... Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 55707.1 92 209.0 795.6 0.4642 2000 55704.4 96 209.0 796.0 0.4642 i.e. a 67% speedup. We are now back to within 1% of the v2.6.25 performance levels and have zero idle time during the test, as expected. Btw., interactivity also improved dramatically with the fix - for example console-switching became almost instantaneous during this workload (which after all is running 2000 tasks at once!), without the patch it was stuck for a minute at times. There's another nice side-effect of this speedup patch, the new generic semaphore code got even smaller: text data bss dec hex filename 1241 0 0 1241 4d9 semaphore.o.before 1207 0 0 1207 4b7 semaphore.o.after (because the waiter.up complication got removed.) Longer-term we should look into using the mutex code for the generic semaphore code as well - but i's not easy due to legacies and it's outside of the scope of v2.6.26 and outside the scope of this patch as well. Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-08 17:00:42 +02:00
Jens Axboe	75065ff619	Revert "relay: fix splice problem" This reverts commit `c3270e577c`.	2008-05-08 14:06:19 +02:00
Peter Zijlstra	3e51f33fcc	sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK this replaces the rq->clock stuff (and possibly cpu_clock()). - architectures that have an 'imperfect' hardware clock can set CONFIG_HAVE_UNSTABLE_SCHED_CLOCK - the 'jiffie' window might be superfulous when we update tick_gtod before the __update_sched_clock() call in sched_clock_tick() - cpu_clock() might be implemented as: sched_clock_cpu(smp_processor_id()) if the accuracy proves good enough - how far can TSC drift in a single jiffie when considering the filtering and idle hooks? [ mingo@elte.hu: various fixes and cleanups ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Ingo Molnar	dfbf4a1bc3	sched: fix cpu clock David Miller pointed it out that nothing in cpu_clock() sets prev_cpu_time. This caused __sync_cpu_clock() to be called all the time - against the intention of this code. The result was that in practice we hit a global spinlock every time cpu_clock() is called - which - even though cpu_clock() is used for tracing and debugging, is suboptimal. While at it, also: - move the irq disabling to the outest layer, this should make cpu_clock() warp-free when called with irqs enabled. - use long long instead of cycles_t - for platforms where cycles_t is 32-bit. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Miao Xie	cb4ad1ffc7	sched: fair-group: fix a Div0 error of the fair group scheduler When I echoed 0 into the "cpu.shares" file, a Div0 error occured. We found it is caused by the following calling. sched_group_set_shares(tg, shares) set_se_shares(tg->se[i], shares/nr_cpu_ids) __set_se_shares(se, shares) div64_64((1ULL<<32), shares) When the echoed value was less than the number of processores, the result of the sentence "shares/nr_cpu_ids" was 0, and then the system called div64() to divide the result, the Div0 error occured. It is unnecessary that the shares value is divided by nr_cpu_ids, I think. Because in the function __update_group_shares_cpu() and init_tg_cfs_entry(), the shares value isn't divided by nr_cpu_ids when setting shares of the sched entity. This patch fixes this bug. And echoing ULONG_MAX value into cpu.shares also causes Div0 error, so we set a macro MAX_SHARES to limit the max value of shares. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Heiko Carstens	712555ee4f	sched: fix missing locking in sched_domains code Concurrent calls to detach_destroy_domains and arch_init_sched_domains were prevented by the old scheduler subsystem cpu hotplug mutex. When this got converted to get_online_cpus() the locking got broken. Unlike before now several processes can concurrently enter the critical sections that were protected by the old lock. So use the already present doms_cur_mutex to protect these sections again. Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Ingo Molnar	690229a091	sched: make clock sync tunable by architecture code make time_sync_thresh tunable to architecture code. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Mike Galbraith	d7dcdc11cf	sched: fix debugging Revert debugging commit `7ba2e74ab5`. print_cfs_rq_tasks() can induce live-lock if a task is dequeued during list traversal. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
David Simner	673a90a1e0	sched: fix sched_info_switch not being called according to documentation http://bugzilla.kernel.org/show_bug.cgi?id=10545 sched_stats.h says that __sched_info_switch is "called when prev != next" in the comment. sched.c should therefore do that. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Peter Zijlstra	b328ca182f	sched: fix hrtick_start_fair and CPU-Hotplug Gautham R Shenoy reported: > While running the usual CPU-Hotplug stress tests on linux-2.6.25, > I noticed the following in the console logs. > > This is a wee bit difficult to reproduce. In the past 10 runs I hit this > only once. > > ------------[ cut here ]------------ > > WARNING: at kernel/sched.c:962 hrtick+0x2e/0x65() > > Just wondering if we are doing a good job at handling the cancellation > of any per-cpu scheduler timers during CPU-Hotplug. This looks like its indeed not cancelled at all and migrates the it to another cpu. Fix it via a proper hotplug notifier mechanism. Reported-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Gregory Haskins	104f64549c	sched: fix SCHED_FAIR wake-idle logic error We currently use an optimization to skip the overhead of wake-idle processing if more than one task is assigned to a run-queue. The assumption is that the system must already be load-balanced or we wouldnt be overloaded to begin with. The problem is that we are looking at rq->nr_running, which may include RT tasks in addition to CFS tasks. Since the presence of RT tasks really has no bearing on the balance status of CFS tasks, this throws the calculation off. This patch changes the logic to only consider the number of CFS tasks when making the decision to optimze the wake-idle. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Gregory Haskins	8ae121ac86	sched: fix RT task-wakeup logic Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where we will always evaluate to "true". You can find the thread here: http://lkml.org/lkml/2008/4/22/296 In reality, we only want to try to push tasks away when a wake up request is not going to preempt the current task. So lets fix it. Note: We introduce test_tsk_need_resched() instead of open-coding the flag check so that the merge-conflict with -rt should help remind us that we may need to support NEEDS_RESCHED_DELAYED in the future, too. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Harvey Harrison	983ed7a66b	sched: add statics, don't return void expressions Noticed by sparse: kernel/sched.c:760:20: warning: symbol 'sched_feat_names' was not declared. Should it be static? kernel/sched.c:767:5: warning: symbol 'sched_feat_open' was not declared. Should it be static? kernel/sched_fair.c:845:3: warning: returning void-valued expression kernel/sched.c:4386:3: warning: returning void-valued expression Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:17 +02:00
Andrew Morton	d478c2cfaa	sched: add debug checks to idle functions Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: "Justin Mattock" <justinmattock@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:17 +02:00
Harvey Harrison	2abdad0a4c	sched: make rt_sched_class, idle_sched_class static The C files are included directly in sched.c, so they are effectively static. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:17 +02:00
Peter Zijlstra	e05510d01a	sched: optimize calc_delta_mine() Joel noticed that the !lw->inv_weight contition isn't unlikely anymore so remove the unlikely annotation. Also, remove the two div64_u64() inv_weight calculations, which makes them rely on the calc_delta_mine() path as well. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Joel Schopp <jschopp@austin.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:17 +02:00
Peter Zijlstra	a992241de6	sched: fix normalized sleeper Normalized sleeper uses calc_delta*() which requires that the rq load is already updated, so move account_entity_enqueue() before place_entity() Tested-by: Frans Pop <elendil@planet.nl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:17 +02:00
Linus Torvalds	5717922a1b	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb: kconfig fix xconfig/menuconfig element kgdb: fix signedness mixmatches, add statics, add declaration to header kgdb: 1000 loops for the single step test in kgdbts kgdb: trivial sparse fixes in kgdb test-suite kgdb: minor documentation fixes	2008-05-05 10:17:30 -07:00
Eric Sesterhenn	82af7aca56	Removal of FUTEX_FD Since FUTEX_FD was scheduled for removal in June 2007 lets remove it. Google Code search found no users for it and NGPT was abandoned in 2003 according to IBM. futex.h is left untouched to make sure the id does not get reassigned. Since queue_me() has no users left it is commented out to avoid a warning, i didnt remove it completely since it is part of the internal api (matching unqueue_me()) Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed rest) Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-05 08:18:45 -07:00
Harvey Harrison	688b744d8b	kgdb: fix signedness mixmatches, add statics, add declaration to header Noticed by sparse: arch/x86/kernel/kgdb.c:556:15: warning: symbol 'kgdb_arch_pc' was not declared. Should it be static? kernel/kgdb.c:149:8: warning: symbol 'kgdb_do_roundup' was not declared. Should it be static? kernel/kgdb.c:193:22: warning: symbol 'kgdb_arch_pc' was not declared. Should it be static? kernel/kgdb.c:712:5: warning: symbol 'remove_all_break' was not declared. Should it be static? Related to kgdb_hex2long: arch/x86/kernel/kgdb.c:371:28: warning: incorrect type in argument 2 (different signedness) arch/x86/kernel/kgdb.c:371:28: expected long long_val arch/x86/kernel/kgdb.c:371:28: got unsigned long <noident> kernel/kgdb.c:469:27: warning: incorrect type in argument 2 (different signedness) kernel/kgdb.c:469:27: expected long long_val kernel/kgdb.c:469:27: got unsigned long <noident> kernel/kgdb.c:470:27: warning: incorrect type in argument 2 (different signedness) kernel/kgdb.c:470:27: expected long long_val kernel/kgdb.c:470:27: got unsigned long <noident> kernel/kgdb.c:894:27: warning: incorrect type in argument 2 (different signedness) kernel/kgdb.c:894:27: expected long long_val kernel/kgdb.c:894:27: got unsigned long <noident> kernel/kgdb.c:895:27: warning: incorrect type in argument 2 (different signedness) kernel/kgdb.c:895:27: expected long long_val kernel/kgdb.c:895:27: got unsigned long <noident> kernel/kgdb.c:1127:28: warning: incorrect type in argument 2 (different signedness) kernel/kgdb.c:1127:28: expected long long_val kernel/kgdb.c:1127:28: got unsigned long <noident> kernel/kgdb.c:1132:25: warning: incorrect type in argument 2 (different signedness) kernel/kgdb.c:1132:25: expected long long_val kernel/kgdb.c:1132:25: got unsigned long <noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2008-05-05 07:13:21 -05:00
Linus Torvalds	826e4506a0	Make forced module loading optional The kernel module loader used to be much too happy to allow loading of modules for the wrong kernel version by default. For example, if you had MODVERSIONS enabled, but tried to load a module with no version info, it would happily load it and taint the kernel - whether it was likely to actually work or not! Generally, such forced module loading should be considered a really really bad idea, so make it conditional on a new config option (MODULE_FORCE_LOAD), and make it default to off. If somebody really wants to force module loads, that's their problem, but we should not encourage it. Especially as it happened to me by mistake (ie regular unversioned Fedora modules getting loaded) causing lots of strange behavior. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-04 17:04:16 -07:00
Linus Torvalds	afa26be86b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: clocksource: allow read access to available/current_clocksource clocksource: Fix permissions for available_clocksource hrtimer: remove duplicate helper function	2008-05-03 13:51:10 -07:00
Heiko Carstens	4f95f81a48	clocksource: allow read access to available/current_clocksource There is no harm, when users can read the info and we ask often enough during debugging for this kind of information. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-03 18:11:48 +02:00
Heiko Carstens	4359a023a8	clocksource: Fix permissions for available_clocksource File permissions for /sys/devices/system/clocksource/clocksource0/available_clocksource are 600 which allows write access. But this is in fact a read only file. So change permissions to 400. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-03 18:11:48 +02:00
Oliver Hartkopp	4346f65426	hrtimer: remove duplicate helper function The helper function hrtimer_callback_running() is used in kernel/hrtimer.c as well as in the updated net/can/bcm.c which now supports hrtimers. Moving the helper function to hrtimer.h removes the duplicate definition in the C-files. Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net> Cc: David Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-05-03 18:11:48 +02:00
H. Peter Anvin	b9095fd8a7	Make constants in kernel/timeconst.h fixed 64 bits Force constants in kernel/timeconst.h (except shift counts) to be 64 bits, using U64_C() constructor macros, and eliminate constants that cannot be represented at all in 64 bits. This avoids warnings with some gcc versions. Drop generating 64-bit constants, since we have no real hope of getting a full set (operation on 64-bit values requires a 128-bit intermediate result, which gcc only supports on 64-bit platforms, and only with libgcc support on some.) Note that the use of these constants does not depend on if we are on a 32- or 64-bit architecture. This resolves Bugzilla 10153. Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2008-05-02 16:18:42 -07:00
Linus Torvalds	b66e1f11eb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: [PATCH] fix sysctl_nr_open bugs [PATCH] sanitize anon_inode_getfd() [PATCH] split linux/file.h [PATCH] make osf_select() use core_sys_select() [PATCH] remove horrors with irix tty ioctls handling [PATCH] fix file and descriptor handling in perfmon	2008-05-02 11:23:14 -07:00
Thomas Gleixner	1adb0850a1	genirq: reenable a nobody cared disabled irq when a new driver arrives Uwe Kleine-Koenig has some strange hardware where one of the shared interrupts can be asserted during boot before the appropriate driver loads. Requesting the shared irq line from another driver result in a spurious interrupt storm which finally disables the interrupt line. I have seen similar behaviour on resume before (the hardware does not work anymore so I can not verify). Change the spurious disable logic to increment the disable depth and mark the interrupt with an extra flag which allows us to reenable the interrupt when a new driver arrives which requests the same irq line. In the worst case this will disable the irq again via the spurious trap, but there is a decent chance that the new driver is the one which can handle the already asserted interrupt and makes the box usable again. Eric Biederman said further: This case also happens on a regular basis in kdump kernels where we deliberately don't shutdown the hardware before starting the new kernel. This patch should reduce the need for using irqpoll in that situation by a small amount. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-and-Acked-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>	2008-05-02 13:40:34 +02:00
Christoph Hellwig	bcf35afb52	make generic sys_ptrace unconditional With s390 the last arch switched to the generic sys_ptrace yesterday so we can now kill the ifdef around it to enforce every new port it using it instead of introducing new weirdo versions. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 10:21:54 -07:00
Al Viro	9f3acc3140	[PATCH] split linux/file.h Initial splitoff of the low-level stuff; taken to fdtable.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-05-01 13:08:16 -04:00
Linus Torvalds	03fc922f40	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: add MODULE_STATE_GOING notifier call module: Enhance verify_export_symbols module: set unused_gpl_crcs instead of overwriting unused_crcs module: neaten __find_symbol, rename to find_symbol module: reduce module image and resident size module: make module_sect_attrs private to kernel/module.c	2008-05-01 08:26:56 -07:00
Andrew Liu	8a3e77cc21	workqueue: remove redundant function invocation timer_stats_timer_set_start_info is invoked twice, additionally, the invocation of this function can be moved to where it is only called when a delay is really required. Signed-off-by: Andrew Liu <shengping.liu@windriver.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:04:02 -07:00
Michael Ellerman	be089d79c4	kexec: make extended crashkernel= syntax less confusing The extended crashkernel syntax is a little confusing in the way it handles ranges. eg: crashkernel=512M-2G:64M,2G-:128M Means if the machine has between 512M and 2G of memory the crash region should be 64M, and if the machine has 2G of memory the region should be 64M. Only if the machine has more than 2G memory will 128M be allocated. Although that semantic is correct, it is somewhat baffling. Instead I propose that the end of the range means the first address past the end of the range, ie: 512M up to but not including 2G. [bwalle@suse.de: clarify inclusive/exclusive in crashkernel commandline in documentation] Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Acked-by: Bernhard Walle <bwalle@suse.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Simon Horman <horms@verge.net.au> Signed-off-by: Bernhard Walle <bwalle@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:04:00 -07:00
Roman Zippel	7dffa3c673	ntp: handle leap second via timer Remove the leap second handling from second_overflow(), which doesn't have to check for it every second anymore. With CONFIG_NO_HZ this also makes sure the leap second is handled close to the full second. Additionally this makes it possible to abort a leap second properly by resetting the STA_INS/STA_DEL status bits. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	8383c42399	ntp: remove current_tick_length() current_tick_length used to do a little more, but now it just returns tick_length, which we can also access directly at the few places, where it's needed. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	7fc5c78409	ntp: rename TICK_LENGTH_SHIFT to NTP_SCALE_SHIFT As TICK_LENGTH_SHIFT is used for more than just the tick length, the name isn't quite approriate anymore, so this renames it to NTP_SCALE_SHIFT. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	153b5d054a	ntp: support for TAI This adds support for setting the TAI value (International Atomic Time). The value is reported back to userspace via timex (as we don't have a ntp_gettime() syscall). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:59 -07:00
Roman Zippel	9f14f669d1	ntp: increase time_offset resolution time_offset is already a 64bit value but its resolution barely used, so this makes better use of it by replacing SHIFT_UPDATE with TICK_LENGTH_SHIFT. Side note: the SHIFT_HZ in SHIFT_UPDATE was incorrect for CONFIG_NO_HZ and the primary reason for changing time_offset to 64bit to avoid the overflow. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	074b3b8794	ntp: increase time_freq resolution This changes time_freq to a 64bit value and makes it static (the only outside user had no real need to modify it). Intermediate values were already 64bit, so the change isn't that big, but it saves a little in shifts by replacing SHIFT_NSEC with TICK_LENGTH_SHIFT. PPM_SCALE is then used to convert between user space and kernel space representation. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	eea83d896e	ntp: NTP4 user space bits update This adds a few more things from the ntp nanokernel related to user space. It's now possible to select the resolution used of some values via STA_NANO and the kernel reports in which mode it works (pll/fll). If some values for adjtimex() are outside the acceptable range, they are now simply normalized instead of letting the syscall fail. I removed MOD_CLKA/MOD_CLKB as the mapping didn't really makes any sense, the kernel doesn't support setting the clock. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	ee9851b218	ntp: cleanup ntp.c This is mostly a style cleanup of ntp.c and extracts part of do_adjtimex as ntp_update_offset(). Otherwise the functionality is still the same as before. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	f8bd2258e2	remove div_long_long_rem x86 is the only arch right now, which provides an optimized for div_long_long_rem and it has the downside that one has to be very careful that the divide doesn't overflow. The API is a little akward, as the arguments for the unsigned divide are signed. The signed version also doesn't handle a negative divisor and produces worse code on 64bit archs. There is little incentive to keep this API alive, so this converts the few users to the new API. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	6f6d6a1a6a	rename div64_64 to div64_u64 Rename div64_64 to div64_u64 to make it consistent with the other divide functions, so it clearly includes the type of the divide. Move its definition to math64.h as currently no architecture overrides the generic implementation. They can still override it of course, but the duplicated declarations are avoided. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: Avi Kivity <avi@qumranet.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Roman Zippel	71abb3af62	convert a few do_div users This converts a few users of do_div to div_[su]64 and this demonstrates nicely how it can reduce some expressions to one-liners. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Christian Borntraeger	e5e417232e	Fix cpu hotplug problem in softirq code currently cpu hotplug (unplug) seems broken on s390 and likely others. On cpu unplug the system starts to behave very strange and hangs. I bisected the problem to the following commit: commit `48f20a9a94` Author: Olof Johansson <olof@lixom.net> Date: Tue Mar 4 15:23:25 2008 -0800 tasklets: execute tasklets in the same order they were queued Reverting this patch seems to fix the problem. I looked into takeover_tasklet and it seems that there is a way to corrupt the tail pointer of the current cpu. If the tasklet list of the frozen cpu is empty, the tail pointer of the current cpu points to the address of the head pointer of the stopped cpu and not to the next pointer of a tasklet_struct. This patch avoids the list splice of the list is empty and cpu hotplug seems to work as the tail pointer is not corrupted. Olof, can you look into that patch and ACK/NACK it so Andrew can push this to Linus, if appropriate? Please note that some lines are longer than 80 chars, but line-wrapping looked worse that this version. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Olof Johansson <olof@lixom.net> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-05-01 08:03:58 -07:00
Peter Oberparleiter	df4b565e1f	module: add MODULE_STATE_GOING notifier call Provide module unload callback. Required by the gcov profiling infrastructure to keep track of profiling data structures. Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-01 21:15:01 +10:00
Rusty Russell	b211104d11	module: Enhance verify_export_symbols Make verify_export_symbols check the modules unused, unused_gpl and gpl_future syms. Inspired by Jan Beulich's fix, but table-driven. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-01 21:15:00 +10:00
Rusty Russell	4e2d92454b	module: set unused_gpl_crcs instead of overwriting unused_crcs Obvious typo, but I don't know of any modules with unused GPL exports, and then it would take someone noticing that the version shouldn't have matched in a dependent module. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-01 21:15:00 +10:00
Rusty Russell	ad9546c991	module: neaten __find_symbol, rename to find_symbol __find_symbol() has grown over time: there are now 5 different arrays of symbols it traverses. It also shouldn't print out a warning on some calls (ie. verify_symbol which simply checks for name clashes, and __symbol_put which checks for bugs). 1) Rename to find_symbol: no need for underscores. 2) Use bool and add "warn" parameter to suppress warnings. 3) Make table-driven rather than open coded. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-01 21:15:00 +10:00
Rusty Russell	ea01e798e2	module: reduce module image and resident size Resulting reduction (x86-64, gcc 4.1.2) with my (special purpose, i.e. much reduced) configurations: - 16k kernel resident size - 180k module resident size - 10k module image size Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-01 21:14:59 +10:00
Rusty Russell	a58730c421	module: make module_sect_attrs private to kernel/module.c No-one else is using these afaics. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-05-01 21:14:59 +10:00
Linus Torvalds	08acd4f8af	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (179 commits) ACPI: Fix acpi_processor_idle and idle= boot parameters interaction acpi: fix section mismatch warning in pnpacpi intel_menlo: fix build warning ACPI: Cleanup: Remove unneeded, multiple local dummy variables ACPI: video - fix permissions on some proc entries ACPI: video - properly handle errors when registering proc elements ACPI: video - do not store invalid entries in attached_array list ACPI: re-name acpi_pm_ops to acpi_suspend_ops ACER_WMI/ASUS_LAPTOP: fix build bug thinkpad_acpi: fix possible NULL pointer dereference if kstrdup failed ACPI: check a return value correctly in acpi_power_get_context() #if 0 acpi/bay.c:eject_removable_drive() eeepc-laptop: add hwmon fan control eeepc-laptop: add backlight eeepc-laptop: add base driver ACPI: thinkpad-acpi: bump up version to 0.20 ACPI: thinkpad-acpi: fix selects in Kconfig ACPI: thinkpad-acpi: use a private workqueue ACPI: thinkpad-acpi: fluff really minor fix ACPI: thinkpad-acpi: use uppercase for "LED" on user documentation ... Fixed conflicts in drivers/acpi/video.c and drivers/misc/intel_menlow.c manually.	2008-04-30 11:52:52 -07:00
Len Brown	96916090f4	Merge branches 'release', 'acpica', 'bugzilla-10224', 'bugzilla-9772', 'bugzilla-9916', 'ec', 'eeepc', 'idle', 'misc', 'pm-legacy', 'sysfs-links-2.6.26', 'thermal', 'thinkpad' and 'video' into release	2008-04-30 13:58:00 -04:00
Harvey Harrison	af1f16d08f	kernel: replace remaining __FUNCTION__ occurrences __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:54 -07:00
Thomas Gleixner	237fc6e7a3	add hrtimer specific debugobjects code hrtimers have now dynamic users in the network code. Put them under debugobjects surveillance as well. Add calls to the generic object debugging infrastructure and provide fixup functions which allow to keep the system alive when recoverable problems have been detected by the object debugging core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:53 -07:00
Thomas Gleixner	c6f3a97f86	debugobjects: add timer specific object debugging code Add calls to the generic object debugging infrastructure and provide fixup functions which allow to keep the system alive when recoverable problems have been detected by the object debugging core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:53 -07:00
Andrew Morton	354a1f4d99	alloc_uid: cleanup Use kmem_cache_zalloc(), remove large amounts of initialisation code and ifdeffery. Note: this assumes that memset(*atomic_t, 0) correctly initialises the atomic_t. This is true for all present archtiectures and if it becomes false for a future architecture then we'll need to make large changes all over the place anyway. Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:53 -07:00
Markus Armbruster	f735295b14	printk: don't read beyond string arguments' terminating zero Fix update_console_cmdline() not to to read beyond the terminating zero of its name argument. Signed-off-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:52 -07:00
Samuel Thibault	f7511d5f66	Basic braille screen reader support This adds a minimalistic braille screen reader support. This is meant to be used by blind people e.g. on boot failures or when / cannot be mounted etc and thus the userland screen readers can not work. [akpm@linux-foundation.org: fix exports] Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Jiri Kosina <jikos@jikos.cz> Cc: Dmitry Torokhov <dtor@mail.ru> Acked-by: Alan Cox <alan@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:52 -07:00
Miklos Szeredi	e4ad08fe64	mm: bdi: add separate writeback accounting capability Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is set, then don't update the per-bdi writeback stats from test_set_page_writeback() and test_clear_page_writeback(). Misc cleanups: - convert bdi_cap_writeback_dirty() and friends to static inline functions - create a flag that includes all three dirty/writeback related flags, since almst all users will want to have them toghether Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:50 -07:00
Pavel Emelyanov	caafa43243	pidns: make pid->level and pid_ns->level unsigned These values represent the nesting level of a namespace and pids living in it, and it's always non-negative. Turning this from int to unsigned int saves some space in pid.c (11 bytes on x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit. E.g. on ia64 this removes the sign extension calls, which compiler adds to optimize access to pid->nubers[ns->level]. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:49 -07:00
Adrian Bunk	ab883af53e	make marker_debug static With the needlessly global marker_debug being static gcc can optimize the unused code away. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:49 -07:00
Oleg Nesterov	12a3de0a96	pids: sys_getpgid: fix unsafe *pid usage, s/tasklist/rcu/ 1. sys_getpgid() needs rcu_read_lock() to derive the pgrp _nr, even if the task is current, otherwise we can race with another thread which does sys_setpgid(). 2. Use rcu_read_lock() instead of tasklist_lock when pid != 0, make sure that we don't use the NULL pid if the task exits right after successful find_task_by_vpid(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:49 -07:00
Oleg Nesterov	1dd768c081	pids: sys_getsid: fix unsafe *pid usage, fix possible 0 instead of -ESRCH 1. sys_getsid() needs rcu_read_lock() to derive the session _nr, even if the task is current, otherwise we can race with another thread which does sys_setsid(). 2. The task can exit between find_task_by_vpid() and task_session_vnr(), in that unlikely case sys_getsid() returns 0 instead of -ESRCH. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:49 -07:00
Oleg Nesterov	7d8da0962e	pids: __set_special_pids: use change_pid() helper Use change_pid() instead of detach_pid() + attach_pid() in __set_special_pids(). This way task_session() is not NULL in between. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:49 -07:00
Oleg Nesterov	83beaf3c6c	pids: sys_setpgid: use change_pid() helper Use change_pid() instead of detach_pid() + attach_pid() in sys_setpgid(). This way task_pgrp() is not NULL in between. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:49 -07:00
Oleg Nesterov	24336eaeec	pids: introduce change_pid() helper Based on Eric W. Biederman's idea. Without tasklist_lock held task_session()/task_pgrp() can return NULL if the caller races with setprgp()/setsid() which does detach_pid() + attach_pid(). This can happen even if task == current. Intoduce the new helper, change_pid(), which should be used instead. This way the caller always sees the special pid != NULL, either old or new. Also change the prototype of attach_pid(), it always returns 0 and nobody check the returned value. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:48 -07:00
Oleg Nesterov	65450cebc6	pids: de_thread: don't clear session/pgrp pids for the old leader Based on Eric W. Biederman's idea. Unless task == current, without tasklist_lock held task_session()/task_pgrp() can return NULL if the caller races with de_thread() which switches the group leader. Change transfer_pid() to not clear old->pids[type].pid for the old leader. This means that its .pid can point to "nowhere", but this is already true for sub-threads, and the old leader is not group_leader() any longer. IOW, with or without this change we can't trust task's special pids unless it is the group leader. With this change the following code rcu_read_lock(); task = find_task_by_xxx(); do_something(task_pgrp(task), task_session(task)); rcu_read_unlock(); can't race with exec and hit the NULL pid. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:48 -07:00
Pavel Emelyanov	5cd204550b	Deprecate find_task_by_pid() There are some places that are known to operate on tasks' global pids only: * the rest_init() call (called on boot) * the kgdb's getthread * the create_kthread() (since the kthread is run in init ns) So use the find_task_by_pid_ns(..., &init_pid_ns) there and schedule the find_task_by_pid for removal. [sukadev@us.ibm.com: Fix warning in kernel/pid.c] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:48 -07:00
Pavel Emelyanov	cb41d6d068	Use find_task_by_vpid in taskstats The pid to lookup a task by is passed inside taskstats code via genetlink message. Since netlink packets are now processed in the context of the sending task, this is correct to lookup the task with find_task_by_vpid() here. Besides, I fix the call to fill_pid() from taskstats_exit(), since the tsk->pid is not required in fill_pid() in this case, and the pid field on task_struct is going to be deprecated as well. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Jonathan Lim <jlim@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:48 -07:00
Oleg Nesterov	b7127aa454	free_pidmap: turn it into free_pidmap(struct upid ) The callers of free_pidmap() pass 2 members of "struct upid", we can just pass "struct upid " instead. Shaves off 10 bytes from pid.o. Also, simplify the alloc_pid's "out_free:" error path a little bit. This way it looks more clear which subset of pid->numbers[] we are freeing. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc :Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:48 -07:00
Alan Cox	f34d7a5b70	tty: The big operations rework - Operations are now a shared const function block as with most other Linux objects - Introduce wrappers for some optional functions to get consistent behaviour - Wrap put_char which used to be patched by the tty layer - Document which functions are needed/optional - Make put_char report success/fail - Cache the driver->ops pointer in the tty as tty->ops - Remove various surplus lock calls we no longer need - Remove proc_write method as noted by Alexey Dobriyan - Introduce some missing sanity checks where certain driver/ldisc combinations would oops as they didn't check needed methods were present [akpm@linux-foundation.org: fix fs/compat_ioctl.c build] [akpm@linux-foundation.org: fix isicom] [akpm@linux-foundation.org: fix arch/ia64/hp/sim/simserial.c build] [akpm@linux-foundation.org: fix kgdb] Signed-off-by: Alan Cox <alan@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Cc: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:47 -07:00
Oleg Nesterov	00cd5c37af	ptrace: permit ptracing of /sbin/init Afaics, currently there are no kernel problems with ptracing init, it can't lose SIGNAL_UNKILLABLE flag and be killed/stopped by accident. The ability to strace/debug init can be very useful if you try to figure out why it does not work as expected. However, admin should know what he does, "gdb /sbin/init 1" stops init, it can't reap orphaned zombies or take care of /etc/inittab until continued. It is even possible to crash init (and thus the whole system) if you wish, ptracer has full control. See also the long discussion: http://marc.info/?t=120628018600001 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:38 -07:00
Oleg Nesterov	33e9fc7d01	ptrace: ptrace_attach: use send_sig_info() instead force_sig_specific() Nobody can block/ignore SIGSTOP, no need to use force_sig_specific() in ptrace_attach. Use the "regular" send_sig_info(). With this patch stracing of /sbin/init doesn't clear its SIGNAL_UNKILLABLE, but not that this makes ptracing of init safe. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:38 -07:00
Oleg Nesterov	68cb947866	ptrace: __ptrace_unlink: use the ptrace_reparented() helper Currently __ptrace_unlink() checks list_empty(->ptrace_list) to figure out whether the child was reparented. Change the code to use ptrace_reparented() to make this check more explicit and consistent. No functional changes. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:38 -07:00
Oleg Nesterov	53b6f9fbd3	ptrace: introduce ptrace_reparented() helper Add another trivial helper for the sake of grep. It also auto-documents the fact that ->parent != real_parent implies ->ptrace. No functional changes. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:38 -07:00
Oleg Nesterov	2800d8d19e	document de_thread() with exit_notify() connection Add a couple of small comments, it is not easy to see what this code does. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:38 -07:00
Oleg Nesterov	376e1d2531	reparent_thread: use same_thread_group() Trivial, use same_thread_group() in reparent_thread(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:38 -07:00
Oleg Nesterov	d839fd4d2e	ptrace: introduce task_detached() helper exit.c has numerous "->exit_signal == -1" comparisons, this check is subtle and deserves a helper. Imho makes the code more parseable for humans. At least it's surely more greppable. Also, a couple of whitespace cleanups. No functional changes. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:38 -07:00
Roland McGrath	4e4c22c711	signals: add set_restore_sigmask This adds the set_restore_sigmask() inline in <linux/thread_info.h> and replaces every set_thread_flag(TIF_RESTORE_SIGMASK) with a call to it. No change, but abstracts the details of the flag protocol from all the calls. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Oleg Nesterov	80fe728d59	signals: allow the kernel to actually kill /sbin/init Currently the buggy /sbin/init hangs if SIGSEGV/etc happens. The kernel sends the signal, init dequeues it and ignores, returns from the exception, repeats the faulting instruction, and so on forever. Imho, such a behaviour is not good. I think that the explicit loud death of the buggy /sbin/init is better than the silent hang. Change force_sig_info() to clear SIGNAL_UNKILLABLE when the task should be really killed. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Oleg Nesterov	fae5fa44f1	signals: fix /sbin/init protection from unwanted signals The global init has a lot of long standing problems with the unhandled fatal signals. - The "is_global_init(current)" check in get_signal_to_deliver() protects only the main thread. Sub-thread can dequee the fatal signal and shutdown the whole thread group except the main thread. If it dequeues SIGSTOP /sbin/init will be stopped, this is not right too. Note that we can't use is_global_init(->group_leader), this breaks exec and this can't solve other problems we have. - Even if afterwards ignored, the fatal signals sets SIGNAL_GROUP_EXIT on delivery. This breaks exec, has other bad implications, and this is just wrong. Introduce the new SIGNAL_UNKILLABLE flag to fix these problems. It also helps to solve some other problems addressed by the subsequent patches. Currently we use this flag for the global init only, but it could also be used by kthreads and (perhaps) by the sub-namespace inits. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Oleg Nesterov	193191035a	signals: check_kill_permission: remove tasklist_lock Now that task_session() can't return a false NULL, check_kill_permission() doesn't need tasklist_lock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Oleg Nesterov	2e2ba22ea4	signals: check_kill_permission: check session under tasklist_lock This wasn't documented, but as Atsushi Tsuji pointed out check_kill_permission() needs tasklist_lock for task_session_nr(). I missed this fact when removed tasklist from the callers. Change check_kill_permission() to take tasklist_lock for the SIGCONT case. Re-order security checks so that we take tasklist_lock only if/when it is actually needed. This is a minimal fix for now, tasklist will be removed later. Also change the code to use task_session() instead of task_session_nr(). Also, remove the SIGCONT check from cap_task_kill(), it is bogus (and the whole function is bogus. Serge, Eric, why it is still alive?). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Atsushi Tsuji <a-tsuji@bk.jp.nec.com> Cc: Roland McGrath <roland@redhat.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:37 -07:00
Oleg Nesterov	53c30337f2	signals: send_signal: be paranoid about signalfd_notify() send_signal() shouldn't call signalfd_notify() if it then fails with -EAGAIN. Harmless, just a paranoid cleanup. Also remove the comment. It is obsolete, signalfd_notify() was simplified and does a simple wakeup. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	021e1ae3d8	signals: document CLD_CONTINUED notification mechanics A couple of small comments about how CLD_CONTINUED notification works. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	7e695a5ef5	signals: fold sig_ignored() into handle_stop_signal() Rename handle_stop_signal() to prepare_signal(), make it return a boolean, and move the callsites of sig_ignored() into it. No functional changes for now. But it would be nice to factor out the "should we drop this signal" checks as much as possible, before we try to fix the bugs with the sub-namespace init's signals (actually the global /sbin/init has some problems with signals too). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	2dce81bff2	signals: cleanup the usage of print_fatal_signal() Move the callsite of print_fatal_signal() down, under "if (sig_kernel_coredump(signr))", so we don't need to check signr != SIGKILL. We are only interested in the sig_kernel_coredump() signals anyway, and due to the previous changes we almost never can see other fatal signals here except SIGKILL. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	34c8f07b9a	signals: handle_stop_signal: don't worry about SIGKILL handle_stop_signal() clears SIGNAL_STOP_DEQUEUED when sig == SIGKILL. Remove this nasty special case. It was needed to prevent the race with group stop and exit caused by thread-specific SIGKILL. Now that we use complete_signal() for private signals too this is not needed, complete_signal() will notice SIGKILL and abort the soon-to-begin group stop. Except: the target thread is dead (has PF_EXITING). But in that case we should not just clear SIGNAL_STOP_DEQUEUED and nothing more. We should either kill the whole thread group, or silently ignore the signal. I suspect we are not right wrt zombie leaders, but this is another issue which and should be fixed separately. Note that this check can't abort the group stop if it was already started/finished, this check only adds a subtle side effect if we race with the thread which has already dequeued sig_kernel_stop() signal and temporary released ->siglock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	ac5c215383	signals: join send_sigqueue() with send_group_sigqueue() We export send_sigqueue() and send_group_sigqueue() for the only user, posix_timer_event(). This is a bit silly, because both are just trivial helpers on top of do_send_sigqueue() and because the we pass the unused .si_signo parameter. Kill them both, rename do_send_sigqueue() to send_sigqueue(), and export it. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	e62e6650e9	signals: unify send_sigqueue/send_group_sigqueue completely Suggested by Pavel Emelyanov. send_sigqueue/send_group_sigqueue are only differ in how they lock ->siglock. Unify them. send_group_sigqueue() uses spin_lock() because it knows the task can't exit, but in that case lock_task_sighand() can't fail and doesn't hurt. Note that the "sig" argument is ignored, it is always equal to ->si_signo. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Pavel Emelyanov	4cd4b6d4e0	signals: fold complete_signal() into send_signal/do_send_sigqueue Factor out complete_signal() callsites. This change completely unifies the helpers sending the specific/group signals. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	5fcd835bf8	signals: use __group_complete_signal() for the specific signals too Based on Pavel Emelyanov's suggestion. Rename __group_complete_signal() to complete_signal() and use it to process the specific signals too. To do this we simply add the "int group" argument. This allows us to greatly simply the signal-sending code and adds a useful behaviour change. We can avoid the unneeded wakeups for the private signals because wants_signal() is more clever than sigismember(blocked), but more importantly we now take into account the fatal specific signals too. The latter allows us to kill some subtle checks in handle_stop_signal() and makes the specific/group signal's behaviour more consistent. For example, currently sigtimedwait(FATAL_SIGNAL) behaves differently depending on was the signal sent by kill() or tkill() if the signal was not blocked. And. This allows us to tweak/fix the behaviour when the specific signal is sent to the dying/dead ->group_leader. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:36 -07:00
Oleg Nesterov	2ca3515aa5	signals: change send_signal/do_send_sigqueue to take "boolean group" parameter send_signal() is used either with ->pending or with ->signal->shared_pending. Change it to take "int group" instead, this argument will be re-used later. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	71f11dc025	signals: move the definition of __group_complete_signal() up Move the unchanged definition of __group_complete_signal() so that send_signal can see it. To simplify the reading of the next patches. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	db51aeccd7	signals: microoptimize the usage of ->curr_target Suggested by Roland McGrath. Initialize signal->curr_target in copy_signal(). This way ->curr_target is never == NULL, we can kill the check in __group_complete_signal's hot path. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	08d2c30ce9	signals: send_sig_info: don't take tasklist_lock The comment in send_sig_info() is wrong, tasklist_lock can't help. The caller must ensure the task can't go away, otherwise ->sighand can be NULL even before we take the lock. p->sighand could be changed by exec(), but I can't imagine how it is possible to prevent exit(), but not exec(). Since the things seem to work, I assume all callers are correct. However, drm_vbl_send_signals() looks broken. block_all_signals() which is solely used by drm is definitely broken. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	3547ff3aef	signals: do_tkill: don't use tasklist_lock Convert do_tkill() to use rcu_read_lock() + lock_task_sighand() to avoid taking tasklist lock. Note that we don't return an error if lock_task_sighand() fails, we pretend the task dies after receiving the signal. Otherwise, we should fight with the nasty races with mt-exec without having any advantage. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	6e65acba7c	signals: move handle_stop_signal() into send_signal() Move handle_stop_signal() into send_signal(). This factors out a couple of callsites and allows us to do further unifications. Also, with this change specific_send_sig_info() does handle_stop_signal(). Not that this is really important, we never send STOP/CONT via send_sig() and friends, but still this looks more consistent. The only (afaics) special case is get_signal_to_deliver(). If the traced task dequeues SIGCONT, it can re-send it to itself after ptrace_stop() if the signal was blocked by debugger. In that case handle_stop_signal() is unnecessary, but hopefully not a problem. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	c99fcf28b8	signals: send_group_sigqueue: don't take tasklist_lock handle_stop_signal() was changed, now send_group_sigqueue() doesn't need tasklist_lock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	f8c5b5c06f	signals: __group_complete_signal: cache the value of p->signal Cosmetic, cache p->signal to make the code a bit more readable. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	5fc894bb4f	signals: send_sigqueue: don't forget about handle_stop_signal() send_group_sigqueue() calls handle_stop_signal(), send_sigqueue() doesn't. This is not consistent and in fact I'd say this is (minor) bug. Move handle_stop_signal() from send_group_sigqueue() to do_send_sigqueue(), the latter is called by send_sigqueue() too. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	5c193e8871	signals: send_sigqueue: don't take rcu lock lock_task_sighand() was changed, send_sigqueue() doesn't need rcu_read_lock() any longer. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	f6b76d4fb0	get_signal_to_deliver: use the cached ->signal/sighand values Cache the values of current->signal/sighand. Shrinks .text a bit and makes the code more readable. Also, remove "sigset_t *mask", it is pointless because in fact we save the constant offset. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:35 -07:00
Oleg Nesterov	ad16a46069	handle_stop_signal: use the cached p->signal value Cache the value of p->signal, and change the code to use while_each_thread() helper. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Oleg Nesterov	fc321d2e60	handle_stop_signal: unify partial/full stop handling Now that handle_stop_signal() doesn't drop ->siglock, we can't see both ->group_stop_count && SIGNAL_STOP_STOPPED. Merge two "if" branches. As Roland pointed out, we never actually needed 2 do_notify_parent_cldstop() calls. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Oleg Nesterov	6ca25b5513	kill_pid_info: don't take now unneeded tasklist_lock Previously handle_stop_signal(SIGCONT) could drop ->siglock. That is why kill_pid_info(SIGCONT) takes tasklist_lock to make sure the target task can't go away after unlock. Not needed now. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Oleg Nesterov	e442055193	signals: re-assign CLD_CONTINUED notification from the sender to reciever Based on discussion with Jiri and Roland. In short: currently handle_stop_signal(SIGCONT, p) sends the notification to p->parent, with this patch p itself notifies its parent when it becomes running. handle_stop_signal(SIGCONT) has to drop ->siglock temporary in order to notify the parent with do_notify_parent_cldstop(). This leads to multiple problems: - as Jiri Kosina pointed out, the stopped task can resume without actually seeing SIGCONT which may have a handler. - we race with another sig_kernel_stop() signal which may come in that window. - we race with sig_fatal() signals which may set SIGNAL_GROUP_EXIT in that window. - we can't avoid taking tasklist_lock() while sending SIGCONT. With this patch handle_stop_signal() just sets the new SIGNAL_CLD_CONTINUED flag in p->signal->flags and returns. The notification is sent by the first task which returns from finish_stop() (there should be at least one) or any other signalled thread from get_signal_to_deliver(). This is a user-visible change. Say, currently kill(SIGCONT, stopped_child) can't return without seeing SIGCHLD, with this patch SIGCHLD can be delayed unpredictably. Another difference is that if the child is ptraced by another process, CLD_CONTINUED may be delivered to ->real_parent after ptrace_detach() while currently it always goes to the tracer which doesn't actually need this notification. Hopefully not a problem. The patch asks for the futher obvious cleanups, I'll send them separately. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Oleg Nesterov	3b5e9e53c6	signals: cleanup security_task_kill() usage/implementation Every implementation of ->task_kill() does nothing when the signal comes from the kernel. This is correct, but means that check_kill_permission() should call security_task_kill() only for SI_FROMUSER() case, and we can remove the same check from ->task_kill() implementations. (sadly, check_kill_permission() is the last user of signal->session/__session but we can't s/task_session_nr/task_session/ here). NOTE: Eric W. Biederman pointed out cap_task_kill() should die, and I think he is very right. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: David Quigley <dpquigl@tycho.nsa.gov> Cc: Eric Paris <eparis@redhat.com> Cc: Harald Welte <laforge@gnumonks.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Pavel Emelyanov	9e3bd6c3fb	signals: consolidate send_sigqueue and send_group_sigqueue Both functions do the same thing after proper locking, but with different sigpending structs, so move the common code into a helper. After this we have 4 places that look very similar: send_sigqueue: calls do_send_sigqueue and signal_wakeup send_group_sigqueue: calls do_send_sigqueue and __group_complete_signal __group_send_sig_info: calls send_signal and __group_complete_signal specific_send_sig_info: calls send_signal and signal_wakeup Besides, send_signal performs actions similar to do_send_sigqueue's and __group_complete_signal - to signal_wakeup. It looks like they can be consolidated gracefully. Oleg said: Personally, I think this change is very good. But send_sigqueue() and send_group_sigqueue() have a very subtle difference which I was never able to understand. Let's suppose that sigqueue is already queued, and the signal is ignored (the latter means we should re-schedule cpu timer or handle overrruns). In that case send_sigqueue() returns 0, but send_group_sigqueue() returns 1. I think this is not the problem (in fact, I think this patch makes the behaviour more correct), but I hope Thomas can take a look and confirm. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Pavel Emelyanov	c5363d0363	signals: clean dequeue_signal from excess checks and assignments The signr variable may be declared without initialization - it is set ro the return value from __dequeue_signal() right at the function beginning. Besides, after recalc_sigpending() two checks for signr to be not 0 may be merged into one. Both if-s become easier to read. Thanks to Oleg for pointing out mistakes in the first version of this patch. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Pavel Emelyanov	93585eeaf3	signals: consolidate checks for whether or not to ignore a signal Both sig_ignored() and do_sigaction() check for signr to be explicitly or implicitly ignored. Introduce a helper for them. This patch is aimed to help handling signals by pid namespace's init, and was derived from one of Oleg's patches https://lists.linux-foundation.org/pipermail/containers/2007-December/009308.html so, if he doesn't mind, he should be considered as an author. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Oleg Nesterov	d6cf723a14	k_getrusage: don't take rcu_read_lock() Just a trivial example, more to come. k_getrusage() holds rcu_read_lock() because it was previously required by lock_task_sighand(). Unneeded now. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:34 -07:00
Oleg Nesterov	1406f2d321	lock_task_sighand: add rcu lock/unlock Most of the callers of lock_task_sighand() doesn't actually need rcu_lock(). lock_task_sighand() needs it only to safely play with tsk->sighand, it can take the lock itself. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Oleg Nesterov	bfc4b0890a	signals: do_group_exit(): use signal_group_exit() more consistently do_group_exit() checks SIGNAL_GROUP_EXIT to avoid taking sighand->siglock. Since `ed5d2cac11` exec() doesn't set this flag, we should use signal_group_exit(). This is not needed for correctness, but can speedup the multithreaded exec and makes the code more consistent. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Oleg Nesterov	573cf9ad72	signals: do_signal_stop(): use signal_group_exit() do_signal_stop() needs signal_group_exit() but checks sig->group_exit_task. This (optimization) is correct, SIGNAL_STOP_DEQUEUED and SIGNAL_GROUP_EXIT are mutually exclusive, but looks confusing. Use signal_group_exit(), this is not fastpath, the code clarity is more important. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Pavel Emelyanov	2acb024d55	signals: consolidate checking for ignored/legacy signals Two callers for send_signal() - the specific_send_sig_info and the __group_send_sig_info - both check for sig to be ignored or already queued. Move these checks into send_signal() and make it return 1 to indicate that the signal is dropped, but there's no error in this. Besides, merge comments and spell-check them. [oleg@tv-sign.ru: simplifications] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Pavel Emelyanov	af7fff9c13	signals: turn LEGACY_QUEUE macro into static inline function This makes the code more readable, due to less brackets and small letters in name. I also move it above the send_signal() as a preparation for the 3rd patch. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Pavel Emelyanov	e1401c6bbb	signals: remove unused variable from send_signal() This function doesn't change the ret's value and thus always returns 0, with a single exception of returning -EAGAIN explicitly. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-30 08:29:33 -07:00
Linus Torvalds	9781db7b34	Merge branch 'audit.b50' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b50' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] new predicate - AUDIT_FILETYPE [patch 2/2] Use find_task_by_vpid in audit code [patch 1/2] audit: let userspace fully control TTY input auditing [PATCH 2/2] audit: fix sparse shadowed variable warnings [PATCH 1/2] audit: move extern declarations to audit.h Audit: MAINTAINERS update Audit: increase the maximum length of the key field Audit: standardize string audit interfaces Audit: stop deadlock from signals under load Audit: save audit_backlog_limit audit messages in case auditd comes back Audit: collect sessionid in netlink messages Audit: end printk with newline	2008-04-29 11:41:22 -07:00
Linus Torvalds	bd5d435a96	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: Skip I/O merges when disabled block: add large command support block: replace sizeof(rq->cmd) with BLK_MAX_CDB ide: use blk_rq_init() to initialize the request block: use blk_rq_init() to initialize the request block: rename and export rq_init() block: no need to initialize rq->cmd with blk_get_request block: no need to initialize rq->cmd in prepare_flush_fn hook block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline block/elevator.c:elv_rq_merge_ok() mustn't be inline block: make queue flags non-atomic block: add dma alignment and padding support to blk_rq_map_kern unexport blk_max_pfn ps3disk: Remove superfluous cast block: make rq_init() do a full memset() relay: fix splice problem	2008-04-29 08:18:03 -07:00
Christoph Lameter	37487a5652	Add kbuild.h that contains common definitions for kbuild users The same definitions are used for the bounds logic and the asm-offsets.h generation by kbuild. Put them into include/linux/kbuild.h file. Also add a new feature COMMENT("text") which can be used to insert lines of ocmments into asm-offsets.h and bounds.h. Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Jay Estabrook <jay.estabrook@hp.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Richard Henderson <rth@twiddle.net> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Chris Zankel <chris@zankel.net> Cc: David S. Miller <davem@davemloft.net> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Bryan Wu <bryan.wu@analog.com> Cc: Mike Frysinger <vapier.adi@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Miles Bader <miles@gnu.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:29 -07:00
Masami Hiramatsu	68ab3d883a	relayfs: support larger relay buffer Use vmalloc() and memset() instead of kcalloc() to allocate a page* array when the array size is bigger than one page. This enables relayfs to support bigger relay buffers than 64MB on 4k-page system, 512MB on 16k-page system. [akpm@linux-foundation.org: cleanup] Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: David Wilder <dwilder@us.ibm.com> Reviewed-by: Tom Zanussi <zanussi@comcast.net> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:28 -07:00
Hirofumi Nakagawa	801678c5a3	Remove duplicated unlikely() in IS_ERR() Some drivers have duplicated unlikely() macros. IS_ERR() already has unlikely() in itself. This patch cleans up such pointless code. Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Jeff Garzik <jeff@garzik.org> Cc: Paul Clements <paul.clements@steeleye.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: David Brownell <david-b@pacbell.net> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Patrick McHardy <kaber@trash.net> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.de> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:25 -07:00
Pavel Emelyanov	d7321cd624	sysctl: add the ->permissions callback on the ctl_table_root When reading from/writing to some table, a root, which this table came from, may affect this table's permissions, depending on who is working with the table. The core hunk is at the bottom of this patch. All the rest is just pushing the ctl_table_root argument up to the sysctl_perm() function. This will be mostly (only?) used in the net sysctls. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:23 -07:00
Pavel Emelyanov	2c4c7155f2	sysctl: clean from unneeded extern and forward declarations The do_sysctl_strategy isn't used outside kernel/sysctl.c, so this can be static and without a prototype in header. Besides, move this one and parse_table() above their callers and drop the forward declarations of the latter call. One more "besides" - fix two checkpatch warnings: space before a ( and an extra space at the end of a line. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:23 -07:00
Holger Schurig	88f458e4b9	sysctl: allow embedded targets to disable sysctl_check.c Disable sysctl_check.c for embedded targets. This saves about about 11 kB in .text and another 11 kB in .data on a PXA255 embedded platform. Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:22 -07:00
Denis V. Lunev	c33fff0afb	kernel: use non-racy method for proc entries creation Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data be setup before gluing PDE to main tree. Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:22 -07:00
Alexey Dobriyan	c74c120a21	proc: remove proc_root from drivers Remove proc_root export. Creation and removal works well if parent PDE is supplied as NULL -- it worked always that way. So, one useless export removed and consistency added, some drivers created PDEs with &proc_root as parent but removed them as NULL and so on. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:18 -07:00
Matt Helsley	925d1c401f	procfs task exe symlink The kernel implements readlink of /proc/pid/exe by getting the file from the first executable VMA. Then the path to the file is reconstructed and reported as the result. Because of the VMA walk the code is slightly different on nommu systems. This patch avoids separate /proc/pid/exe code on nommu systems. Instead of walking the VMAs to find the first executable file-backed VMA we store a reference to the exec'd file in the mm_struct. That reference would prevent the filesystem holding the executable file from being unmounted even after unmapping the VMAs. So we track the number of VM_EXECUTABLE VMAs and drop the new reference when the last one is unmapped. This avoids pinning the mounted filesystem. [akpm@linux-foundation.org: improve comments] [yamamoto@valinux.co.jp: fix dup_mmap] Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Howells <dhowells@redhat.com> Cc:"Eric W. Biederman" <ebiederm@xmission.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:17 -07:00
David Howells	0b77f5bfb4	keys: make the keyring quotas controllable through /proc/sys Make the keyring quotas controllable through /proc/sys files: () /proc/sys/kernel/keys/root_maxkeys /proc/sys/kernel/keys/root_maxbytes Maximum number of keys that root may have and the maximum total number of bytes of data that root may have stored in those keys. () /proc/sys/kernel/keys/maxkeys /proc/sys/kernel/keys/maxbytes Maximum number of keys that each non-root user may have and the maximum total number of bytes of data that each of those users may have stored in their keys. Also increase the quotas as a number of people have been complaining that it's not big enough. I'm not sure that it's big enough now either, but on the other hand, it can now be set in /etc/sysctl.conf. Signed-off-by: David Howells <dhowells@redhat.com> Cc: <kwc@citi.umich.edu> Cc: <arunsr@cse.iitk.ac.in> Cc: <dwalsh@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:17 -07:00
David Howells	69664cf16a	keys: don't generate user and user session keyrings unless they're accessed Don't generate the per-UID user and user session keyrings unless they're explicitly accessed. This solves a problem during a login process whereby set*uid() is called before the SELinux PAM module, resulting in the per-UID keyrings having the wrong security labels. This also cures the problem of multiple per-UID keyrings sometimes appearing due to PAM modules (including pam_keyinit) setuiding and causing user_structs to come into and go out of existence whilst the session keyring pins the user keyring. This is achieved by first searching for extant per-UID keyrings before inventing new ones. The serial bound argument is also dropped from find_keyring_by_name() as it's not currently made use of (setting it to 0 disables the feature). Signed-off-by: David Howells <dhowells@redhat.com> Cc: <kwc@citi.umich.edu> Cc: <arunsr@cse.iitk.ac.in> Cc: <dwalsh@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:17 -07:00
Serge E. Hallyn	02fdb36ae7	ipc: sysvsem: refuse clone(CLONE_SYSVSEM\|CLONE_NEWIPC) CLONE_NEWIPC\|CLONE_SYSVSEM interaction isn't handled properly. This can cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing undo lists. Fix, part 3: refuse clone(CLONE_SYSVSEM\|CLONE_NEWIPC). With unshare, specifying CLONE_SYSVSEM means unshare the sysvsem. So it seems reasonable that CLONE_NEWIPC without CLONE_SYSVSEM would just imply CLONE_SYSVSEM. However with clone, specifying CLONE_SYSVSEM means share the sysvsem. So calling clone(CLONE_SYSVSEM\|CLONE_NEWIPC) is explicitly asking for something we can't allow. So return -EINVAL in that case. [akpm@linux-foundation.org: cleanups] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Manfred Spraul <manfred@colorfullife.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:14 -07:00
Manfred Spraul	6013f67fc1	ipc: sysvsem: force unshare(CLONE_SYSVSEM) when CLONE_NEWIPC sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing undo lists. Fix, part 2: perform an implicit CLONE_SYSVSEM in CLONE_NEWIPC. CLONE_NEWIPC creates a new IPC namespace, the task cannot access the existing semaphore arrays after the unshare syscall. Thus the task can/must detach from the existing undo list entries, too. This fixes the kernel corruption, because it makes it impossible that undo records from two different namespaces are in sysvsem.undo_list. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:14 -07:00
Manfred Spraul	9edff4ab1f	ipc: sysvsem: implement sys_unshare(CLONE_SYSVSEM) sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly, this can cause a kernel memory corruption. CLONE_NEWIPC must detach from the existing undo lists. Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM) The original reason to not support it was the potential (inevitable?) confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the inverse meaning of clone(CLONE_SYSVSEM). Our two most reasonable options then appear to be (1) fully support CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM, but always do it anyway on unshare(CLONE_SYSVSEM). This patch does (1). Changelog: Apr 16: SEH: switch to Manfred's alternative patch which removes the unshare_semundo() function which always refused CLONE_SYSVSEM. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:14 -07:00
Nadia Derbey	6546bc4279	ipc: re-enable msgmni automatic recomputing msgmni if set to negative The enhancement as asked for by Yasunori: if msgmni is set to a negative value, register it back into the ipcns notifier chain. A new interface has been added to the notification mechanism: notifier_chain_cond_register() registers a notifier block only if not already registered. With that new interface we avoid taking care of the states changes in procfs. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:13 -07:00
Oleg Nesterov	d2ba7e2ae2	simplify cpu_hotplug_begin()/put_online_cpus() cpu_hotplug_begin() must be always called under cpu_add_remove_lock, this means that only one process can be cpu_hotplug.active_writer. So we don't need the cpu_hotplug.writer_queue, we can wake up the ->active_writer directly. Also, fix the comment. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Dipankar Sarma <dipankar@in.ibm.com> Acked-by: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:11 -07:00
Oleg Nesterov	1e35eaa2d8	cleanup_workqueue_thread: remove the unneeded "cpu" parameter cleanup_workqueue_thread() doesn't need the second argument, remove it. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:11 -07:00
Oleg Nesterov	00dfcaf748	workqueues: shrink cpu_populated_map when CPU dies When cpu_populated_map was introduced, it was supposed that cwq->thread can survive after CPU_DEAD, that is why we never shrink cpu_populated_map. This is not very nice, we can safely remove the already dead CPU from the map. The only required change is that destroy_workqueue() must hold the hotplug lock until it destroys all cwq->thread's, to protect the cpu_populated_map. We could make the local copy of cpu mask and drop the lock, but sizeof(cpumask_t) may be very large. Also, fix the comment near queue_work(). Unless _cpu_down() happens we do guarantee the cpu-affinity of the work_struct, and we have users which rely on this. [akpm@linux-foundation.org: repair comment] Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:11 -07:00
Paul Menage	786083667e	Cpuset hardwall flag: add a mem_hardwall flag to cpusets This flag provides the hardwalling properties of mem_exclusive, without enforcing the exclusivity. Either mem_hardwall or mem_exclusive is sufficient to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned nodes. Signed-off-by: Paul Menage <menage@google.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:11 -07:00
Paul Menage	addf2c739d	Cpuset hardwall flag: switch cpusets to use the bulk cgroup_add_files() API Currently the cpusets mem_exclusive flag is overloaded to mean both "no-overlapping" and "no GFP_KERNEL allocations outside this cpuset". These patches add a new mem_hardwall flag with just the allocation restriction part of the mem_exclusive semantics, without breaking backwards-compatibility for those who continue to use just mem_exclusive. Additionally, the cgroup control file registration for cpusets is cleaned up to reduce boilerplate. This patch: This change tidies up the cpusets control file definitions, and reduces the amount of boilerplate required to add/change control files in the future. Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:11 -07:00
Adrian Bunk	9e0c914cab	kernel/cpuset.c: make 3 functions static Make the following needlessly global functions static: - cpuset_test_cpumask() - cpuset_change_cpumask() - cpuset_do_move_task() Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:11 -07:00
Pavel Emelyanov	c84872e168	memcgroup: add the max_usage member on the res_counter This field is the maximal value of the usage one since the counter creation (or since the latest reset). To reset this to the usage value simply write anything to the appropriate cgroup file. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Balbir Singh	cf475ad28a	cgroups: add an owner to the mm_struct Remove the mem_cgroup member from mm_struct and instead adds an owner. This approach was suggested by Paul Menage. The advantage of this approach is that, once the mm->owner is known, using the subsystem id, the cgroup can be determined. It also allows several control groups that are virtually grouped by mm_struct, to exist independent of the memory controller i.e., without adding mem_cgroup's for each controller, to mm_struct. A new config option CONFIG_MM_OWNER is added and the memory resource controller selects this config option. This patch also adds cgroup callbacks to notify subsystems when mm->owner changes. The mm_cgroup_changed callback is called with the task_lock() of the new task held and is called just prior to changing the mm->owner. I am indebted to Paul Menage for the several reviews of this patchset and helping me make it lighter and simpler. This patch was tested on a powerpc box, it was compiled with both the MM_OWNER config turned on and off. After the thread group leader exits, it's moved to init_css_state by cgroup_exit(), thus all future charges from runnings threads would be redirected to the init_css_set's subsystem. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: David Rientjes <rientjes@google.com>, Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Serge E. Hallyn	29486df325	cgroups: introduce cft->read_seq() Introduce a read_seq() helper in cftype, which uses seq_file to print out lists. Use it in the devices cgroup. Also split devices.allow into two files, so now devices.deny and devices.allow are the ones to use to manipulate the whitelist, while devices.list outputs the cgroup's current whitelist. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Li Zefan	28fd5dfc12	cgroups: remove the css_set linked-list Now we can run through the hash table instead of running through the linked-list. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Li Zefan	e8d55fdeb8	cgroups: simplify init_subsys() We are at system boot and there is only 1 cgroup group (i,e, init_css_set), so we don't need to run through the css_set linked list. Neither do we need to run through the task list, since no processes have been created yet. Also referring to a comment in cgroup.h: struct css_set { ... /* * Set of subsystem states, one for each subsystem. This array * is immutable after creation apart from the init_css_set * during subsystem registration (at boot time). / struct cgroup_subsys_state subsys[CGROUP_SUBSYS_COUNT]; } Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:10 -07:00
Li Zefan	472b1053f3	cgroups: use a hash table for css_set finding When we attach a process to a different cgroup, the css_set linked-list will be run through to find a suitable existing css_set to use. This patch implements a hash table for better performance. The following benchmarks have been tested: For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping task in each, and then move an additional task through each cgroup in turn. Here is a test result: N Loop orig - Time(s) hash - Time(s) ---------------------------------------------- 1 10000 1.201231728 1.196311177 5 2000 1.065743872 1.040566424 10 1000 0.991054735 0.986876440 50 200 0.976554203 0.969608733 100 100 0.998504680 0.969218270 500 20 1.157347764 0.962602963 1000 10 1.619521852 1.085140172 Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Pavel Emelyanov	d447ea2f30	cgroups: add the trigger callback to struct cftype Trigger callback can be used to receive a kick-up from the user space. The string written is ignored. The cftype->private is used for multiplexing events. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Paul Menage <menage@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Li Zefan	46ae220bea	cgroup: switch to proc_create() There is a race between create_proc_entry() and the assignment of file ops. proc_create() is invented to fix it. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Li Zefan	06a119204d	cgroup: annotate cgroup_init_subsys with __init It is called by cgroup_init() and cgroup_init_early() only, which are annotated with __init. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Paul Menage	06ecb27cfb	CGroups _s64 files: use read_s64/write_s64 in CFS cgroup for rt_runtime file This removes some filesystem boilerplate from the CFS cgroup subsystem. Signed-off-by: Paul Menage <menage@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Paul Menage	e73d2c61d1	CGroups _s64 files: add cgroups read_s64/write_s64 file methods These patches add cgroups read_s64 and write_s64 control file methods (the signed equivalent of read_u64/write_u64) and use them to implement the cpu.rt_runtime_us control file in the CFS cgroup subsystem. This patch: These are the signed equivalents of the read_u64/write_u64 methods Signed-off-by: Paul Menage <menage@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Paul Menage	3116f0e3df	CGroup API files: move "releasable" to cgroup_debug subsystem The "releasable" control file provided by the cgroup framework exports the state of a per-cgroup flag that's related to the notify-on-release feature. This isn't really generally useful, unless you're trying to debug this particular feature of cgroups. This patch moves the "releasable" file to the cgroup_debug subsystem. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:09 -07:00
Paul Menage	9179656961	CGroup API files: add cgroup map data type Adds a new type of supported control file representation, a map from strings to u64 values. Each map entry is printed as a line in a similar format to /proc/vmstat, i.e. "$key $value\n" Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:08 -07:00
Paul Menage	700fe1ab99	CGroup API files: update cpusets to use cgroup structured file API Many of the cpusets control files are simple integer values, which don't require the overhead of memory allocations for reads and writes. Move the handlers for these control files into cpuset_read_u64() and cpuset_write_u64(). [akpm@linux-foundation.org: ad dmissing `break'] Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:08 -07:00
Paul Menage	b7269dfc82	CGroup API files: strip all trailing whitespace in cgroup_write_u64 This removes the need for people to remember to pass the -n flag to echo when writing values to cgroup control files. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:08 -07:00
Paul Menage	2c7eabf376	CGroup API files: add res_counter_read_u64() Adds a function for returning the value of a resource counter member, in a form suitable for use in a cgroup read_u64 control file method. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:08 -07:00
Paul Menage	f4c753b7ea	CGroup API files: rename read/write_uint methods to read_write_u64 Several people have justifiably complained that the "_uint" suffix is inappropriate for functions that handle u64 values, so this patch just renames all these functions and their users to have the suffic _u64. [peterz@infradead.org: build fix] Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:07 -07:00
Adrian Bunk	3ff31d0cca	cgroups: kernel/ns_cgroup.c should #include <linux/nsproxy.h> Every file should include the headers containing the externs its global functions (in this case for ns_cgroup_clone()). Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:07 -07:00
Paul Jackson	4fe91d518e	cgroup: fix sparse warning of shadow symbol in cgroup.c Fix a code warning: symbol 'p' shadows an earlier one This is a reincarnation of Harvey Harrison's patch: cpuset: sparse warnings in cpuset.c Independently, Cliff Wickman moved the affected code, from kernel/cpuset.c to kernel/cgroup.c, in his patch: cpusets: update_cpumask revision Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Harvey Harrison <harvey.harrison@gmail.com> Cc: Cliff Wickman <cpw@sgi.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:07 -07:00
Adrian Bunk	3df91fe30a	make cgroup_enable_task_cg_lists() static Make the needlessly global cgroup_enable_task_cg_lists() static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:07 -07:00
Michael Halcrow	6a3fd92e73	eCryptfs: make key module subsystem respect namespaces Make eCryptfs key module subsystem respect namespaces. Since I will be removing the netlink interface in a future patch, I just made changes to the netlink.c code so that it will not break the build. With my recent patches, the kernel module currently defaults to the device handle interface rather than the netlink interface. [akpm@linux-foundation.org: export free_user_ns()] Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:07 -07:00
Dave Young	5f97a5a879	isolate ratelimit from printk.c for other use Due to the rcupreempt.h WARN_ON trigged, I got 2G syslog file. For some serious complaining of kernel, we need repeat the warnings, so here I isolate the ratelimit part of printk.c to a standalone file. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:06 -07:00
Robert P. J. Day	1aeb272cf0	kernel: explicitly include required header files under kernel/ Following an experimental deletion of the unnecessary directive #include <linux/slab.h> from the header file <linux/percpu.h>, these files under kernel/ were exposed as needing to include one of <linux/slab.h> or <linux/gfp.h>, so explicit includes were added where necessary. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:04 -07:00
Dmitry Adamushko	cbd9b67bd3	kthread: call wake_up_process() without the lock being held From the POV of synchronization, there should be no need to call wake_up_process() with the 'kthread_create_lock' being held. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:04 -07:00
Sam Ravnborg	f7b16c108f	cpu: fix section mismatch warning in reference to register_cpu_notifier Fix following warnings: WARNING: vmlinux.o(.text+0xc60): Section mismatch in reference from the function kvm_init() to the function .cpuinit.text:register_cpu_notifier() WARNING: vmlinux.o(.text+0x33869a): Section mismatch in reference from the function xfs_icsb_init_counters() to the function .cpuinit.text:register_cpu_notifier() WARNING: vmlinux.o(.text+0x5556a1): Section mismatch in reference from the function acpi_processor_install_hotplug_notify() to the function .cpuinit.text:register_cpu_notifier() WARNING: vmlinux.o(.text+0xfe6b28): Section mismatch in reference from the function cpufreq_register_driver() to the function .cpuinit.text:register_cpu_notifier() register_cpu_notifier() are only really defined when HOTPLUG_CPU is enabled. So references to the function are OK. Annotate it with __ref so we do not get warnings from callers and do not get warnings for the functions/data used by register_cpu_notifier(). Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:00 -07:00
Sam Ravnborg	514a20a5da	cpu: fix section mismatch warnings in *cpu_down Fix following warnings: WARNING: vmlinux.o(.text+0x75c8d): Section mismatch in reference from the function take_cpu_down() to the variable .cpuinit.data:cpu_chain WARNING: vmlinux.o(.text+0x75d2a): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain WARNING: vmlinux.o(.text+0x75d4d): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain WARNING: vmlinux.o(.text+0x75de4): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain WARNING: vmlinux.o(.text+0x75e33): Section mismatch in reference from the function _cpu_down() to the variable .cpuinit.data:cpu_chain cpu_down is only used from code surrounded by HOTPLUG_CPU so any references to __cpuinit is OK. Add a few __ref to tech modpost to ignore the references. This is just papering over the fact that the cpu hotplug code is fragile with respect to use of HOTPLUG_CPU and in many cases rely on __cpuinit to get rid of code when HOTPLUG_CPU is not enabled. For now this is the least invasive change. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:06:00 -07:00
Sam Ravnborg	9647155ffb	cpu: fix section mismatch warning in unregister_cpu_notifier Fix following warning: WARNING: vmlinux.o(.text+0x75f4e): Section mismatch in reference from the function unregister_cpu_notifier() to the variable .cpuinit.data:cpu_chain We know that unregister_cpu_notifier is using HOTPLUG_CPU stuff - so ignore these references. Annotating unregister_cpu_notifier had been another option but this caused far more warnings since not all callers were annotated __cpuinit. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:59 -07:00
Sripathi Kodi	679c9cd4ac	add RUSAGE_THREAD Add the RUSAGE_THREAD option for the getrusage system call. This is essentially Roland's patch from http://lkml.org/lkml/2008/1/18/589, but the line about RUSAGE_LWP line has been removed, as suggested by Ulrich and Christoph. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:59 -07:00
Nur Hussein	95b570c9ce	Taint kernel after WARN_ON(condition) The kernel is sent to tainted within the warn_on_slowpath() function, and whenever a warning occurs the new taint flag 'W' is set. This is useful to know if a warning occurred before a BUG by preserving the warning as a flag in the taint state. This does not work on architectures where WARN_ON has its own definition. These archs are: 1. s390 2. superh 3. avr32 4. parisc The maintainers of these architectures have been added in the Cc: list in this email to alert them to the situation. The documentation in oops-tracing.txt has been updated to include the new flag. Signed-off-by: Nur Hussein <nurhussein@gmail.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-29 08:05:59 -07:00
Tom Zanussi	c3270e577c	relay: fix splice problem Splice isn't always incrementing the ppos correctly, which broke relay splice. Signed-off-by: Tom Zanussi <zanussi@comcast.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-04-29 09:48:15 +02:00
Harvey Harrison	b331d259b1	kernel: fix integer as NULL pointer warnings kernel/cpuset.c:1268:52: warning: Using plain integer as NULL pointer kernel/pid_namespace.c:95:24: warning: Using plain integer as NULL pointer Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Reviewed-by: Paul Jackson <pj@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 17:29:18 -07:00
Roland McGrath	9d04d9280c	ptrace: conditionalize compat_ptrace_request My recent additions to compat_ptrace_request made it mandatory for CONFIG_COMPAT arch's to define copy_siginfo_from_user32. This broke some builds, though they all really should get cleaned up in that way. Since all the arch's that actually call compat_ptrace_request have now been cleaned up to use the generic compat_sys_ptrace, we can avoid the build problems on the crufty arch's by changing the conditionals on the definition. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 14:14:36 -07:00
Thomas Gleixner	0c96c5979a	hrtimer: raise softirq unlocked to avoid circular lock dependency The scheduler hrtimer bits in 2.6.25 introduced a circular lock dependency in a rare code path: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.25-sched-devel.git-x86-latest.git #19 ------------------------------------------------------- X/2980 is trying to acquire lock: (&rq->rq_lock_key#2){++..}, at: [<ffffffff80230146>] task_rq_lock+0x56/0xa0 but task is already holding lock: (&cpu_base->lock){++..}, at: [<ffffffff80257ae1>] lock_hrtimer_base+0x31/0x60 which lock already depends on the new lock. The scenario which leads to this is: posix-timer signal is delivered -> posix-timer is rearmed timer is already expired in hrtimer_enqueue() -> softirq is raised To prevent this we need to move the raise of the softirq out of the base->lock protected code path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-04-28 22:22:21 +02:00
Linus Torvalds	513694b5f9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: hrtimer: timeout too long when using HRTIMER_CB_SOFTIRQ	2008-04-28 09:36:40 -07:00
Andres Salomon	b6f448e99c	PM/gxfb: add hook to PM console layer that allows disabling of suspend VT switch Prior to suspend, we allocate and switch to a new VT; after suspend, we switch back to the original VT. This can be slow, and is completely unnecessary if the framebuffer we're using can restore video properly. This adds a hook that allows drivers to select whether or not to do this vt switch, and changes the gxfb driver to call this hook. It also adds a module param to gxfb to allow controlling of the vt switch (defaulting to no switch). (Note: I'm not convinced that console_sem is the best way to protect this, but we should probably have some form of locking..) [akpm@linux-foundation.org: build fix] Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Jordan Crouse <jordan.crouse@amd.com> Cc: "Antonino A. Daplas" <adaplas@pol.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:36 -07:00
Masami Hiramatsu	26b31c1908	kprobes: add (un)register_jprobes for batch registration Introduce unregister_/register_jprobes() for jprobe batch registration. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Masami Hiramatsu	4a296e07c3	kprobes: add (un)register_kretprobes for batch registration Introduce unregister_/register_kretprobes() for kretprobe batch registration. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Masami Hiramatsu	9861668f74	kprobes: add (un)register_kprobes for batch registration Introduce unregister_/register_kprobes() for kprobe batch registration. This can reduce waiting time for synchronized_sched() when a lot of probes have to be unregistered at once. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: David Miller <davem@davemloft.net> Cc: "Frank Ch. Eigler" <fche@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Srinivasa Ds	3d8d996e0c	kprobes: prevent probing of preempt_schedule() Prohibit users from probing preempt_schedule(). One way of prohibiting the user from probing functions is by marking such functions with __kprobes. But this method doesn't work for those functions, which are already marked to different section like preempt_schedule() (belongs to __sched section). So we use blacklist approach to refuse user from probing these functions. In blacklist approach we populate the blacklisted function's starting address and its size in kprobe_blacklist structure. Then we verify the user specified address against start and end of the blacklisted function. So any attempt to register probe on blacklisted functions will be rejected. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:32 -07:00
Andrew G. Morgan	3898b1b4eb	capabilities: implement per-process securebits Filesystem capability support makes it possible to do away with (set)uid-0 based privilege and use capabilities instead. That is, with filesystem support for capabilities but without this present patch, it is (conceptually) possible to manage a system with capabilities alone and never need to obtain privilege via (set)uid-0. Of course, conceptually isn't quite the same as currently possible since few user applications, certainly not enough to run a viable system, are currently prepared to leverage capabilities to exercise privilege. Further, many applications exist that may never get upgraded in this way, and the kernel will continue to want to support their setuid-0 base privilege needs. Where pure-capability applications evolve and replace setuid-0 binaries, it is desirable that there be a mechanisms by which they can contain their privilege. In addition to leveraging the per-process bounding and inheritable sets, this should include suppressing the privilege of the uid-0 superuser from the process' tree of children. The feature added by this patch can be leveraged to suppress the privilege associated with (set)uid-0. This suppression requires CAP_SETPCAP to initiate, and only immediately affects the 'current' process (it is inherited through fork()/exec()). This reimplementation differs significantly from the historical support for securebits which was system-wide, unwieldy and which has ultimately withered to a dead relic in the source of the modern kernel. With this patch applied a process, that is capable(CAP_SETPCAP), can now drop all legacy privilege (through uid=0) for itself and all subsequently fork()'d/exec()'d children with: prctl(PR_SET_SECUREBITS, 0x2f); This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is enabled at configure time. [akpm@linux-foundation.org: fix uninitialised var warning] [serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Paul Moore <paul.moore@hp.com> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:26 -07:00
Lee Schermerhorn	846a16bf0f	mempolicy: rename mpol_copy to mpol_dup This patch renames mpol_copy() to mpol_dup() because, well, that's what it does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an existing mempolicy, allocates a new one and copies the contents. In a later patch, I want to use the name mpol_copy() to copy the contents from one mempolicy to another like, e.g., strcpy() does for strings. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Lee Schermerhorn	f0be3d32b0	mempolicy: rename mpol_free to mpol_put This is a change that was requested some time ago by Mel Gorman. Makes sense to me, so here it is. Note: I retain the name "mpol_free_shared_policy()" because it actually does free the shared_policy, which is NOT a reference counted object. However, ... The mempolicy object[s] referenced by the shared_policy are reference counted, so mpol_put() is used to release the reference held by the shared_policy. The mempolicy might not be freed at this time, because some task attached to the shared object associated with the shared policy may be in the process of allocating a page based on the mempolicy. In that case, the task performing the allocation will hold a reference on the mempolicy, obtained via mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks holding such a reference have called mpol_put() for the mempolicy. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Ken'ichi Ohmichi	122c7a5905	vmcoreinfo: add page flags values Add some values of page flags to the vmcoreinfo data. The vmcoreinfo data has the minimum debugging information only for dump filtering. makedumpfile (dump filtering command) gets it to distinguish unnecessary pages, and makedumpfile creates a small dumpfile. An old makedumpfile (v1.2.4 or before) had assumed some values of page flags internally, and this implementation could not follow the change of these values. For example, Christoph Lameter is changing these values by the follwing patch: http://lkml.org/lkml/2008/2/29/463 So a new makedumpfile (v1.2.5) came to need these values and I created this patch to let the kernel output them. Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:23 -07:00
Christoph Lameter	97965478a6	mm: Get rid of __ZONE_COUNT It was used to compensate because MAX_NR_ZONES was not available to the #ifdefs. Export MAX_NR_ZONES via the new mechanism and get rid of __ZONE_COUNT. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:22 -07:00
Christoph Lameter	9223b4190f	pageflags: get rid of FLAGS_RESERVED NR_PAGEFLAGS specifies the number of page flags we are using. From that we can calculate the number of bits leftover that can be used for zone, node (and maybe the sections id). There is no need anymore for FLAGS_RESERVED if we use NR_PAGEFLAGS. Use the new methods to make NR_PAGEFLAGS available via the preprocessor. NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields. These field widths have to be available to the preprocessor. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: David Miller <davem@davemloft.net> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:21 -07:00
Christoph Lameter	1cdf25d704	kbuild: create a way to create preprocessor constants from C expressions The use of enums create constants that are not available to the preprocessor when building the kernel (f.e. MAX_NR_ZONES). Arch code already has a way to export constants calculated to the preprocessor through the asm-offsets.c file. Generate something similar for the core kernel through kbuild. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:21 -07:00
Mel Gorman	19770b3260	mm: filter based on a nodemask as well as a gfp_mask The MPOL_BIND policy creates a zonelist that is used for allocations controlled by that mempolicy. As the per-node zonelist is already being filtered based on a zone id, this patch adds a version of __alloc_pages() that takes a nodemask for further filtering. This eliminates the need for MPOL_BIND to create a custom zonelist. A positive benefit of this is that allocations using MPOL_BIND now use the local node's distance-ordered zonelist instead of a custom node-id-ordered zonelist. I.e., pages will be allocated from the closest allowed node with available memory. [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments] [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask] [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:19 -07:00
Mel Gorman	dd1a239f6f	mm: have zonelist contains structs with both a zone pointer and zone_idx Filtering zonelists requires very frequent use of zone_idx(). This is costly as it involves a lookup of another structure and a substraction operation. As the zone_idx is often required, it should be quickly accessible. The node idx could also be stored here if it was found that accessing zone->node is significant which may be the case on workloads where nodemasks are heavily used. This patch introduces a struct zoneref to store a zone pointer and a zone index. The zonelist then consists of an array of these struct zonerefs which are looked up as necessary. Helpers are given for accessing the zone index as well as the node index. [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers] [hugh@veritas.com: mm-have-zonelist: fix memcg ooms] [hugh@veritas.com: just return do_try_to_free_pages] [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-28 08:58:18 -07:00
Al Viro	8b67dca942	[PATCH] new predicate - AUDIT_FILETYPE Argument is S_IF... \| <index>, where index is normally 0 or 1. Triggers if chosen element of ctx->names[] is present and the mode of object in question matches the upper bits of argument. I.e. for things like "is the argument of that chmod a directory", etc. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:28:37 -04:00
Pavel Emelyanov	4a761b8c1d	[patch 2/2] Use find_task_by_vpid in audit code The pid to lookup a task by is passed inside audit code via netlink message. Thanks to Denis Lunev, netlink packets are now (since 2.6.24) _always_ processed in the context of the sending task. So this is correct to lookup the task with find_task_by_vpid() here. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:28:30 -04:00
Harvey Harrison	7719e437fa	[PATCH 2/2] audit: fix sparse shadowed variable warnings Use msglen as the identifier. kernel/audit.c:724:10: warning: symbol 'len' shadows an earlier one kernel/audit.c:575:8: originally declared here Don't use ino_f to check the inode field at the end of the functions. kernel/auditfilter.c:429:22: warning: symbol 'f' shadows an earlier one kernel/auditfilter.c:420:21: originally declared here kernel/auditfilter.c:542:22: warning: symbol 'f' shadows an earlier one kernel/auditfilter.c:529:21: originally declared here i always used as a counter for a for loop and initialized to zero before use. Eliminate the inner i variables. kernel/auditsc.c:1295:8: warning: symbol 'i' shadows an earlier one kernel/auditsc.c:1152:6: originally declared here kernel/auditsc.c:1320:7: warning: symbol 'i' shadows an earlier one kernel/auditsc.c:1152:6: originally declared here Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:28:17 -04:00
Harvey Harrison	c782f242f0	[PATCH 1/2] audit: move extern declarations to audit.h Leave audit_sig_{uid\|pid\|sid} protected by #ifdef CONFIG_AUDITSYSCALL. Noticed by sparse: kernel/audit.c:73:6: warning: symbol 'audit_ever_enabled' was not declared. Should it be static? kernel/audit.c💯8: warning: symbol 'audit_sig_uid' was not declared. Should it be static? kernel/audit.c:101:8: warning: symbol 'audit_sig_pid' was not declared. Should it be static? kernel/audit.c:102:6: warning: symbol 'audit_sig_sid' was not declared. Should it be static? kernel/audit.c:117:23: warning: symbol 'audit_ih' was not declared. Should it be static? kernel/auditfilter.c:78:18: warning: symbol 'audit_filter_list' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:28:04 -04:00
Eric Paris	b556f8ad58	Audit: standardize string audit interfaces This patch standardized the string auditing interfaces. No userspace changes will be visible and this is all just cleanup and consistancy work. We have the following string audit interfaces to use: void audit_log_n_hex(struct audit_buffer ab, const unsigned char buf, size_t len); void audit_log_n_string(struct audit_buffer ab, const char buf, size_t n); void audit_log_string(struct audit_buffer ab, const char buf); void audit_log_n_untrustedstring(struct audit_buffer ab, const char string, size_t n); void audit_log_untrustedstring(struct audit_buffer ab, const char string); This may be the first step to possibly fixing some of the issues that people have with the string output from the kernel audit system. But we still don't have an agreed upon solution to that problem. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:19:22 -04:00
Eric Paris	f09ac9db2a	Audit: stop deadlock from signals under load A deadlock is possible between kauditd and auditd under load if auditd receives a signal. When auditd receives a signal it sends a netlink message to the kernel asking for information about the sender of the signal. In that same context the audit system will attempt to send a netlink message back to the userspace auditd. If kauditd has already filled the socket buffer (see netlink_attachskb()) auditd will now put itself to sleep waiting for room to send the message. Since auditd is responsible for draining that socket we have a deadlock. The fix, since the response from the kernel does not need to be synchronous is to send the signal information back to auditd in a separate thread. And thus auditd can continue to drain the audit queue normally. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:19:13 -04:00
Eric Paris	f3d357b092	Audit: save audit_backlog_limit audit messages in case auditd comes back This patch causes the kernel audit subsystem to store up to audit_backlog_limit messages for use by auditd if it ever appears sometime in the future in userspace. This is useful to collect audit messages during bootup and even when auditd is stopped. This is NOT a reliable mechanism, it does not ever call audit_panic, nor should it. audit_log_lost()/audit_panic() are called during the normal delivery mechanism. The messages are still sent to printk/syslog as usual and if too many messages appear to be queued they will be silently discarded. I liked doing it by default, but this patch only uses the queue in question if it was booted with audit=1 or if the kernel was built enabling audit by default. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:19:04 -04:00
Eric Paris	2532386f48	Audit: collect sessionid in netlink messages Previously I added sessionid output to all audit messages where it was available but we still didn't know the sessionid of the sender of netlink messages. This patch adds that information to netlink messages so we can audit who sent netlink messages. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 06:18:03 -04:00
Eric Paris	436c405c7d	Audit: end printk with newline A couple of audit printk statements did not have a newline. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-28 04:45:07 -04:00
Bodo Stroesser	d7b41a24bf	hrtimer: timeout too long when using HRTIMER_CB_SOFTIRQ When using hrtimer with timer->cb_mode == HRTIMER_CB_SOFTIRQ in some cases the clockevent is not programmed. This happens, if: - a timer is rearmed while it's state is HRTIMER_STATE_CALLBACK - hrtimer_reprogram() returns -ETIME, when it is called after CALLBACK is finished. This occurs if the new timer->expires is in the past when CALLBACK is done. In this case, the timer needs to be removed from the tree and put onto the pending list again. The patch is against 2.6.22.5, but AFAICS, it is relevant for 2.6.25 also (in run_hrtimer_pending()). Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com> Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-27 18:26:43 +02:00
Carsten Otte	402b08622d	s390: KVM preparation: provide hook to enable pgstes in user pagetable The SIE instruction on s390 uses the 2nd half of the page table page to virtualize the storage keys of a guest. This patch offers the s390_enable_sie function, which reorganizes the page tables of a single-threaded process to reserve space in the page table: s390_enable_sie makes sure that the process is single threaded and then uses dup_mm to create a new mm with reorganized page tables. The old mm is freed and the process has now a page status extended field after every page table. Code that wants to exploit pgstes should SELECT CONFIG_PGSTE. This patch has a small common code hit, namely making dup_mm non-static. Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's review feedback. Now we do have the prototype for dup_mm in include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now call task_lock() to prevent race against ptrace modification of mm_users. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@qumranet.com>	2008-04-27 12:00:40 +03:00
Al Viro	50704516f3	Fix uninitialized 'copy' in unshare_files Arrgghhh... Sorry about that, I'd been sure I'd folded that one, but it actually got lost. Please apply - that breaks execve(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-26 09:24:31 -07:00
Linus Torvalds	bc84e0a160	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: [PATCH] sanitize locate_fd() [PATCH] sanitize unshare_files/reset_files_struct [PATCH] sanitize handling of shared descriptor tables in failing execve() [PATCH] close race in unshare_files() [PATCH] restore sane ->umount_begin() API cifs: timeout dfs automounts +little fix.	2008-04-25 19:05:55 -07:00
Linus Torvalds	6e18933f2b	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [PATCH] Build fix for CONFIG_NUMA=y && CONFIG_SMP=n [IA64] fix bootmem regression on Altix	2008-04-25 12:50:00 -07:00
Linus Torvalds	0b79dada97	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes: sched: fix share (re)distribution softlockup: fix NOHZ wakeup seqlock: livelock fix	2008-04-25 12:47:56 -07:00
Al Viro	3b1253880b	[PATCH] sanitize unshare_files/reset_files_struct * let unshare_files() give caller the displaced files_struct * don't bother with grabbing reference only to drop it in the caller if it hadn't been shared in the first place * in that form unshare_files() is trivially implemented via unshare_fd(), so we eliminate the duplicate logics in fork.c * reset_files_struct() is not just only called for current; it will break the system if somebody ever calls it for anything else (we can't modify ->files of somebody else). Lose the task_struct * argument. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-25 09:23:59 -04:00
Al Viro	fd8328be87	[PATCH] sanitize handling of shared descriptor tables in failing execve() * unshare_files() can fail; doing it after irreversible actions is wrong and de_thread() is certainly irreversible. * since we do it unconditionally anyway, we might as well do it in do_execve() and save ourselves the PITA in binfmt handlers, etc. * while we are at it, binfmt_som actually leaked files_struct on failure. As a side benefit, unshare_files(), put_files_struct() and reset_files_struct() become unexported. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-25 09:23:53 -04:00
Al Viro	6b335d9c80	[PATCH] close race in unshare_files() updating current->files requires task_lock Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-25 09:23:48 -04:00
David Miller	5a9d3225a0	sched: use alloc_bootmem() instead of alloc_bootmem_low() There is no guarantee that there is physical ram below 4GB, and in fact many boxes don't have exactly that. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-25 09:53:06 +02:00
Peter Zijlstra	3f5087a2ba	sched: fix share (re)distribution fix __aggregate_redistribute_shares() related lockup reported by David S. Miller. The problem this code tries to solve is 'accurately' calculating the 'fair' share of the group weight for each cpu. The current code falls back to a global group rebalance in case the sched_domain's span it looks at has no shares, but does have tasks. The reason it gets stuck here, is because its inherently racy - if someone steals the last task after we compute the agg->rq_weight, but before we rebalance, we'll never get out of the loop. We could of course go fix that, but while looking at this issue I found that this 'fallback' wasn't nearly as rare as I'd hoped it to be. In fact its quite common - and given it walks the whole machine, thats very bad. The new approach is simple (why didn't I think of it before?), we set the aggregate shares to the full task group weight, and each larger sched domain that encounters an aggregate shares larger than the weight, clips it (it already re-distributes anyway). This nicely converges to the desired global picture where the sum of all shares equals the task group weight. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-25 00:25:08 +02:00
Ingo Molnar	126e01bf92	softlockup: fix NOHZ wakeup David Miller reported: \|---------------> the following commit: \| commit `27ec440779` \| Author: Ingo Molnar <mingo@elte.hu> \| Date: Thu Feb 28 21:00:21 2008 +0100 \| \| sched: make cpu_clock() globally synchronous \| \| Alexey Zaytsev reported (and bisected) that the introduction of \| cpu_clock() in printk made the timestamps jump back and forth. \| \| Make cpu_clock() more reliable while still keeping it fast when it's \| called frequently. \| \| Signed-off-by: Ingo Molnar <mingo@elte.hu> causes watchdog triggers when a cpu exits NOHZ state when it has been there for >= the soft lockup threshold, for example here are some messages from a 128 cpu Niagara2 box: [ 168.106406] BUG: soft lockup - CPU#11 stuck for 128s! [dd:3239] [ 168.989592] BUG: soft lockup - CPU#21 stuck for 86s! [swapper:0] [ 168.999587] BUG: soft lockup - CPU#29 stuck for 91s! [make:4511] [ 168.999615] BUG: soft lockup - CPU#2 stuck for 85s! [swapper:0] [ 169.020514] BUG: soft lockup - CPU#37 stuck for 91s! [swapper:0] [ 169.020514] BUG: soft lockup - CPU#45 stuck for 91s! [sh:4515] [ 169.020515] BUG: soft lockup - CPU#69 stuck for 92s! [swapper:0] [ 169.020515] BUG: soft lockup - CPU#77 stuck for 92s! [swapper:0] [ 169.020515] BUG: soft lockup - CPU#61 stuck for 92s! [swapper:0] [ 169.112554] BUG: soft lockup - CPU#85 stuck for 92s! [swapper:0] [ 169.112554] BUG: soft lockup - CPU#101 stuck for 92s! [swapper:0] [ 169.112554] BUG: soft lockup - CPU#109 stuck for 92s! [swapper:0] [ 169.112554] BUG: soft lockup - CPU#117 stuck for 92s! [swapper:0] [ 169.171483] BUG: soft lockup - CPU#40 stuck for 80s! [dd:3239] [ 169.331483] BUG: soft lockup - CPU#13 stuck for 86s! [swapper:0] [ 169.351500] BUG: soft lockup - CPU#43 stuck for 101s! [dd:3239] [ 169.531482] BUG: soft lockup - CPU#9 stuck for 129s! [mkdir:4565] [ 169.595754] BUG: soft lockup - CPU#20 stuck for 93s! [swapper:0] [ 169.626787] BUG: soft lockup - CPU#52 stuck for 93s! [swapper:0] [ 169.626787] BUG: soft lockup - CPU#84 stuck for 92s! [swapper:0] [ 169.636812] BUG: soft lockup - CPU#116 stuck for 94s! [swapper:0] It's simple enough to trigger this by doing a 10 minute sleep after a fresh bootup then starting a parallel kernel build. I suspect this might be reintroducing a problem we've had and fixed before, see the thread: http://marc.info/?l=linux-kernel&m=119546414004065&w=2 <---------------\| touch the softlockup watchdog when exiting NOHZ state - we are obviously not locked up. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-25 00:25:08 +02:00
Mike Travis	03970f065d	[PATCH] Build fix for CONFIG_NUMA=y && CONFIG_SMP=n Regression caused by `434d53b00d` Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2008-04-24 14:40:29 -07:00
Russ Anderson	472613b961	[IA64] fix bootmem regression on Altix A recent change prevents SGI Altix from booting. This patch fixes the problem. The regresson was introduced in commit `434d53b00d` Signed-off-by: Russ Anderson <rja@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2008-04-24 14:21:21 -07:00
Linus Torvalds	b0d19a378a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: iwlwifi: Fix built-in compilation of iwlcore net: Unexport move_addr_to_{kernel,user} rt2x00: Select LEDS_CLASS. iwlwifi: Select LEDS_CLASS. leds: Do not guard NEW_LEDS with HAS_IOMEM [IPSEC]: Fix catch-22 with algorithm IDs above 31 time: Export set_normalized_timespec. tcp: Make use of before macro in tcp_input.c hamradio: Remove unneeded and deprecated cli()/sti() calls in dmascc.c [NETNS]: Remove empty ->init callback. [DCCP]: Convert do_gettimeofday() to getnstimeofday(). [NETNS]: Don't initialize err variable twice. [NETNS]: The ip6_fib_timer can work with garbage on net namespace stop. [IPV4]: Convert do_gettimeofday() to getnstimeofday(). [IPV4]: Make icmp_sk_init() static. [IPV6]: Make struct ip6_prohibit_entry_template static. tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c [NET]: Expose netdevice dev_id through sysfs skbuff: fix missing kernel-doc notation [ROSE]: Fix soft lockup wrt. rose_node_list_lock	2008-04-23 12:23:45 -07:00
Linus Torvalds	94bc891b00	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: [PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct() [PATCH] proc_readfd_common() race fix [PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe() [PATCH] teach seq_file to discard entries [PATCH] umount_tree() will unhash everything itself [PATCH] get rid of more nameidata passing in namespace.c [PATCH] switch a bunch of LSM hooks from nameidata to path [PATCH] lock exclusively in collect_mounts() and drop_collected_mounts() [PATCH] move a bunch of declarations to fs/internal.h	2008-04-22 18:28:34 -07:00
Al Viro	1ec7f1ddbe	[PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct() The only reason to have separated __...() for those was to keep them inlined for local users in exit.c. Since Alexey removed the inline on those, there's no reason whatsoever to keep them around; just collapse with normal variants. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-04-22 19:55:09 -04:00
Randy Dunlap	73486722b7	kernel-doc: fix sched.c missing parameter Add missing kernel-doc in kernel/sched.c: Warning(linux-2.6.25-git3//kernel/sched.c:7044): No description found for parameter 'span' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-22 13:48:02 -07:00
YOSHIFUJI Hideaki	7c3f944e29	time: Export set_normalized_timespec. Sorry I have just realized set_normalized_timespec() (used in timespec_sub()) is not exported, and link will fail because of it... Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-04-21 19:45:12 -07:00
Linus Torvalds	e9b62693ae	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial: (24 commits) DOC: A couple corrections and clarifications in USB doc. Generate a slightly more informative error msg for bad HZ fix typo "is" -> "if" in Makefile ext*: spelling fix prefered -> preferred DOCUMENTATION: Use newer DEFINE_SPINLOCK macro in docs. KEYS: Fix the comment to match the file name in rxrpc-type.h. RAID: remove trailing space from printk line DMA engine: typo fixes Remove unused MAX_NODES_SHIFT MAINTAINERS: Clarify access to OCFS2 development mailing list. V4L: Storage class should be before const qualifier (sn9c102) V4L: Storage class should be before const qualifier sonypi: Storage class should be before const qualifier intel_menlow: Storage class should be before const qualifier DVB: Storage class should be before const qualifier arm: Storage class should be before const qualifier ALSA: Storage class should be before const qualifier acpi: Storage class should be before const qualifier firmware_sample_driver.c: fix coding style MAINTAINERS: Add ati_remote2 driver ... Fixed up trivial conflicts in firmware_sample_driver.c	2008-04-21 16:36:46 -07:00
Linus Torvalds	bda0c0afa7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/pci-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/pci-2.6: (42 commits) PCI: Change PCI subsystem MAINTAINER PCI: pci-iommu-iotlb-flushing-speedup PCI: pci_setup_bridge() mustn't be __devinit PCI: pci_bus_size_cardbus() mustn't be __devinit PCI: pci_scan_device() mustn't be __devinit PCI: pci_alloc_child_bus() mustn't be __devinit PCI: replace remaining __FUNCTION__ occurrences PCI: Hotplug: fakephp: Return success, not ENODEV, when bus rescan is triggered PCI: Hotplug: Fix leaks in IBM Hot Plug Controller Driver - ibmphp_init_devno() PCI: clean up resource alignment management PCI: aerdrv_acpi.c: remove unneeded NULL check PCI: Update VIA CX700 quirk PCI: Expose PCI VPD through sysfs PCI: iommu: iotlb flushing PCI: simplify quirk debug output PCI: iova RB tree setup tweak PCI: parisc: use generic pci_enable_resources() PCI: ppc: use generic pci_enable_resources() PCI: powerpc: use generic pci_enable_resources() PCI: ia64: use generic pci_enable_resources() ...	2008-04-21 15:58:35 -07:00
Roland McGrath	e16b278164	ptrace: compat_ptrace_request siginfo This adds support for PTRACE_GETSIGINFO and PTRACE_SETSIGINFO in compat_ptrace_request. It relies on existing arch definitions for copy_siginfo_to_user32 and copy_siginfo_from_user32. On powerpc, this fixes a longstanding regression of 32-bit ptrace calls on 64-bit kernels vs native calls (64-bit calls or 32-bit kernels). This can be seen in a 32-bit call using PTRACE_GETSIGINFO to examine e.g. siginfo_t.si_addr from a signal that sets it. (This was broken as of 2.6.24 and, I presume, many or all prior versions.) Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-21 15:53:41 -07:00
Linus Torvalds	5dfeaef895	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: hrtimer: optimize the softirq time optimization hrtimer: reduce calls to hrtimer_get_softirq_time() clockevents: fix typo in tick-broadcast.c jiffies: add time_is_after_jiffies and others which compare with jiffies	2008-04-21 15:43:43 -07:00
Linus Torvalds	429f731dea	Merge branch 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: Deprecate the asm/semaphore.h files in feature-removal-schedule. Convert asm/semaphore.h users to linux/semaphore.h security: Remove unnecessary inclusions of asm/semaphore.h lib: Remove unnecessary inclusions of asm/semaphore.h kernel: Remove unnecessary inclusions of asm/semaphore.h include: Remove unnecessary inclusions of asm/semaphore.h fs: Remove unnecessary inclusions of asm/semaphore.h drivers: Remove unnecessary inclusions of asm/semaphore.h net: Remove unnecessary inclusions of asm/semaphore.h arch: Remove unnecessary inclusions of asm/semaphore.h	2008-04-21 15:41:27 -07:00
Linus Torvalds	ec965350bb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel: (62 commits) sched: build fix sched: better rt-group documentation sched: features fix sched: /debug/sched_features sched: add SCHED_FEAT_DEADLINE sched: debug: show a weight tree sched: fair: weight calculations sched: fair-group: de-couple load-balancing from the rb-trees sched: fair-group scheduling vs latency sched: rt-group: optimize dequeue_rt_stack sched: debug: add some debug code to handle the full hierarchy sched: fair-group: SMP-nice for group scheduling sched, cpuset: customize sched domains, core sched, cpuset: customize sched domains, docs sched: prepatory code movement sched: rt: multi level group constraints sched: task_group hierarchy sched: fix the task_group hierarchy for UID grouping sched: allow the group scheduler to have multiple levels sched: mix tasks and groups ...	2008-04-21 15:40:24 -07:00
Pavel Machek	f5264481c8	trivial: small cleanups These are small cleanups all over the tree. Trivial style and comment changes to fs/select.c, kernel/signal.c, kernel/stop_machine.c & mm/pdflush.c Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>	2008-04-21 22:15:06 +00:00
Thomas Gleixner	259aae864c	hrtimer: optimize the softirq time optimization The previous optimization did not take the case into account where a clock provides its own softirq_get_time() function. Check for the availablitiy of the clock get time function first and then check if we need to retrieve the time for both clocks via hrtimer_softirq_gettime() to avoid a double evaluation of time in that case as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-21 07:59:51 +02:00
Dimitri Sivanich	833883d9ac	hrtimer: reduce calls to hrtimer_get_softirq_time() It seems that hrtimer_run_queues() is calling hrtimer_get_softirq_time() more often than it needs to. This can cause frequent contention on systems with large numbers of processors/cores. With this patch, hrtimer_run_queues only calls hrtimer_get_softirq_time() if there is a pending timer in one of the hrtimer bases, and only once. This also combines hrtimer_run_queues() and the inline run_hrtimer_queue() into one function. [ tglx@linutronix.de: coding style ] Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-21 07:59:51 +02:00
Glauber Costa	833df317f9	clockevents: fix typo in tick-broadcast.c braodcast -> broadcast Signed-off-by: Glauber Costa <gcosta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-21 07:59:51 +02:00
Ivan Kokshaysky	884525655d	PCI: clean up resource alignment management Done per Linus' request and suggestions. Linus has explained that better than I'll be able to explain: On Thu, Mar 27, 2008 at 10:12:10AM -0700, Linus Torvalds wrote: > Actually, before we go any further, there might be a less intrusive > alternative: add just a couple of flags to the resource flags field (we > still have something like 8 unused bits on 32-bit), and use those to > implement a generic "resource_alignment()" routine. > > Two flags would do it: > > - IORESOURCE_SIZEALIGN: size indicates alignment (regular PCI device > resources) > > - IORESOURCE_STARTALIGN: start field is alignment (PCI bus resources > during probing) > > and then the case of both flags zero (or both bits set) would actually be > "invalid", and we would also clear the IORESOURCE_STARTALIGN flag when we > actually allocate the resource (so that we don't use the "start" field as > alignment incorrectly when it no longer indicates alignment). > > That wouldn't be totally generic, but it would have the nice property of > automatically at least add sanity checking for that whole "res->start has > the odd meaning of 'alignment' during probing" and remove the need for a > new field, and it would allow us to have a generic "resource_alignment()" > routine that just gets a resource pointer. Besides, I removed IORESOURCE_BUS_HAS_VGA flag which was unused for ages. Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Gary Hade <garyhade@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-04-20 21:47:08 -07:00
Ingo Molnar	486fdae214	sched: build fix Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:01 +02:00
Ingo Molnar	c24b7c5244	sched: features fix Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	f00b45c145	sched: /debug/sched_features provide a text based interface to the scheduler features; this saves the 'user' from setting bits using decimal arithmetic. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Ingo Molnar	06379aba52	sched: add SCHED_FEAT_DEADLINE unused at the moment. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	7ba2e74ab5	sched: debug: show a weight tree Print a tree of weights. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	8f1bc385cf	sched: fair: weight calculations In order to level the hierarchy, we need to calculate load based on the root view. That is, each task's load is in the same unit. A / \ B 1 / \ 2 3 To compute 1's load we do: weight(1) -------------- rq_weight(A) To compute 2's load we do: weight(2) weight(B) ------------ * ----------- rq_weight(B) rw_weight(A) This yields load fractions in comparable units. The consequence is that it changes virtual time. We used to have: time_{i} vtime_{i} = ------------ weight_{i} vtime = \Sum vtime_{i} = time / rq_weight. But with the new way of load calculation we get that vtime equals time. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	4a55bd5e97	sched: fair-group: de-couple load-balancing from the rb-trees De-couple load-balancing from the rb-trees, so that I can change their organization. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	ac884dec6d	sched: fair-group scheduling vs latency Currently FAIR_GROUP sched grows the scheduler latency outside of sysctl_sched_latency, invert this so it stays within. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	58d6c2d72f	sched: rt-group: optimize dequeue_rt_stack Now that the group hierarchy can have an arbitrary depth the O(n^2) nature of RT task dequeues will really hurt. Optimize this by providing space to store the tree path, so we can walk it the other way. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	d19ca30874	sched: debug: add some debug code to handle the full hierarchy Add some extra debug output so we can get a better overview of the full hierarchy. We print the cgroup path after each cfs_rq, so we can see what group we're looking at. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	18d95a2832	sched: fair-group: SMP-nice for group scheduling Implement SMP nice support for the full group hierarchy. On each load-balance action, compile a sched_domain wide view of the full task_group tree. We compute the domain wide view when walking down the hierarchy, and readjust the weights when walking back up. After collecting and readjusting the domain wide view, we try to balance the tasks within the task_groups. The current approach is a naively balance each task group until we've moved the targeted amount of load. Inspired by Srivatsa Vaddsgiri's previous code and Abhishek Chandra's H-SMP paper. XXX: there will be some numerical issues due to the limited nature of SCHED_LOAD_SCALE wrt to representing a task_groups influence on the total weight. When the tree is deep enough, or the task weight small enough, we'll run out of bits. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Abhishek Chandra <chandra@cs.umn.edu> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Hidetoshi Seto	1d3504fcf5	sched, cpuset: customize sched domains, core [rebased for sched-devel/latest] - Add a new cpuset file, having levels: sched_relax_domain_level - Modify partition_sched_domains() and build_sched_domains() to take attributes parameter passed from cpuset. - Fill newidle_idx for node domains which currently unused but might be required if sched_relax_domain_level become higher. - We can change the default level by boot option 'relax_domain_level='. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	b758149c02	sched: prepatory code movement Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	b40b2e8eb5	sched: rt: multi level group constraints multi level rt constraints Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	f473aa5e02	sched: task_group hierarchy Add the full parent<->child relation thing into task_groups as well. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	eff766a65c	sched: fix the task_group hierarchy for UID grouping UID grouping doesn't actually have a task_group representing the root of the task_group tree. Add one. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Dhaval Giani	ec7dc8ac73	sched: allow the group scheduler to have multiple levels This patch makes the group scheduler multi hierarchy aware. [a.p.zijlstra@chello.nl: rt-parts and assorted fixes] Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Dhaval Giani	354d60c2ff	sched: mix tasks and groups This patch allows tasks and groups to exist in the same cfs_rq. With this change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N) fairness model where M tasks and N groups exist at the cfs_rq level. [a.p.zijlstra@chello.nl: rt bits and assorted fixes] Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Ingo Molnar	ea736ed5d3	sched: fix checks Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Peter Zijlstra	112f53f5d7	sched: old sleeper bonus Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Mike Travis	cd8ba7cd9b	sched: add new set_cpus_allowed_ptr function Add a new function that accepts a pointer to the "newly allowed cpus" cpumask argument. int set_cpus_allowed_ptr(struct task_struct p, const cpumask_t new_mask) The current set_cpus_allowed() function is modified to use the above but this does not result in an ABI change. And with some compiler optimization help, it may not introduce any additional overhead. Additionally, to enforce the read only nature of the new_mask arg, the "const" property is migrated to sub-functions called by set_cpus_allowed. This silences compiler warnings. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Mike Travis	e0982e90cd	init: move setup of nr_cpu_ids to as early as possible Move the setting of nr_cpu_ids from sched_init() to start_kernel() so that it's available as early as possible. Note that an arch has the option of setting it even earlier if need be, but it should not result in a different value than the setup_nr_cpu_ids() function. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Mike Travis	4bdbaad33d	sched: remove another cpumask_t variable from stack * Remove another cpumask_t variable from stack that was missed in the last kernel_sched_c updates. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Mike Travis	39106dcf85	cpumask: use new cpus_scnprintf function * Cleaned up references to cpumask_scnprintf() and added new cpulist_scnprintf() interfaces where appropriate. * Fix some small bugs (or code efficiency improvments) for various uses of cpumask_scnprintf. * Clean up some checkpatch errors. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Mike Travis	7c16ec585c	cpumask: reduce stack usage in SD_x_INIT initializers * Remove empty cpumask_t (and all non-zero/non-null) variables in SD__INIT macros. Use memset(0) to clear. Also, don't inline the initializer functions to save on stack space in build_sched_domains(). Merge change to include/linux/topology.h that uses the new node_to_cpumask_ptr function in the nr_cpus_node macro into this patch. Depends on: [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch [sched-devel]: sched: add new set_cpus_allowed_ptr function Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Mike Travis	c5f59f0833	nodemask: use new node_to_cpumask_ptr function * Use new node_to_cpumask_ptr. This creates a pointer to the cpumask for a given node. This definition is in mm patch: asm-generic-add-node_to_cpumask_ptr-macro.patch * Use new set_cpus_allowed_ptr function. Depends on: [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch [sched-devel]: sched: add new set_cpus_allowed_ptr function [x86/latest]: x86: add cpus_scnprintf function Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Greg Banks <gnb@melbourne.sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Mike Travis	b53e921ba1	generic: reduce stack pressure in sched_affinity * Modify sched_affinity functions to pass cpumask_t variables by reference instead of by value. * Use new set_cpus_allowed_ptr function. Depends on: [sched-devel]: sched: add new set_cpus_allowed_ptr function Cc: Paul Jackson <pj@sgi.com> Cc: Cliff Wickman <cpw@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Mike Travis	f9a86fcbbb	cpuset: modify cpuset_set_cpus_allowed to use cpumask pointer * Modify cpuset_cpus_allowed to return the currently allowed cpuset via a pointer argument instead of as the function return value. * Use new set_cpus_allowed_ptr function. * Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses. Depends on: [sched-devel]: sched: add new set_cpus_allowed_ptr function Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Mike Travis	f70316dace	generic: use new set_cpus_allowed_ptr function * Use new set_cpus_allowed_ptr() function added by previous patch, which instead of passing the "newly allowed cpus" cpumask_t arg by value, pass it by pointer: -int set_cpus_allowed(struct task_struct p, cpumask_t new_mask) +int set_cpus_allowed_ptr(struct task_struct p, const cpumask_t new_mask) Modify CPU_MASK_ALL Depends on: [sched-devel]: sched: add new set_cpus_allowed_ptr function Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Mike Travis	434d53b00d	sched: remove fixed NR_CPUS sized arrays in kernel_sched_c * Change fixed size arrays to per_cpu variables or dynamically allocated arrays in sched_init() and sched_init_smp(). (1) static struct sched_entity init_sched_entity_p[NR_CPUS]; (1) static struct cfs_rq init_cfs_rq_p[NR_CPUS]; (1) static struct sched_rt_entity init_sched_rt_entity_p[NR_CPUS]; (1) static struct rt_rq init_rt_rq_p[NR_CPUS]; static struct sched_group *sched_group_nodes_bycpu[NR_CPUS]; (1) - these arrays are allocated via alloc_bootmem_low() Change sched_domain_debug_one() to use cpulist_scnprintf instead of cpumask_scnprintf. This reduces the output buffer required and improves readability when large NR_CPU count machines arrive. * In sched_create_group() we allocate new arrays based on nr_cpu_ids. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Mike Travis	d366f8cbc1	cpumask: Cleanup more uses of CPU_MASK and NODE_MASK * Replace usages of CPU_MASK_NONE, CPU_MASK_ALL, NODE_MASK_NONE, NODE_MASK_ALL to reduce stack requirements for large NR_CPUS and MAXNODES counts. * In some cases, the cpumask variable was initialized but then overwritten with another value. This is the case for changes like this: - cpumask_t oldmask = CPU_MASK_ALL; + cpumask_t oldmask; Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Gregory Haskins	9f0e738f49	sched: fix cpus_allowed settings Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Dhaval Giani	0297b80339	sched: allow cpuacct stats to be reset Currently the schedstats implementation does not allow the statistics to be reset. This patch aims to allow that. echo 0 > cpuacct.usage resets the usage. Any other value is not allowed and returns -EINVAL. Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Dhaval Giani	32cd756a80	sched: cleanup cpuacct variable names Change the variable names to the common convention for the cpuacct subsystem. Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Olof Johansson	48f20a9a94	tasklets: execute tasklets in the same order they were queued I noticed this when looking at an openswan issue. Openswan (ab?)uses the tasklet API to defer processing of packets in some situations, with one packet per tasklet_action(). I started noticing sequences of backwards-ordered sequence numbers coming over the wire, since new tasklets are always queued at the head of the list but processed sequentially. Convert it to instead append new entries to the tail of the list. As an extra bonus, the splicing code in takeover_tasklets() no longer has to iterate over the list. Signed-off-by: Olof Johansson <olof@lixom.net> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Peter Zijlstra	ac086bc229	sched: rt-group: smp balancing Currently the rt group scheduling does a per cpu runtime limit, however the rt load balancer makes no guarantees about an equal spread of real- time tasks, just that at any one time, the highest priority tasks run. Solve this by making the runtime limit a global property by borrowing excessive runtime from the other cpus once the local limit runs out. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:58 +02:00
Peter Zijlstra	d0b27fa778	sched: rt-group: synchonised bandwidth period Various SMP balancing algorithms require that the bandwidth period run in sync. Possible improvements are moving the rt_bandwidth thing into root_domain and keeping a span per rt_bandwidth which marks throttled cpus. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Peter Zijlstra	79b3feffb1	sched: fix regression with sched yield Balbir Singh reported: > 1:mon> t > [c0000000e7677da0] c000000000067de0 .sys_sched_yield+0x6c/0xbc > [c0000000e7677e30] c000000000008748 syscall_exit+0x0/0x40 > --- Exception: c01 (System Call) at 00000400001d09e4 > SP (4000664cb10) is in userspace > 1:mon> r > cpu 0x1: Vector: 300 (Data Access) at [c0000000e7677aa0] > pc: c000000000068e50: .yield_task_fair+0x94/0xc4 > lr: c000000000067de0: .sys_sched_yield+0x6c/0xbc the check that should have avoided that is: /* * Are we the only task in the tree? */ if (unlikely(rq->load.weight == curr->se.load.weight)) return; But I guess that overlooks rt tasks, they also increase the load. So I guess something like this ought to fix it.. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Dmitry Adamushko	19fb518c2a	latencytop: optimize LT_BACKTRACEDEPTH loops a bit There is no need to loop any longer when 'same == 0'. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	50df5d6aea	sched: remove sysctl_sched_batch_wakeup_granularity it's unused. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	02e2b83bd2	sched: reenable sync wakeups Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	d25ce4cd49	sched: cache hot buddy Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	1fc8afa4c8	sched: feat affine wakeups Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	b85d066726	sched: introduce SCHED_FEAT_SYNC_WAKEUPS, turn it off turn off sync wakeups by default. They are not needed anymore - the buddy logic should be smart enough to keep the system from overscheduling. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Peter Zijlstra	0bbd3336ee	sched: fix wakeup granularity for buddies The wakeup buddy logic didn't use the same wakeup granularity logic as the wakeup preemption did, this might cause the ->next buddy to be selected past the point where we would have preempted had the task been a single running instance. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Guillaume Chazarain	15934a3732	sched: fix rq->clock overflows detection with CONFIG_NO_HZ When using CONFIG_NO_HZ, rq->tick_timestamp is not updated every TICK_NSEC. We check that the number of skipped ticks matches the clock jump seen in __update_rq_clock(). Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Reynes Philippe	30914a58af	sched: sched.c needs tick.h kernel/sched.c:506: erreur: implicit declaration of function tick_get_tick_sched kernel/sched.c:506: erreur: invalid type argument of -> kernel/sched.c:506: erreur: NOHZ_MODE_INACTIVE undeclared (first use in this function) kernel/sched.c:506: erreur: (Each undeclared identifier is reported only once kernel/sched.c:506: erreur: for each function it appears in.) Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	27ec440779	sched: make cpu_clock() globally synchronous Alexey Zaytsev reported (and bisected) that the introduction of cpu_clock() in printk made the timestamps jump back and forth. Make cpu_clock() more reliable while still keeping it fast when it's called frequently. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	018d6db4cb	sched: re-do "sched: fix fair sleepers" re-apply: \| commit `e22ecef1d2` \| Author: Ingo Molnar <mingo@elte.hu> \| Date: Fri Mar 14 22:16:08 2008 +0100 \| \| sched: fix fair sleepers \| \| Fair sleepers need to scale their latency target down by runqueue \| weight. Otherwise busy systems will gain ever larger sleep bonus. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Suresh Siddha	2adee9b30d	x86: fpu xstate split fix Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-19 19:19:55 +02:00
Suresh Siddha	61c4628b53	x86, fpu: split FPU state from task struct - v5 Split the FPU save area from the task struct. This allows easy migration of FPU context, and it's generally cleaner. It also allows the following two optimizations: 1) only allocate when the application actually uses FPU, so in the first lazy FPU trap. This could save memory for non-fpu using apps. Next patch does this lazy allocation. 2) allocate the right size for the actual cpu rather than 512 bytes always. Patches enabling xsave/xrstor support (coming shortly) will take advantage of this. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-19 19:19:55 +02:00
Thomas Gleixner	d8bb6f4c16	x86: tsc prevent time going backwards We already catch most of the TSC problems by sanity checks, but there is a subtle bug which has been in the code forever. This can cause time jumps in the range of hours. This was reported in: http://lkml.org/lkml/2007/8/23/96 and http://lkml.org/lkml/2008/3/31/23 I was able to reproduce the problem with a gettimeofday loop test on a dual core and a quad core machine which both have sychronized TSCs. The TSCs seems not to be perfectly in sync though, but the kernel is not able to detect the slight delta in the sync check. Still there exists an extremly small window where this delta can be observed with a real big time jump. So far I was only able to reproduce this with the vsyscall gettimeofday implementation, but in theory this might be observable with the syscall based version as well. CPU 0 updates the clock source variables under xtime/vyscall lock and CPU1, where the TSC is slighty behind CPU0, is reading the time right after the seqlock was unlocked. The clocksource reference data was updated with the TSC from CPU0 and the value which is read from TSC on CPU1 is less than the reference data. This results in a huge delta value due to the unsigned subtraction of the TSC value and the reference value. This algorithm can not be changed due to the support of wrapping clock sources like pm timer. The huge delta is converted to nanoseconds and added to xtime, which is then observable by the caller. The next gettimeofday call on CPU1 will show the correct time again as now the TSC has advanced above the reference value. To prevent this TSC specific wreckage we need to compare the TSC value against the reference value and return the latter when it is larger than the actual TSC value. I pondered to mark the TSC unstable when the readout is smaller than the reference value, but this would render an otherwise good and fast clocksource unusable without a real good reason. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:19:55 +02:00
Erik Bosman	8fb402bccf	generic, x86: add prctl commands PR_GET_TSC and PR_SET_TSC This patch adds prctl commands that make it possible to deny the execution of timestamp counters in userspace. If this is not implemented on a specific architecture, prctl will return -EINVAL. ned-off-by: Erik Bosman <ejbosman@cs.vu.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-19 19:19:55 +02:00
Matthew Wilcox	a655020753	kernel: Remove unnecessary inclusions of asm/semaphore.h None of these files use any of the functionality promised by asm/semaphore.h. Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2008-04-18 22:17:04 -04:00
Ahmed S. Darwish	04305e4aff	Audit: Final renamings and cleanup Rename the se_str and se_rule audit fields elements to lsm_str and lsm_rule to avoid confusion. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: James Morris <jmorris@namei.org>	2008-04-19 09:59:43 +10:00
Ahmed S. Darwish	9d57a7f9e2	SELinux: use new audit hooks, remove redundant exports Setup the new Audit LSM hooks for SELinux. Remove the now redundant exported SELinux Audit interface. Audit: Export 'audit_krule' and 'audit_field' to the public since their internals are needed by the implementation of the new LSM hook 'audit_rule_known'. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: James Morris <jmorris@namei.org>	2008-04-19 09:53:46 +10:00
Ahmed S. Darwish	d7a96f3a1a	Audit: internally use the new LSM audit hooks Convert Audit to use the new LSM Audit hooks instead of the exported SELinux interface. Basically, use: security_audit_rule_init secuirty_audit_rule_free security_audit_rule_known security_audit_rule_match instad of (respectively) : selinux_audit_rule_init selinux_audit_rule_free audit_rule_has_selinux selinux_audit_rule_match Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: James Morris <jmorris@namei.org>	2008-04-19 09:52:37 +10:00
Ahmed S. Darwish	2a862b32f3	Audit: use new LSM hooks instead of SELinux exports Stop using the following exported SELinux interfaces: selinux_get_inode_sid(inode, sid) selinux_get_ipc_sid(ipcp, sid) selinux_get_task_sid(tsk, sid) selinux_sid_to_string(sid, ctx, len) kfree(ctx) and use following generic LSM equivalents respectively: security_inode_getsecid(inode, secid) security_ipc_getsecid*(ipcp, secid) security_task_getsecid(tsk, secid) security_sid_to_secctx(sid, ctx, len) security_release_secctx(ctx, len) Call security_release_secctx only if security_secid_to_secctx succeeded. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: James Morris <jmorris@namei.org> Reviewed-by: Paul Moore <paul.moore@hp.com>	2008-04-19 09:52:34 +10:00
Linus Torvalds	73e3e6481f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: clocksource: make clocksource watchdog cycle through online CPUs Documentation: move timer related documentation to a single place clockevents: optimise tick_nohz_stop_sched_tick() a bit locking: remove unused double_spin_lock() hrtimers: simplify lockdep handling timers: simplify lockdep handling posix-timers: fix shadowed variables timer_list: add annotations to workqueue.c hrtimer: use nanosleep specific restart_block fields hrtimer: add nanosleep specific restart_block member	2008-04-18 08:37:41 -07:00
Linus Torvalds	9732b61123	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-kgdb * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-kgdb: kgdb: always use icache flush for sw breakpoints kgdb: fix SMP NMI kgdb_handle_exception exit race kgdb: documentation fixes kgdb: allow static kgdbts boot configuration kgdb: add documentation kgdb: Kconfig fix kgdb: add kgdb internal test suite kgdb: fix several kgdb regressions kgdb: kgdboc pl011 I/O module kgdb: fix optional arch functions and probe_kernel_* kgdb: add x86 HW breakpoints kgdb: print breakpoint removed on exception kgdb: clocksource watchdog kgdb: fix NMI hangs kgdb: fix kgdboc dynamic module configuration kgdb: document parameters x86: kgdb support consoles: polling support, kgdboc kgdb: core uaccess: add probe_kernel_write()	2008-04-18 08:37:01 -07:00
Linus Torvalds	d7bb545d86	Merge branch 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: Remove DEBUG_SEMAPHORE from Kconfig Improve semaphore documentation Simplify semaphore implementation Add down_timeout and change ACPI to use it Introduce down_killable() Generic semaphore implementation Add semaphore.h to kernel_lock.c Fix quota.h includes	2008-04-18 08:25:29 -07:00
Linus Torvalds	4cba84b5d6	Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6 * 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (36 commits) [S390] Remove code duplication from monreader / dcssblk. [S390] kernel: show last breaking-event-address on oops [S390] lowcore: Change type of lowcores softirq_pending to __u32. [S390] zcrypt: Comments and kernel-doc cleanup [S390] uaccess: Always access the correct address space. [S390] Fix a lot of sparse warnings. [S390] Convert s390 to GENERIC_CLOCKEVENTS. [S390] genirq/clockevents: move irq affinity prototypes/inlines to interrupt.h [S390] Convert monitor calls to function calls. [S390] qdio (new feature): enhancing info-retrieval from QDIO-adapters [S390] replace remaining __FUNCTION__ occurrences [S390] remove redundant display of free swap space in show_mem() [S390] qdio: remove outdated developerworks link. [S390] Add debug_register_mode() function to debug feature API [S390] crypto: use more descriptive function names for init/exit routines. [S390] switch sched_clock to store-clock-extended. [S390] zcrypt: add support for large random numbers [S390] hw_random: allow rng_dev_read() to return hardware errors. [S390] Vertical cpu management. [S390] cpu topology support for s390. ...	2008-04-18 08:19:15 -07:00
Roland McGrath	18c98b6527	ptrace_signal subroutine This breaks out the ptrace handling from get_signal_to_deliver into a new subroutine. The actual code there doesn't change, and it gets inlined into nearly identical compiled code. This makes the function substantially shorter and thus easier to read, and it nicely isolates the ptrace magic. Signed-off-by: Roland McGrath <roland@redhat.com> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-18 08:17:57 -07:00
Li Zefan	0e04388f01	cgroup: fix a race condition in manipulating tsk->cg_list When I ran a test program to fork mass processes and at the same time 'cat /cgroup/tasks', I got the following oops: ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:72! invalid opcode: 0000 [#1] SMP Pid: 4178, comm: a.out Not tainted (2.6.25-rc9 #72) ... Call Trace: [<c044a5f9>] ? cgroup_exit+0x55/0x94 [<c0427acf>] ? do_exit+0x217/0x5ba [<c0427ed7>] ? do_group_exit+0.65/0x7c [<c0427efd>] ? sys_exit_group+0xf/0x11 [<c0404842>] ? syscall_call+0x7/0xb [<c05e0000>] ? init_cyrix+0x2fa/0x479 ... EIP: [<c04df671>] list_del+0x35/0x53 SS:ESP 0068:ebc7df4 ---[ end trace caffb7332252612b ]--- Fixing recursive fault but reboot is needed! After digging into the code and debugging, I finlly found out a race situation: do_exit() ->cgroup_exit() ->if (!list_empty(&tsk->cg_list)) list_del(&tsk->cg_list); cgroup_iter_start() ->cgroup_enable_task_cg_list() ->list_add(&tsk->cg_list, ..); In this case the list won't be deleted though the process has exited. We got two bug reports in the past, which seem to be the same bug as this one: http://lkml.org/lkml/2008/3/5/332 http://lkml.org/lkml/2007/10/17/224 Actually sometimes I got oops on list_del, sometimes oops on list_add. And I can change my test program a bit to trigger other oops. The patch has been tested both on x86_32 and x86_64. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-18 08:17:57 -07:00
Jason Wessel	1a9a3e76dd	kgdb: always use icache flush for sw breakpoints On the ppc 4xx architecture the instruction cache must be flushed as well as the data cache. This patch just makes it generic for all architectures where CACHE_FLUSH_IS_SAFE is set to 1. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-17 20:05:43 +02:00
Jason Wessel	56fb709329	kgdb: fix SMP NMI kgdb_handle_exception exit race Fix the problem of protecting the kgdb handle_exception exit which had an NMI race condition, while trying to restore normal system operation. There was a small window after the master processor sets cpu_in_debug to zero but before it has set kgdb_active to zero where a non-master processor in an SMP system could receive an NMI and re-enter the kgdb_wait() loop. As long as the master processor sets the cpu_in_debug before sending the cpu roundup the cpu_in_debug variable can also be used to guard against the race condition. The kgdb_wait() function no longer needs to check kgdb_active because it is done in the arch specific code and handled along with the nmi traps at the low level. This also allows kgdb_wait() to exit correctly if it was entered for some unknown reason due to a spurious NMI that could not be handled by the arch specific code. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-17 20:05:43 +02:00
Jason Wessel	737a460f21	kgdb: fix several kgdb regressions kgdb core fixes: - Check to see that mm->mmap_cache is not null before calling flush_cache_range(), else on arch=ARM it will cause a fatal fault. - Breakpoints should only be restored if they are in the BP_ACTIVE state. - Fix a typo in comments to "kgdb_register_io_module" x86 kgdb fixes: - Fix the x86 arch handler such that on a kill or detach that the appropriate cleanup on the single stepping flags gets run. - Add in the DIE_NMIWATCHDOG call for x86_64 - Touch the nmi watchdog before returning the system to normal operation after performing any kind of kgdb operation, else the possibility exists to trigger the watchdog. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-17 20:05:40 +02:00
Jason Wessel	b4b8ac524d	kgdb: fix optional arch functions and probe_kernel_* Fix two regressions dealing with the kgdb core. 1) kgdb_skipexception and kgdb_post_primary_code are optional functions that are only required on archs that need special exception fixups. 2) The kernel address space scope must be set on any probe_kernel_* function or archs such as ARCH=arm will not allow access to the kernel memory space. As an example, it is required to allow the full kernel address space is when you the kernel debugger to inspect a system call. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-17 20:05:39 +02:00
Jason Wessel	64e9ee3095	kgdb: add x86 HW breakpoints Add HW breakpoints into the arch specific portion of x86 kgdb. In the current x86 kernel.org kernels HW breakpoints are changed out in lazy fashion because there is no infrastructure around changing them when changing to a kernel task or entering the kernel mode via a system call. This lazy approach means that if a user process uses HW breakpoints the kgdb will loose out. This is an acceptable trade off because the developer debugging the kernel is assumed to know what is going on system wide and would be aware of this trade off. There is a minor bug fix to the kgdb core so as to correctly call the hw breakpoint functions with a valid value from the enum. There is also a minor change to the x86_64 startup code when using early HW breakpoints. When the debugger is connected, the cpu startup code must not zero out the HW breakpoint registers or you cannot hit the breakpoints you are interested in, in the first place. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-17 20:05:39 +02:00
Jason Wessel	67baf94cd2	kgdb: print breakpoint removed on exception If kgdb does remove a breakpoint that had a problem on the recursion check, it should also print the address of the breakpoint. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-17 20:05:39 +02:00
Jason Wessel	7c3078b637	kgdb: clocksource watchdog In order to not trip the clocksource watchdog, kgdb must touch the clocksource watchdog on the return to normal system run state. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-17 20:05:38 +02:00
Jason Wessel	dc7d552705	kgdb: core kgdb core code. Handles the protocol and the arch details. [ mingo@elte.hu: heavily modified, simplified and cleaned up. ] [ xemul@openvz.org: use find_task_by_pid_ns ] Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jan Kiszka <jan.kiszka@web.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-17 20:05:37 +02:00
Matthew Wilcox	714493cd54	Improve semaphore documentation Move documentation from semaphore.h to semaphore.c as requested by Andrew Morton. Also reformat to kernel-doc style and add some more notes about the implementation. Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2008-04-17 10:43:01 -04:00
Matthew Wilcox	b17170b2fa	Simplify semaphore implementation By removing the negative values of 'count' and relying on the wait_list to indicate whether we have any waiters, we can simplify the implementation by removing the protection against an unlikely race condition. Thanks to David Howells for his suggestions. Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2008-04-17 10:42:54 -04:00
Matthew Wilcox	f1241c87a1	Add down_timeout and change ACPI to use it ACPI currently emulates a timeout for semaphores with calls to down_trylock and sleep. This produces horrible behaviour in terms of fairness and excessive wakeups. Now that we have a unified semaphore implementation, adding a real down_trylock is almost trivial. Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2008-04-17 10:42:46 -04:00
Matthew Wilcox	f06d968658	Introduce down_killable() down_killable() is the functional counterpart of mutex_lock_killable. Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2008-04-17 10:42:40 -04:00
Matthew Wilcox	64ac24e738	Generic semaphore implementation Semaphores are no longer performance-critical, so a generic C implementation is better for maintainability, debuggability and extensibility. Thanks to Peter Zijlstra for fixing the lockdep warning. Thanks to Harvey Harrison for pointing out that the unlikely() was unnecessary. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: Ingo Molnar <mingo@elte.hu>	2008-04-17 10:42:34 -04:00
Andi Kleen	6993fc5bbc	clocksource: make clocksource watchdog cycle through online CPUs This way it checks if the clocks are synchronized between CPUs too. This might be able to detect slowly drifting TSCs which only go wrong over longer time. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-17 12:22:31 +02:00
Karsten Wiese	903b8a8d48	clockevents: optimise tick_nohz_stop_sched_tick() a bit Call ts = &per_cpu(tick_cpu_sched, cpu); and cpu = smp_processor_id(); once instead of twice. No functional change done, as changed code runs with local irq off. Reduces source lines and text size (20bytes on x86_64). [ akpm@linux-foundation.org: Build fix ] Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-17 12:22:31 +02:00
Oleg Nesterov	8e60e05fdc	hrtimers: simplify lockdep handling In order to avoid the false positive from lockdep, each per-cpu base->lock has the separate lock class and migrate_hrtimers() uses double_spin_lock(). This is overcomplicated: except for migrate_hrtimers() we never take 2 locks at once, and migrate_hrtimers() can use spin_lock_nested(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-17 12:22:31 +02:00
Oleg Nesterov	0d180406f2	timers: simplify lockdep handling In order to avoid the false positive from lockdep, each per-cpu base->lock has the separate lock class and migrate_timers() uses double_spin_lock(). This all is overcomplicated: except for migrate_timers() we never take 2 locks at once, and migrate_timers() can use spin_lock_nested(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-17 12:22:31 +02:00
WANG Cong	ee7dd205b5	posix-timers: fix shadowed variables Fix sparse warnings like this: kernel/posix-cpu-timers.c:1090:25: warning: symbol 't' shadows an earlier one kernel/posix-cpu-timers.c:1058:21: originally declared here Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-17 12:22:30 +02:00
Pavel Machek	d59b949f77	timer_list: add annotations to workqueue.c Add timer list annotations to workqueue.c so we can see the call site in the timer stats. Signed-off-by: Pavel Machek <Pavel@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-17 12:22:30 +02:00
Thomas Gleixner	029a07e031	hrtimer: use nanosleep specific restart_block fields Convert all the nanosleep related users of restart_block to the new nanosleep specific restart_block fields. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-04-17 12:22:30 +02:00
Russell King	d7b906897e	[S390] genirq/clockevents: move irq affinity prototypes/inlines to interrupt.h > Generic code is not supposed to include irq.h. Replace this include > by linux/hardirq.h instead and add/replace an include of linux/irq.h > in asm header files where necessary. > This change should only matter for architectures that make use of > GENERIC_CLOCKEVENTS. > Architectures in question are mips, x86, arm, sh, powerpc, uml and sparc64. > > I did some cross compile tests for mips, x86_64, arm, powerpc and sparc64. > This patch fixes also build breakages caused by the include replacement in > tick-common.h. I generally dislike adding optional linux/* includes in asm/* includes - I'm nervous about this causing include loops. However, there's a separate point to be discussed here. That is, what interfaces are expected of every architecture in the kernel. If generic code wants to be able to set the affinity of interrupts, then that needs to become part of the interfaces listed in linux/interrupt.h rather than linux/irq.h. So what I suggest is this approach instead (against Linus' tree of a couple of days ago) - we move irq_set_affinity() and irq_can_set_affinity() to linux/interrupt.h, change the linux/irq.h includes to linux/interrupt.h and include asm/irq_regs.h where needed (asm/irq_regs.h is supposed to be rarely used include since not much touches the stacked parent context registers.) Build tested on ARM PXA family kernels and ARM's Realview platform kernels which both use genirq. [ tglx@linutronix.de: add GENERIC_HARDIRQ dependencies ] Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>	2008-04-17 07:47:05 +02:00
Linus Torvalds	093a07e2fd	Fix locking bug in "acquire_console_semaphore_for_printk()" When I cleaned up printk() and split up the printk locking logic in commit `266c2e0abe` ("Make printk() console semaphore accesses sensible") I had incorrectly moved the call to have_callable_console() outside of the console semaphore. That was buggy. The console semaphore protects the console_drivers list that is used by have_callable_console(). Thanks go to Bongani Hlope who saw this as a hang on shutdown and reboot and bisected the bug to the right commit, and tested this patch. See http://lkml.org/lkml/2008/4/11/315 Bisected-and-tested-by: Bongani Hlope <bonganilinux@mweb.co.za> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-15 13:09:54 -07:00
Pavel Machek	6afe1a1fe8	PM: Remove legacy PM AFAICT pm_send_all is a nop when noone uses pm_register... Hmm.. can we just force CONFIG_PM_LEGACY=n, and see what happens? Or maybe this is better idea? It may break build somewhere, but it should be easy to fix... (it builds here, i386 and x86-64). Signed-off-by: Pavel Machek <pavel@suse.cz> Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Len Brown <len.brown@intel.com>	2008-04-15 03:19:07 -04:00
Ingo Molnar	e2df9e0905	revert "sched: fix fair sleepers" revert "sched: fix fair sleepers" (`e22ecef1d2`), because it is causing audio skipping, see: http://bugzilla.kernel.org/show_bug.cgi?id=10428 the patch is correct and the real cause of the skipping is not understood (tracing makes it go away), but time has run out so we'll revert it and re-try in 2.6.26. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-14 14:26:23 +02:00
Paul Menage	b6c3006d20	cgroups: include hierarchy ids in /proc/<pid>/cgroup Extend the /proc/<pid>/cgroup file to include the appropriate hierarchy ID on each line. Currently this ID isn't really needed since a hierarchy can be completely identified by the set of subsystems bound to it, but this is likely to change in the near future in order to support stateless subsystems and merging/rebinding of subsystems. Getting this change into 2.6.25 reduces the need for an API change later. Signed-off-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-11 08:06:43 -07:00
Roland McGrath	54a0151041	asmlinkage_protect replaces prevent_tail_call The prevent_tail_call() macro works around the problem of the compiler clobbering argument words on the stack, which for asmlinkage functions is the caller's (user's) struct pt_regs. The tail/sibling-call optimization is not the only way that the compiler can decide to use stack argument words as scratch space, which we have to prevent. Other optimizations can do it too. Until we have new compiler support to make "asmlinkage" binding on the compiler's own use of the stack argument frame, we have work around all the manifestations of this issue that crop up. More cases seem to be prevented by also keeping the incoming argument variables live at the end of the function. This makes their original stack slots attractive places to leave those variables, so the compiler tends not clobber them for something else. It's still no guarantee, but it handles some observed cases that prevent_tail_call() did not. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-10 17:28:26 -07:00
Paul Menage	8bab8dded6	cgroups: add cgroup support for enabling controllers at boot time The effects of cgroup_disable=foo are: - foo isn't auto-mounted if you mount all cgroups in a single hierarchy - foo isn't visible as an individually mountable subsystem As a result there will only ever be one call to foo->create(), at init time; all processes will stay in this group, and the group will never be mounted on a visible hierarchy. Any additional effects (e.g. not allocating metadata) are up to the foo subsystem. This doesn't handle early_init subsystems (their "disabled" bit isn't set be, but it could easily be extended to do so if any of the early_init systems wanted it - I think it would just involve some nastier parameter processing since it would occur before the command-line argument parser had been run. Hugh said: Ballpark figures, I'm trying to get this question out rather than processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead to the affected paths, booting with cgroup_disable=memory cuts that back to 1% overhead (due to slightly bigger struct page). I'm no expert on distros, they may have no interest whatever in CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or without it, or apply the cgroup_disable=memory patches. Unix bench's execl test result on x86_64 was == just after boot without mounting any cgroup fs.== mem_cgorup=off : Execl Throughput 43.0 3150.1 732.6 mem_cgroup=on : Execl Throughput 43.0 2932.6 682.0 == [lizf@cn.fujitsu.com: fix boot option parsing] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-04 14:46:26 -07:00
Mathieu Desnoyers	6496968e6c	markers: use synchronize_sched() Markers do not mix well with CONFIG_PREEMPT_RCU because it uses preempt_disable/enable() and not rcu_read_lock/unlock for minimal intrusiveness. We would need call_sched and sched_barrier primitives. Currently, the modification (connection and disconnection) of probes from markers requires changes to the data structure done in RCU-style : a new data structure is created, the pointer is changed atomically, a quiescent state is reached and then the old data structure is freed. The quiescent state is reached once all the currently running preempt_disable regions are done running. We use the call_rcu mechanism to execute kfree() after such quiescent state has been reached. However, the new CONFIG_PREEMPT_RCU version of call_rcu and rcu_barrier does not guarantee that all preempt_disable code regions have finished, hence the race. The "proper" way to do this is to use rcu_read_lock/unlock, but we don't want to use it to minimize intrusiveness on the traced system. (we do not want the marker code to call into much of the OS code, because it would quickly restrict what can and cannot be instrumented, such as the scheduler). The temporary fix, until we get call_rcu_sched and rcu_barrier_sched in mainline, is to use synchronize_sched before each call_rcu calls, so we wait for the quiescent state in the system call code path. It will slow down batch marker enable/disable, but will make sure the race is gone. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-04-02 15:28:19 -07:00
Al Viro	8481664d37	futex_compat __user annotation Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-30 14:18:41 -07:00
Al Viro	9dce07f1a4	NULL noise: fs/, mm/, kernel/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-30 14:18:41 -07:00
Dave Jones	f706d5d22c	audit: silence two kerneldoc warnings in kernel/audit.c Silence two kerneldoc warnings. Warning(kernel/audit.c:1276): No description found for parameter 'string' Warning(kernel/audit.c:1276): No description found for parameter 'len' [also fix a typo for bonus points] Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-28 14:45:21 -07:00
YAMAMOTO Takashi	1d4a788f15	memcgroup: fix spurious EBUSY on memory cgroup removal Call mm_free_cgroup earlier. Otherwise a reference due to lazy mm switching can prevent cgroup removal. Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-28 14:45:21 -07:00
Benjamin Herrenschmidt	f6d107fb10	Give futex init a proper name The futex init function is called init(). This is a pain in the neck when debugging when you code dies in ... init :-) This renames it to futex_init(). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-27 08:02:13 -07:00
Linus Torvalds	8f404faa72	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: NOHZ: reevaluate idle sleep length after add_timer_on() clocksource: revert: use init_timer_deferrable for clocksource_watchdog	2008-03-26 11:29:35 -07:00
Jens Axboe	5eb7f9fa84	relay: set an spd_release() hook for splice relay doesn't reference the pages it adds, however we need a non-NULL hook or splice_to_pipe() can oops. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-03-26 12:04:09 +01:00
Lai Jiangshan	37529fe9f6	set relay file can not be read by pread(2) I found that relay files can be read by pread(2). I fix it, for relay files are not capable of seeking. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-03-26 12:01:28 +01:00
Thomas Gleixner	06d8308c61	NOHZ: reevaluate idle sleep length after add_timer_on() add_timer_on() can add a timer on a CPU which is currently in a long idle sleep, but the timer wheel is not reevaluated by the nohz code on that CPU. So a timer can be delayed for quite a long time. This triggered a false positive in the clocksource watchdog code. To avoid this we need to wake up the idle CPU and enforce the reevaluation of the timer wheel for the next timer event. Add a function, which checks a given CPU for idle state, marks the idle task with NEED_RESCHED and sends a reschedule IPI to notify the other CPU of the change in the timer wheel. Call this function from add_timer_on(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: stable@kernel.org -- include/linux/sched.h \| 6 ++++++ kernel/sched.c \| 43 +++++++++++++++++++++++++++++++++++++++++++ kernel/timer.c \| 10 +++++++++- 3 files changed, 58 insertions(+), 1 deletion(-)	2008-03-26 08:28:55 +01:00
Thomas Gleixner	898a19de15	clocksource: revert: use init_timer_deferrable for clocksource_watchdog Revert commit `1077f5a917` Author: Parag Warudkar <parag.warudkar@gmail.com> Date: Wed Jan 30 13:30:01 2008 +0100 clocksource.c: use init_timer_deferrable for clocksource_watchdog clocksource_watchdog can use a deferrable timer - reduces wakeups from idle per second. The watchdog timer needs to run with the specified interval. Otherwise it will miss the possible wrap of the watchdog clocksource. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2008-03-25 20:13:25 +01:00
Linus Torvalds	266c2e0abe	Make printk() console semaphore accesses sensible The printk() logic on when/how to get the console semaphore was unreadable, this splits the code up into a few helper functions and makes it easier to follow what is going on. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-24 19:25:08 -07:00
Pavel Emelyanov	5f7b703fe2	bsd_acct: using task_struct->tgid is not right in pid-namespaces In case we're accounting from a sub-namespace, the tgids reported will not refer to the right namespace. Save the pid_namespace we're accounting in on the acct_glbs and use it in do_acct_process. Two less :) places using the task_struct.tgid member. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-24 19:22:20 -07:00
Pavel Emelyanov	a846a1954b	bsd_acct: plain current->real_parent access is not always safe This is minor, but dereferencing even current real_parent is not safe on debug kernels, since the memory, this points to, can be unmapped - RCU protection is required. Besides, the tgid field is deprecated and is to be replaced with task_tgid_xxx call (the 2nd patch), so RCU will be required anyway. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-24 19:22:19 -07:00
Mathieu Desnoyers	58336114af	markers: remove ACCESS_ONCE As Paul pointed out, the ACCESS_ONCE are not needed because we already have the explicit surrounding memory barriers. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Mason <mmlnx@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: David Smith <dsmith@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Adrian Bunk <adrian.bunk@movial.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-24 19:22:19 -07:00
Mathieu Desnoyers	fd3c36f8b5	markers: update preempt_disable. call_rcu, rcu_barrier comments Add comments requested by Andrew. Updated comments about synchronize_sched(). Since we use call_rcu and rcu_barrier now, these comments were out of sync with the code. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Mason <mmlnx@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: David Smith <dsmith@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Adrian Bunk <adrian.bunk@movial.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-24 19:22:19 -07:00
Linus Torvalds	92896bd9fd	Don't 'printk()' while holding xtime lock for writing The printk() can deadlock because it can wake up klogd(), and task enqueueing will try to read the time in order to set a hrtimer. Reported-by: Marcin Slusarz <marcin.slusarz@gmail.com> Debugged-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-24 11:07:15 -07:00
Heiko Carstens	22e52b072d	sched: add arch_update_cpu_topology hook. Will be called each time the scheduling domains are rebuild. Needed for architectures that don't have a static cpu topology. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-21 16:43:48 +01:00
Heiko Carstens	9aefd0abd8	sched: add exported arch_reinit_sched_domains() to header file. Needed so it can be called from outside of sched.c. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-21 16:43:47 +01:00
Roel Kluin	23e3c3cd2e	sched: remove double unlikely from schedule() Combine two unlikely's Signed-off-by: Roel Kluin <12o3l@tiscali.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-21 16:43:47 +01:00
Peter Zijlstra	2070ee01d3	sched: cleanup old and rarely used 'debug' features. TREE_AVG and APPROX_AVG are initial task placement policies that have been disabled for a long while.. time to remove them. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-21 16:43:47 +01:00
Linus Torvalds	7d3628b230	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (46 commits) [NET] ifb: set separate lockdep classes for queue locks [IPV6] KCONFIG: Fix description about IPV6_TUNNEL. [TCP]: Fix shrinking windows with window scaling netpoll: zap_completion_queue: adjust skb->users counter bridge: use time_before() in br_fdb_cleanup() [TG3]: Fix build warning on sparc32. MAINTAINERS: bluez-devel is subscribers-only audit: netlink socket can be auto-bound to pid other than current->pid (v2) [NET]: Fix permissions of /proc/net [SCTP]: Fix a race between module load and protosw access [NETFILTER]: ipt_recent: sanity check hit count [NETFILTER]: nf_conntrack_h323: logical-bitwise & confusion in process_setup() [RT2X00] drivers/net/wireless/rt2x00/rt2x00dev.c: remove dead code, fix warning [IPV4]: esp_output() misannotations [8021Q]: vlan_dev misannotations xfrm: ->eth_proto is __be16 [IPV4]: ipv4_is_lbcast() misannotations [SUNRPC]: net/* NULL noise [SCTP]: fix misannotated __sctp_rcv_asconf_lookup() [PKT_SCHED]: annotate cls_u32 ...	2008-03-21 07:57:45 -07:00
Pavel Emelyanov	75c0371a2d	audit: netlink socket can be auto-bound to pid other than current->pid (v2) From: Pavel Emelyanov <xemul@openvz.org> This patch is based on the one from Thomas. The kauditd_thread() calls the netlink_unicast() and passes the audit_pid to it. The audit_pid, in turn, is received from the user space and the tool (I've checked the audit v1.6.9) uses getpid() to pass one in the kernel. Besides, this tool doesn't bind the netlink socket to this id, but simply creates it allowing the kernel to auto-bind one. That's the preamble. The problem is that netlink_autobind() _does_not_ guarantees that the socket will be auto-bound to the current pid. Instead it uses the current pid as a hint to start looking for a free id. So, in case of conflict, the audit messages can be sent to a wrong socket. This can happen (it's unlikely, but can be) in case some task opens more than one netlink sockets and then the audit one starts - in this case the audit's pid can be busy and its socket will be bound to another id. The proposal is to introduce an audit_nlk_pid in audit subsys, that will point to the netlink socket to send packets to. It will most often be equal to audit_pid. The socket id can be got from the skb's netlink CB right in the audit_receive_msg. The audit_nlk_pid reset to 0 is not required, since all the decisions are taken based on audit_pid value only. Later, if the audit tools will bind the socket themselves, the kernel will have to provide a way to setup the audit_nlk_pid as well. A good side effect of this patch is that audit_pid can later be converted to struct pid, as it is not longer safe to use pid_t-s in the presence of pid namespaces. But audit code still uses the tgid from task_struct in the audit_signal_info and in the audit_filter_syscall. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-03-20 15:39:41 -07:00
Andrew Morton	3150e63df4	revert "clocksource: make clocksource watchdog cycle through online CPUs" Revert commit `1ada5cba6a` ("clocksource: make clocksource watchdog cycle through online CPUs") due to the regression reported by Gabriel C at http://lkml.org/lkml/2008/2/24/281 (short vesion: it makes TSC be marked as always unstable on his machine). Cc: Andi Kleen <ak@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Robert Hancock <hancockr@shaw.ca> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Gabriel C <nix.or.die@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-19 18:53:37 -07:00
Ingo Molnar	74e3cd7f48	sched: retune wake granularity reduce wake-up granularity for better interactivity. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:53 +01:00
Ingo Molnar	f540a6080a	sched: wakeup-buddy tasks are cache-hot Wakeup-buddy tasks are cache-hot - this makes it a bit harder for the load-balancer to tear them apart. (but it's still possible, if the load is sufficiently assymetric) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:53 +01:00
Ingo Molnar	4ae7d5cefd	sched: improve affine wakeups improve affine wakeups. Maintain the 'overlap' metric based on CFS's sum_exec_runtime - which means the amount of time a task executes after it wakes up some other task. Use the 'overlap' for the wakeup decisions: if the 'overlap' is short, it means there's strong workload coupling between this task and the woken up task. If the 'overlap' is large then the workload is decoupled and the scheduler will move them to separate CPUs more easily. ( Also slightly move the preempt_check within try_to_wake_up() - this has no effect on functionality but allows 'early wakeups' (for still-on-rq tasks) to be correctly accounted as well.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:53 +01:00
Ingo Molnar	f48273860e	sched: clean up wakeup balancing, code flow Clean up the code flow. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 09b31c44e9aff8666f72773dc433e2df sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:53 +01:00
Ingo Molnar	ac192d3921	sched: clean up wakeup balancing, rename variables rename 'cpu' to 'prev_cpu'. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 09b31c44e9aff8666f72773dc433e2df sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:52 +01:00
Ingo Molnar	098fb9db2c	sched: clean up wakeup balancing, move wake_affine() split out the affine-wakeup bits. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 9d76738f1272aa82f0b7affd2f51df6b sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm (the md5's changed because stack slots changed and some registers get scheduled by gcc in a different order - but otherwise the before and after assembly is instruction for instruction equivalent.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:52 +01:00
Jens Axboe	16d5466942	relay: fix subbuf_splice_actor() adding too many pages If subbuf_pages was larger than the max number of pages the pipe buffer will hold, subbuf_splice_actor() would happily go beyond the array size. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-03-17 09:04:59 +01:00
Ingo Molnar	6a6029b8ce	sched: simplify sched_slice() Use the existing calc_delta_mine() calculation for sched_slice(). This saves a divide and simplifies the code because we share it with the other /cfs_rq->load users. It also improves code size: text data bss dec hex filename 42659 2740 144 45543 b1e7 sched.o.before 42093 2740 144 44977 afb1 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-03-15 03:02:50 +01:00
Ingo Molnar	e22ecef1d2	sched: fix fair sleepers Fair sleepers need to scale their latency target down by runqueue weight. Otherwise busy systems will gain ever larger sleep bonus. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-03-15 03:02:50 +01:00
Peter Zijlstra	aa2ac25229	sched: fix overload performance: buddy wakeups Currently we schedule to the leftmost task in the runqueue. When the runtimes are very short because of some server/client ping-pong, especially in over-saturated workloads, this will cycle through all tasks trashing the cache. Reduce cache trashing by keeping dependent tasks together by running newly woken tasks first. However, by not running the leftmost task first we could starve tasks because the wakee can gain unlimited runtime. Therefore we only run the wakee if its within a small (wakeup_granularity) window of the leftmost task. This preserves fairness, but does alternate server/client task groups. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-15 03:02:50 +01:00
Ingo Molnar	27d1172660	sched: fix calc_delta_mine() lw->weight can be 0 for a short time during bootup. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-03-15 03:02:50 +01:00
Ingo Molnar	e89996ae3f	sched: fix update_load_add()/sub() Clear the cached inverse value when updating load. This is needed for calc_delta_mine() to work correctly when using the rq load. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-03-15 03:02:49 +01:00
Peter Zijlstra	3fe69747da	sched: min_vruntime fix Current min_vruntime tracking is incorrect and will cause serious problems when we don't run the leftmost task for some reason. min_vruntime does two things; 1) it's used to determine a forward direction when the u64 vruntime wraps, 2) it's used to track the leftmost vruntime to position newly enqueued tasks from. The current logic advances min_vruntime whenever the current task's vruntime advance. Because the current task may pass the leftmost task still waiting we're failing the second goal. This causes new tasks to be placed too far ahead and thus penalizes their runtime. Fix this by making min_vruntime the min_vruntime of the waiting tasks by tracking it in enqueue/dequeue, and compare against current's vruntime to obtain the absolute minimum when placing new tasks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-15 03:02:49 +01:00
Hiroshi Shimamoto	0e1f34833b	sched: fix race in schedule() Fix a hard to trigger crash seen in the -rt kernel that also affects the vanilla scheduler. There is a race condition between schedule() and some dequeue/enqueue functions; rt_mutex_setprio(), __setscheduler() and sched_move_task(). When scheduling to idle, idle_balance() is called to pull tasks from other busy processor. It might drop the rq lock. It means that those 3 functions encounter on_rq=0 and running=1. The current task should be put when running. Here is a possible scenario: CPU0 CPU1 \| schedule() \| ->deactivate_task() \| ->idle_balance() \| -->load_balance_newidle() rt_mutex_setprio() \| \| --->double_lock_balance() get lock rel lock * on_rq=0, ruuning=1 \| * sched_class is changed \| rel lock get lock : \| : ->put_prev_task_rt() ->pick_next_task_fair() => panic The current process of CPU1(P1) is scheduling. Deactivated P1, and the scheduler looks for another process on other CPU's runqueue because CPU1 will be idle. idle_balance(), load_balance_newidle() and double_lock_balance() are called and double_lock_balance() could drop the rq lock. On the other hand, CPU0 is trying to boost the priority of P1. The result of boosting only P1's prio and sched_class are changed to RT. The sched entities of P1 and P1's group are never put. It makes cfs_rq invalid, because the cfs_rq has curr and no leaf, but pick_next_task_fair() is called, then the kernel panics. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-15 03:02:49 +01:00
Len Brown	29ea5171cb	Merge branches 'release' and 'doc' into release	2008-03-13 01:59:53 -04:00
Randy Dunlap	53471121a8	documentation: Move power-related files to Documentation/power/ Move 00-INDEX entries to power/00-INDEX (and add entry for pm_qos_interface.txt). Update references to moved filenames. Fix some trailing whitespace. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Len Brown <len.brown@intel.com>	2008-03-12 18:10:51 -04:00
Rafael J. Wysocki	a82f7119fd	Hibernation: Fix mark_nosave_pages() There is a problem in the hibernation code that triggers on some NUMA systems on which pfn_valid() returns 'true' for some PFNs that don't belong to any zone. Namely, there is a BUG_ON() in memory_bm_find_bit() that triggers for PFNs not belonging to any zone and passing the pfn_valid() test. On the affected systems it triggers when we mark PFNs reported by the platform as not saveable, because the PFNs in question belong to a region mapped directly using iorepam() (i.e. the ACPI data area) and they pass the pfn_valid() test. Modify memory_bm_find_bit() so that it returns an error if given PFN doesn't belong to any zone instead of crashing the kernel and ignore the result returned by it in mark_nosave_pages(), while marking the "nosave" memory regions. This doesn't affect the hibernation functionality, as we won't touch the PFNs in question anyway. http://bugzilla.kernel.org/show_bug.cgi?id=9966 . Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Len Brown <len.brown@intel.com>	2008-03-11 23:15:55 -04:00
Gregory Haskins	08f503b0c0	keep rd->online and cpu_online_map in sync It is possible to allow the root-domain cache of online cpus to become out of sync with the global cpu_online_map. This is because we currently trigger removal of cpus too early in the notifier chain. Other DOWN_PREPARE handlers may in fact run and reconfigure the root-domain topology, thereby stomping on our own offline handling. The end result is that rd->online may become out of sync with cpu_online_map, which results in potential task misrouting. So change the offline handling to be more tightly coupled with the global offline process by triggering on CPU_DYING intead of CPU_DOWN_PREPARE. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-11 14:02:58 +01:00
Gregory Haskins	1f94ef598e	Revert "cpu hotplug: adjust root-domain->online span in response to hotplug event" This reverts commit `393d94d98b`. Lets fix this right. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-11 14:02:58 +01:00
Paul E. McKenney	21bbb39c37	rcu: move PREEMPT_RCU config option back under PREEMPT The original preemptible-RCU patch put the choice between classic and preemptible RCU into kernel/Kconfig.preempt, which resulted in build failures on machines not supporting CONFIG_PREEMPT. This choice was therefore moved to init/Kconfig, which worked, but placed the choice between classic and preemptible RCU at the top level, a very obtuse choice indeed. This patch changes from the Kconfig "choice" mechanism to a pair of booleans, only one of which (CONFIG_PREEMPT_RCU) is user-visible, and is located in kernel/Kconfig.preempt, where one would expect it to be. The other (CONFIG_CLASSIC_RCU) is in init/Kconfig so that it is available to all architectures, hopefully avoiding build breakage. Thanks to Roman Zippel for suggesting this approach. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Josh Triplett <josh@freedesktop.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-10 18:01:20 -07:00
Alexey Dobriyan	e24e2e64c4	modules: warn about suspicious return values from module's ->init() hook Return value convention of module's init functions is 0/-E. Sometimes, e.g. during forward-porting mistakes happen and buggy module created, where result of comparison "workqueue != NULL" is propagated all the way up to sys_init_module. What happens is that some other module created workqueue in question, our module created it again and module was successfully loaded. Or it could be some other bug. Let's make such mistakes much more visible. In retrospective, such messages would noticeably shorten some of my head-scratching sessions. Note, that dump_stack() is just a way to get attention from user. Sample message: sys_init_module: 'foo'->init suspiciously returned 1, it should follow 0/-E convention sys_init_module: loading module anyway... Pid: 4223, comm: modprobe Not tainted 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #5 Call Trace: [<ffffffff80254b05>] sys_init_module+0xe5/0x1d0 [<ffffffff8020b39b>] system_call_after_swapgs+0x7b/0x80 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-10 18:01:20 -07:00
Rusty Russell	6c5db22d28	modules: fix module waiting for dependent modules' init Commit `c9a3ba55` (module: wait for dependent modules doing init.) didn't quite work because the waiter holds the module lock, meaning that the state of the module it's waiting for cannot change. Fortunately, it's fairly simple to update the state outside the lock and do the wakeup. Thanks to Jan Glauber for tracking this down and testing (qdio and qeth). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-10 18:01:19 -07:00
Linus Torvalds	bf5a25e1ff	Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: time: remove obsolete CLOCK_TICK_ADJUST time: don't touch an offlined CPU's ts->tick_stopped in tick_cancel_sched_timer() time: prevent the loop in timespec_add_ns() from being optimised away ntp: use unsigned input for do_div()	2008-03-09 10:06:49 -07:00
Gregory Haskins	393d94d98b	cpu hotplug: adjust root-domain->online span in response to hotplug event We currently set the root-domain online span automatically when the domain is added to the cpu if the cpu is already a member of cpu_online_map. This was done as a hack/bug-fix for s2ram, but it also causes a problem with hotplug CPU_DOWN transitioning. The right way to fix the original problem is to actually respond to CPU_UP events, instead of CPU_ONLINE, which is already too late. This solves the hung reboot regression reported by Andrew Morton and others. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-09 10:05:14 -07:00
Roman Zippel	10a398d04c	time: remove obsolete CLOCK_TICK_ADJUST The first version of the ntp_interval/tick_length inconsistent usage patch was recently merged as `bbe4d18ac2` http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bbe4d18ac2e058c56adb0cd71f49d9ed3216a405 While the fix did greatly improve the situation, it was correctly pointed out by Roman that it does have a small bug: If the users change clocksources after the system has been running and NTP has made corrections, the correctoins made against the old clocksource will be applied against the new clocksource, causing error. The second attempt, which corrects the issue in the NTP_INTERVAL_LENGTH definition has also made it up-stream as commit `e13a2e61dd` http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e13a2e61dd5152f5499d2003470acf9c838eab84 Roman has correctly pointed out that CLOCK_TICK_ADJUST is calculated based on the PIT's frequency, and isn't really relevant to non-PIT driven clocksources (that is, clocksources other then jiffies and pit). This patch reverts both of those changes, and simply removes CLOCK_TICK_ADJUST. This does remove the granularity error correction for users of PIT and Jiffies clocksource users, but the granularity error but for the majority of users, it should be within the 500ppm range NTP can accommodate for. For systems that have granularity errors greater then 500ppm, the "ntp_tick_adj=" boot option can be used to compensate. [johnstul@us.ibm.com: provided changelog] [mattilinnanvuori@yahoo.com: maek ntp_tick_adj static] Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: john stultz <johnstul@us.ibm.com> Signed-off-by: Matti Linnanvuori <mattilinnanvuori@yahoo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: mingo@elte.hu Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-03-09 08:42:57 +01:00
Karsten Wiese	a79017660e	time: don't touch an offlined CPU's ts->tick_stopped in tick_cancel_sched_timer() Silences WARN_ONs in rcu_enter_nohz() and rcu_exit_nohz(), which appeared before caused by (repeated) calls to: $ echo 0 > /sys/devices/system/cpu/cpu1/online $ echo 1 > /sys/devices/system/cpu/cpu1/online Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de> Cc: johnstul@us.ibm.com Cc: Rafael Wysocki <rjw@sisk.pl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-03-09 08:42:57 +01:00
David Howells	e48af19f56	ntp: use unsigned input for do_div() The kernel NTP code shouldn't hand 64-bit signed values to do_div(). Make it instead hand 64-bit unsigned values. This gets rid of a couple of warnings. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-03-09 08:42:57 +01:00
Roland McGrath	6efcae4601	Fix waitid si_code regression In commit `ee7c82da83` ("wait_task_stopped: simplify and fix races with SIGCONT/SIGKILL/untrace"), the magic (short) cast when storing si_code was lost in wait_task_stopped. This leaks the in-kernel CLD_* values that do not match what userland expects. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-08 11:54:00 -08:00
Dhaval Giani	521f1a2489	sched: don't allow rt_runtime_us to be zero for groups having rt tasks This patch checks if we can set the rt_runtime_us to 0. If there is a realtime task in the group, we don't want to set the rt_runtime_us as 0 or bad things will happen. (that task wont get any CPU time despite being TASK_RUNNNG) Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-07 16:43:00 +01:00
Peter Zijlstra	2692a2406b	sched: rt-group: fixup schedulability constraints calculation it was only possible to configure the rt-group scheduling parameters beyond the default value in a very small range. that's because div64_64() has a different calling convention than do_div() :/ fix a few untidies while we are here; sysctl_sched_rt_period may overflow due to that multiplication, so cast to u64 first. Also that RUNTIME_INF juggling makes little sense although its an effective NOP. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-07 16:43:00 +01:00
Miao Xie	1868f958eb	sched: fix the wrong time slice value for SCHED_FIFO tasks Function sys_sched_rr_get_interval returns wrong time slice value for SCHED_FIFO tasks. The time slice for SCHED_FIFO tasks should be 0. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-07 16:43:00 +01:00
Pavel Roskin	150d8bede7	sched: export task_nice The API is trivial, and so is the implementation. Signed-off-by: Pavel Roskin <proski@gnu.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-07 16:43:00 +01:00
Steven Rostedt	6fa46fa526	sched: balance RT task resched only on runqueue Sripathi Kodi reported a crash in the -rt kernel: https://bugzilla.redhat.com/show_bug.cgi?id=435674 this is due to a place that can reschedule a task without holding the tasks runqueue lock. This was caused by the RT balancing code that pulls RT tasks to the current run queue and will reschedule the current task. There's a slight chance that the pulling of the RT tasks will release the current runqueue's lock and retake it (in the double_lock_balance). During this time that the runqueue is released, the current task can migrate to another runqueue. In the prio_changed_rt code, after the pull, if the current task is of lesser priority than one of the RT tasks pulled, resched_task is called on the current task. If the current task had migrated in that small window, resched_task will be called without holding the runqueue lock for the runqueue that the task is on. This race condition also exists in the mainline kernel and this patch adds a check to make sure the task hasn't migrated before calling resched_task. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Tested-by: Sripathi Kodi <sripathik@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-07 16:43:00 +01:00
Peter Zijlstra	810b38179e	sched: retain vruntime Kei Tokunaga reported an interactivity problem when moving tasks between control groups. Tasks would retain their old vruntime when moved between groups, this can cause funny lags. Re-set the vruntime on group move to fit within the new tree. Reported-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-07 16:42:59 +01:00
David Rientjes	41f7f60d31	cpusets: fix obsolete comment mm migration is no longer done in cpuset_update_task_memory_state() so it can no longer take current->mm->mmap_sem, so fix the obsolete comment. [ This changed in commit `04c19fa6f1` ("cpuset: migrate all tasks in cpuset at once") when the mm migration was moved from cpuset_update_task_memory_state() to update_nodemask() ] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-05 17:53:33 -08:00
Pavel Roskin	9b37ccfc63	module: allow ndiswrapper to use GPL-only symbols A change after 2.6.24 broke ndiswrapper by accidentally removing its access to GPL-only symbols. Revert that change and add comments about the reasons why ndiswrapper and driverloader are treated in a special way. Signed-off-by: Pavel Roskin <proski@gnu.org> Acked-by: Greg KH <gregkh@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jon Masters <jonathan@jonmasters.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 20:29:40 -08:00
Masami Hiramatsu	b2a5cd6938	kprobes: fix a null pointer bug in register_kretprobe() Fix a bug in regiseter_kretprobe() which does not check rp->kp.symbol_name == NULL before calling kprobe_lookup_name. For maintainability, this introduces kprobe_addr helper function which resolves addr field. It is used by register_kprobe and register_kretprobe. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:19 -08:00
Jesper Juhl	544adb4107	markers: don't risk NULL deref in marker get_marker() may return NULL, so test for it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:14 -08:00
Ananth N Mavinakayanahalli	9edddaa200	Kprobes: indicate kretprobe support in Kconfig Add CONFIG_HAVE_KRETPROBES to the arch/<arch>/Kconfig file for relevant architectures with kprobes support. This facilitates easy handling of in-kernel modules (like samples/kprobes/kretprobe_example.c) that depend on kretprobes being present in the kernel. Thanks to Sam Ravnborg for helping make the patch more lean. Per Mathieu's suggestion, added CONFIG_KRETPROBES and fixed up dependencies. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:11 -08:00
Balbir Singh	fb78922ce9	Memory Resource Controller use strstrip while parsing arguments The memory controller has a requirement that while writing values, we need to use echo -n. This patch fixes the problem and makes the UI more consistent. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:09 -08:00
Li Zefan	b6abdb0e6c	cgroup: fix default notify_on_release setting The documentation says the default value of notify_on_release of a child cgroup is inherited from its parent, which is reasonable, but the implementation just sets the flag disabled. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 16:35:09 -08:00
Linus Torvalds	87baa2bb90	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel: sched: revert load_balance_monitor() changes	2008-03-04 09:23:28 -08:00
Peter Zijlstra	62fb185130	sched: revert load_balance_monitor() changes The following commits cause a number of regressions: commit `58e2d4ca58` Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Fri Jan 25 21:08:00 2008 +0100 sched: group scheduling, change how cpu load is calculated commit `6b2d770026` Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Fri Jan 25 21:08:00 2008 +0100 sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups Namely: - very frequent wakeups on SMP, reported by PowerTop users. - cacheline trashing on (large) SMP - some latencies larger than 500ms While there is a mergeable patch to fix the latter, the former issues are not fixable in a manner suitable for .25 (we're at -rc3 now). Hence we revert them and try again in v2.6.26. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-04 17:54:06 +01:00
Roland McGrath	13b1c3d4b4	freezer vs stopped or traced This changes the "freezer" code used by suspend/hibernate in its treatment of tasks in TASK_STOPPED (job control stop) and TASK_TRACED (ptrace) states. As I understand it, the intent of the "freezer" is to hold all tasks from doing anything significant. For this purpose, TASK_STOPPED and TASK_TRACED are "frozen enough". It's possible the tasks might resume from ptrace calls (if the tracer were unfrozen) or from signals (including ones that could come via timer interrupts, etc). But this doesn't matter as long as they quickly block again while "freezing" is in effect. Some minor adjustments to the signal.c code make sure that try_to_freeze() very shortly follows all wakeups from both kinds of stop. This lets the freezer code safely leave stopped tasks unmolested. Changing this fixes the longstanding bug of seeing after resuming from suspend/hibernate your shell report "[1] Stopped" and the like for all your jobs stopped by ^Z et al, as if you had freshly fg'd and ^Z'd them. It also removes from the freezer the arcane special case treatment for ptrace'd tasks, which relied on intimate knowledge of ptrace internals. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-04 07:59:54 -08:00
Oleg Nesterov	821c7de719	exit_notify: fix kill_orphaned_pgrp() usage with mt exit 1. exit_notify() always calls kill_orphaned_pgrp(). This is wrong, we should do this only when the whole process exits. 2. exit_notify() uses "current" as "ignored_task", obviously wrong. Use ->group_leader instead. Test case: void hup(int sig) { printf("HUP received\n"); } void tfunc(void arg) { sleep(2); printf("sub-thread exited\n"); return NULL; } int main(int argc, char *argv[]) { if (!fork()) { signal(SIGHUP, hup); kill(getpid(), SIGSTOP); exit(0); } pthread_t thr; pthread_create(&thr, NULL, tfunc, NULL); sleep(1); printf("main thread exited\n"); syscall(__NR_exit, 0); return 0; } output: main thread exited HUP received Hangup With this patch the output is: main thread exited sub-thread exited HUP received Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-03 14:53:16 -08:00
Oleg Nesterov	05e83df624	will_become_orphaned_pgrp: partially fix insufficient ->exit_state check p->exit_state != 0 doesn't mean this process is dead, it may have sub-threads. Change the code to use "p->exit_state && thread_group_empty(p)" instead. Without this patch, ^Z doesn't deliver SIGTSTP to the foreground process if the main thread has exited. However, the new check is not perfect either. There is a window when exit_notify() drops tasklist and before release_task(). Suppose that the last (non-leader) thread exits. This means that entire group exits, but thread_group_empty() is not true yet. As Eric pointed out, is_global_init() is wrong as well, but I did not dare to do other changes. Just for the record, has_stopped_jobs() is absolutely wrong too. But we can't fix it now, we should first fix SIGNAL_STOP_STOPPED issues. Even with this patch ^Z doesn't play well with the dead main thread. The task is stopped correctly but do_wait(WSTOPPED) won't see it. This is another unrelated issue, will be (hopefully) fixed separately. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-03 14:53:16 -08:00
Oleg Nesterov	f49ee505b1	introduce kill_orphaned_pgrp() helper Factor out the common code in reparent_thread() and exit_notify(). No functional changes. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-03-03 14:53:16 -08:00
Steve Grubb	8d07a67cfa	[PATCH] drop EOE records from printk Hi, While we are looking at the printk issue, I see that its printk'ing the EOE (end of event) records which is really not something that we need in syslog. Its really intended for the realtime audit event stream handled by the audit daemon. So, lets avoid printk'ing that record type. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-03-01 07:16:06 -05:00
Eric Paris	b29ee87e9b	[RFC] AUDIT: do not panic when printk loses messages On the latest kernels if one was to load about 15 rules, set the failure state to panic, and then run service auditd stop the kernel will panic. This is because auditd stops, then the script deletes all of the rules. These deletions are sent as audit messages out of the printk kernel interface which is already known to be lossy. These will overun the default kernel rate limiting (10 really fast messages) and will call audit_panic(). The same effect can happen if a slew of avc's come through while auditd is stopped. This can be fixed a number of ways but this patch fixes the problem by just not panicing if auditd is not running. We know printk is lossy and if the user chooses to set the failure mode to panic and tries to use printk we can't make any promises no matter how hard we try, so why try? At least in this way we continue to get lost message accounting and will eventually know that things went bad. The other change is to add a new call to audit_log_lost() if auditd disappears. We already pulled the skb off the queue and couldn't send it so that message is lost. At least this way we will account for the last message and panic if the machine is configured to panic. This code path should only be run if auditd dies for unforeseen reasons. If auditd closes correctly audit_pid will get set to 0 and we won't walk this code path. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-03-01 07:16:06 -05:00
Paul Moore	422b03cf75	[PATCH] Audit: Fix the format type for size_t variables Fix the following compiler warning by using "%zu" as defined in C99. CC kernel/auditsc.o kernel/auditsc.c: In function 'audit_log_single_execve_arg': kernel/auditsc.c:1074: warning: format '%ld' expects type 'long int', but argument 4 has type 'size_t' Signed-off-by: Paul Moore <paul.moore@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-03-01 07:16:06 -05:00
Paul E. McKenney	c9e71002aa	rcupreempt: remove never-migrates assumption from rcu_process_callbacks() This patch fixes a potentially invalid access to a per-CPU variable in rcu_process_callbacks(). This per-CPU access needs to be done in such a way as to guarantee that the code using it cannot move to some other CPU before all uses of the value accessed have completed. Even though this code is currently only invoked from softirq context, which currrently cannot migrate to some other CPU, life would be better if this code did not silently make such an assumption. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-29 20:21:13 +01:00
Paul E. McKenney	ae778869ae	rcupreempt: fix hibernate/resume in presence of PREEMPT_RCU and hotplug This fixes a oops encountered when doing hibernate/resume in presence of PREEMPT_RCU. The problem was that the code failed to disable preemption when accessing a per-CPU variable. This is OK when called from code that already has preemption disabled, but such is not the case from the suspend/resume code path. Reported-by: Dave Young <hidave.darkstar@gmail.com> Tested-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-29 20:21:13 +01:00
Dmitry Adamushko	7be2a03e31	softlockup: fix task state setting kthread_stop() can be called when a 'watchdog' thread is executing after kthread_should_stop() but before set_task_state(TASK_INTERRUPTIBLE). Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-29 18:46:53 +01:00
Steven Rostedt	2232c2d8e0	rcu: add support for dynamic ticks and preempt rcu The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The idle CPU will not progress the RCU through its grace period and a synchronize_rcu my get stuck. Without this patch I have a box that will not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine with this patch. This patch comes from the -rt kernel where it has been tested for several months. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-29 18:46:50 +01:00
Linus Torvalds	3fca96eed1	Merge branch 'v2.6.25-rc3-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep * 'v2.6.25-rc3-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep: Subject: lockdep: include all lock classes in all_lock_classes lockdep: increase MAX_LOCK_DEPTH	2008-02-26 07:49:15 -08:00
Linus Torvalds	37c00b84d0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: latencytop: change /proc task_struct access method latencytop: fix memory leak on latency proc file latencytop: fix kernel panic while reading latency proc file sched: add declaration of sched_tail to sched.h sched: fix signedness warnings in sched.c sched: clean up __pick_last_entity() a bit sched: remove duplicate code from sched_fair.c sched: make early bootup sched_clock() use safer	2008-02-26 07:43:31 -08:00
Tejun Heo	cf3680b90c	printk: fix possible printk overrun printk recursion detection prepends message to printk_buf and offsets printk_buf when actual message is printed but it forgets to trim buffer length accordingly. This can result in overrun in extreme cases. Fix it. [ mingo@elte.hu: bug was introduced by me via: commit `32a7600668` Author: Ingo Molnar <mingo@elte.hu> Date: Fri Jan 25 21:07:58 2008 +0100 printk: make printk more robust by not allowing recursion ] Signed-off-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-26 07:42:37 -08:00
Dale Farnsworth	1481197b50	Subject: lockdep: include all lock classes in all_lock_classes Add each lock class to the all_lock_classes list when it is first registered. Previously, lock classes were added to all_lock_classes when the lock class was first used. Since one of the uses of the list is to find unused locks, this didn't work well. Signed-off-by: Dale Farnsworth <dale@farnsworth.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 23:03:02 +01:00
Harvey Harrison	67ca7bde2e	sched: fix signedness warnings in sched.c Unsigned long values are always assigned to switch_count, make it unsigned long. kernel/sched.c:3897:15: warning: incorrect type in assignment (different signedness) kernel/sched.c:3897:15: expected long switch_count kernel/sched.c:3897:15: got unsigned long <noident> kernel/sched.c:3921:16: warning: incorrect type in assignment (different signedness) kernel/sched.c:3921:16: expected long switch_count kernel/sched.c:3921:16: got unsigned long <noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 16:34:17 +01:00
Ingo Molnar	7eee3e677d	sched: clean up __pick_last_entity() a bit Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 16:34:17 +01:00
Balbir Singh	70eee74b70	sched: remove duplicate code from sched_fair.c pick_task_entity() duplicates existing code. This functionality can be easily obtained using rb_last(). Avoid code duplication by using rb_last(). Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 16:34:17 +01:00
Ingo Molnar	6892b75e60	sched: make early bootup sched_clock() use safer do not call sched_clock() too early. Not only might rq->idle not be set up - but pure per-cpu data might not be accessible either. this solves an ia64 early bootup hang with CONFIG_PRINTK_TIME=y. Tested-by: Tony Luck <tony.luck@gmail.com> Acked-by: Tony Luck <tony.luck@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 16:34:16 +01:00
Linus Torvalds	04e2f1741d	Add memory barrier semantics to wake_up() & co Oleg Nesterov and others have pointed out that on some architectures, the traditional sequence of set_current_state(TASK_INTERRUPTIBLE); if (CONDITION) return; schedule(); is racy wrt another CPU doing CONDITION = 1; wake_up_process(p); because while set_current_state() has a memory barrier separating setting of the TASK_INTERRUPTIBLE state from reading of the CONDITION variable, there is no such memory barrier on the wakeup side. Now, wake_up_process() does actually take a spinlock before it reads and sets the task state on the waking side, and on x86 (and many other architectures) that spinlock is in fact equivalent to a memory barrier, but that is not generally guaranteed. The write that sets CONDITION could move into the critical region protected by the runqueue spinlock. However, adding a smp_wmb() to before the spinlock should now order the writing of CONDITION wrt the lock itself, which in turn is ordered wrt the accesses within the spinlock (which includes the reading of the old state). This should thus close the race (which probably has never been seen in practice, but since smp_wmb() is a no-op on x86, it's not like this will make anything worse either on the most common architecture where the spinlock already gave the required protection). Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 18:05:03 -08:00
Li Zefan	bc231d2a04	cgroup: remove dead code in cgroup_get_rootdir() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:13:25 -08:00
Li Zefan	68db38f153	cgroup: remove duplicate code in find_css_set() The list head res->tasks gets initialized twice in find_css_set(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:13:25 -08:00
Li Zefan	8d53d55d27	cgroup: fix subsys bitops Cgroup uses unsigned long for subsys bitops, not unsigned long long. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:13:24 -08:00
Li Zefan	f777073848	cgroup: fix memory leak in cgroup_get_sb() opts.release_agent is not kfree()ed in all necessary places. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:13:24 -08:00
Li Zefan	a043e3b2c6	cgroup: fix comments fix: - comments about need_forkexit_callback - comments about release agent - typo and comment style, etc. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:13:24 -08:00
Srinivasa Ds	4362758279	kprobes: refuse kprobe insertion on add/sub_preempt_counter() Kprobes makes use of preempt_disable(),preempt_enable_noresched() and these functions inturn call add/sub_preempt_count(). So we need to refuse user from inserting probe in to these functions. This patch disallows user from probing add/sub_preempt_count(). Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:13:24 -08:00
Thomas Gleixner	a0c1e9073e	futex: runtime enable pi and robust functionality Not all architectures implement futex_atomic_cmpxchg_inatomic(). The default implementation returns -ENOSYS, which is currently not handled inside of the futex guts. Futex PI calls and robust list exits with a held futex result in an endless loop in the futex code on architectures which have no support. Fixing up every place where futex_atomic_cmpxchg_inatomic() is called would add a fair amount of extra if/else constructs to the already complex code. It is also not possible to disable the robust feature before user space tries to register robust lists. Compile time disabling is not a good idea either, as there are already architectures with runtime detection of futex_atomic_cmpxchg_inatomic support. Detect the functionality at runtime instead by calling cmpxchg_futex_value_locked() with a NULL pointer from the futex initialization code. This is guaranteed to fail, but the call of futex_atomic_cmpxchg_inatomic() happens with pagefaults disabled. On architectures, which use the asm-generic implementation or have a runtime CPU feature detection, a -ENOSYS return value disables the PI/robust features. On architectures with a working implementation the call returns -EFAULT and the PI/robust features are enabled. The relevant syscalls return -ENOSYS and the robust list exit code is blocked, when the detection fails. Fixes http://lkml.org/lkml/2008/2/11/149 Originally reported by: Lennart Buytenhek Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Lennert Buytenhek <buytenh@wantstofly.org> Cc: Riku Voipio <riku.voipio@movial.fi> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:12:15 -08:00
Thomas Gleixner	3e4ab747ef	futex: fix init order When the futex init code fails to initialize the futex pseudo file system it returns early without initializing the hash queues. Should the boot succeed then a futex syscall which tries to enqueue a waiter on the hashqueue will crash due to the unitilialized plist heads. Initialize the hash queues before the filesystem. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Lennert Buytenhek <buytenh@wantstofly.org> Cc: Riku Voipio <riku.voipio@movial.fi> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:12:15 -08:00
Harvey Harrison	de4fc64f0f	markers: fix sparse warnings in markers.c char can be unsigned kernel/marker.c:64:20: error: dubious one-bit signed bitfield kernel/marker.c:65:14: error: dubious one-bit signed bitfield Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:12:14 -08:00
Rafael J. Wysocki	3a2d5b7001	PM: Introduce PM_EVENT_HIBERNATE callback state During the last step of hibernation in the "platform" mode (with the help of ACPI) we use the suspend code, including the devices' ->suspend() methods, to prepare the system for entering the ACPI S4 system sleep state. But at least for some devices the operations performed by the ->suspend() callback in that case must be different from its operations during regular suspend. For this reason, introduce the new PM event type PM_EVENT_HIBERNATE and pass it to the device drivers' ->suspend() methods during the last phase of hibernation, so that they can distinguish this case and handle it as appropriate. Modify the drivers that handle PM_EVENT_SUSPEND in a special way and need to handle PM_EVENT_HIBERNATE in the same way. These changes are necessary to fix a hibernation regression related to the i915 driver (ref. http://lkml.org/lkml/2008/2/22/488). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Tested-by: Jeff Chua <jeff.chua.linux@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 10:40:04 -08:00
Linus Torvalds	20f8d2a493	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (26 commits) PM: Make suspend_device() static PCI ACPI: Fix comment describing acpi_pci_choose_state Hibernation: Handle DEBUG_PAGEALLOC on x86 ACPI: fix build warning ACPI: TSC breaks atkbd suspend ACPI: remove is_processor_present prototype acer-wmi: Add DMI match for mail LED on Acer TravelMate 4200 series ACPI: sparse fix, replace macro with static function ACPI: thinkpad-acpi: add tablet-mode reporting ACPI: thinkpad-acpi: minor hotkey_radio_sw fixes ACPI: thinkpad-acpi: improve thinkpad-acpi input device documentation ACPI: thinkpad-acpi: issue input events for tablet swivel events ACPI: thinkpad-acpi: make the video output feature optional ACPI: thinkpad-acpi: synchronize input device switches ACPI: thinkpad-acpi: always track input device open/close ACPI: thinkpad-acpi: trivial fix to documentation ACPI: thinkpad-acpi: trivial fix to module_desc typo intel_menlo: extract return values using PTR_ERR ACPI video: check for error from thermal_cooling_device_register ACPI thermal: extract return values using PTR_ERR ...	2008-02-21 16:33:19 -08:00
Kay Sievers	120fc3d77a	modules: do not try to add sysfs attributes if !CONFIG_SYSFS Thanks to Alexey for the testing and the fix of the fix. Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-02-21 15:27:08 -08:00
Rafael J. Wysocki	8a235efad5	Hibernation: Handle DEBUG_PAGEALLOC on x86 Make hibernation work with CONFIG_DEBUG_PAGEALLOC set on x86, by checking if the pages to be copied are marked as present in the kernel mapping and temporarily marking them as present if that's not the case. No functional modifications are introduced if CONFIG_DEBUG_PAGEALLOC is unset. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Len Brown <len.brown@intel.com>	2008-02-21 02:15:28 -05:00
Thomas Gleixner	89d694b9db	genirq: do not leave interupts enabled on free_irq The default_disable() function was changed in commit: `76d2160147` genirq: do not mask interrupts by default It removed the mask function in favour of the default delayed interrupt disabling. Unfortunately this also broke the shutdown in free_irq() when the last handler is removed from the interrupt for those architectures which rely on the default implementations. Now we can end up with a enabled interrupt line after the last handler was removed, which can result in spurious interrupts. Fix this by adding a default_shutdown function, which is only installed, when the irqchip implementation does provide neither a shutdown nor a disable function. [@stable: affected versions: .21 - .24 ] Pointed-out-by: Michael Hennerich <Michael.Hennerich@analog.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: stable@kernel.org Tested-by: Michael Hennerich <Michael.Hennerich@analog.com>	2008-02-19 10:43:58 +01:00
S.Caglar Onur	188fd89d53	genirq: spurious.c: use time_* macros The functions time_before, time_before_eq, time_after, and time_after_eq are more robust for comparing jiffies against other values. So following patch implements usage of the time_after() macro, defined at linux/jiffies.h, which deals with wrapping correctly Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-02-19 10:43:58 +01:00
Eric Paris	b0abcfc146	Audit: use == not = in if statements Clearly this was supposed to be an == not an = in the if statement. This patch also causes us to stop processing execve args once we have failed rather than continuing to loop on failure over and over and over. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-18 18:46:28 -08:00
Pavel Machek	db4315d6f5	timer_list: print relative expiry time signed Relative expiry time can get negative, so it should be signed. Signed-off-by: Pavel Machek <Pavel@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-02-17 17:29:38 +01:00
Linus Torvalds	c24ce1d887	Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: hrtimer: catch expired CLOCK_REALTIME timers early hrtimer: check relative timeouts for overflow	2008-02-14 21:27:52 -08:00
Jan Blunck	cf28b4863f	d_path: Make d_path() use a struct path d_path() is used on a <dentry,vfsmount> pair. Lets use a struct path to reflect this. [akpm@linux-foundation.org: fix build in mm/memory.c] Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Bryan Wu <bryan.wu@analog.com> Acked-by: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:17:09 -08:00
Jan Blunck	44707fdf59	d_path: Use struct path in struct avc_audit_data audit_log_d_path() is a d_path() wrapper that is used by the audit code. To use a struct path in audit_log_d_path() I need to embed it into struct avc_audit_data. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:17:08 -08:00
Jan Blunck	6ac08c39a1	Use struct path in fs_struct * Use struct path in fs_struct. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:13:33 -08:00
Jan Blunck	1d957f9bf8	Introduce path_put() * Add path_put() functions for releasing a reference to the dentry and vfsmount of a struct path in the right order * Switch from path_release(nd) to path_put(&nd->path) * Rename dput_path() to path_put_conditional() [akpm@linux-foundation.org: fix cifs] Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Cc: <linux-fsdevel@vger.kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:13:33 -08:00
Jan Blunck	4ac9137858	Embed a struct path into struct nameidata instead of nd->{dentry,mnt} This is the central patch of a cleanup series. In most cases there is no good reason why someone would want to use a dentry for itself. This series reflects that fact and embeds a struct path into nameidata. Together with the other patches of this series - it enforced the correct order of getting/releasing the reference count on <dentry,vfsmount> pairs - it prepares the VFS for stacking support since it is essential to have a struct path in every place where the stack can be traversed - it reduces the overall code size: without patch series: text data bss dec hex filename 5321639 858418 715768 6895825 6938d1 vmlinux with patch series: text data bss dec hex filename 5320026 858418 715768 `6894212` 693284 vmlinux This patch: Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix cifs] [akpm@linux-foundation.org: fix smack] Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:13:33 -08:00
Jan Blunck	db74ece990	Dont touch fs_struct in usermodehelper This test seems to be unnecessary since we always have rootfs mounted before calling a usermodehelper. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-14 21:13:32 -08:00
Thomas Gleixner	63070a79ba	hrtimer: catch expired CLOCK_REALTIME timers early A CLOCK_REALTIME timer, which has an absolute expiry time less than the clock realtime offset calls with a negative delta into the clock events code and triggers the WARN_ON() there. This is a false positive and needs to be prevented. Check the result of timer->expires - timer->base->offset right away and return -ETIME right away. Thanks to Frans Pop, who reported the problem and tested the fixes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Frans Pop <elendil@planet.nl>	2008-02-14 22:08:30 +01:00
Thomas Gleixner	5a7780e725	hrtimer: check relative timeouts for overflow Various user space callers ask for relative timeouts. While we fixed that overflow issue in hrtimer_start(), the sites which convert relative user space values to absolute timeouts themself were uncovered. Instead of putting overflow checks into each place add a function which does the sanity checking and convert all affected callers to use it. Thanks to Frans Pop, who reported the problem and tested the fixes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Tested-by: Frans Pop <elendil@planet.nl>	2008-02-14 22:08:30 +01:00
Mathieu Desnoyers	fb40bd78b0	Linux Kernel Markers: support multiple probes RCU style multiple probes support for the Linux Kernel Markers. Common case (one probe) is still fast and does not require dynamic allocation or a supplementary pointer dereference on the fast path. - Move preempt disable from the marker site to the callback. Since we now have an internal callback, move the preempt disable/enable to the callback instead of the marker site. Since the callback change is done asynchronously (passing from a handler that supports arguments to a handler that does not setup the arguments is no arguments are passed), we can safely update it even if it is outside the preempt disable section. - Move probe arm to probe connection. Now, a connected probe is automatically armed. Remove MARK_MAX_FORMAT_LEN, unused. This patch modifies the Linux Kernel Markers API : it removes the probe "arm/disarm" and changes the probe function prototype : it now expects a va_list * instead of a "...". If we want to have more than one probe connected to a marker at a given time (LTTng, or blktrace, ssytemtap) then we need this patch. Without it, connecting a second probe handler to a marker will fail. It allow us, for instance, to do interesting combinations : Do standard tracing with LTTng and, eventually, to compute statistics with SystemTAP, or to have a special trigger on an event that would call a systemtap script which would stop flight recorder tracing. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Mason <mmlnx@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: David Smith <dsmith@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-13 16:21:20 -08:00

... 12 13 14 15 16 ...

4783 Commits