linux

Author	SHA1	Message	Date
Linus Torvalds	b14ea38e13	Merge branch 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: NOHZ: fix thinko in the timer restart code path	2008-10-23 09:57:16 -07:00
Linus Torvalds	f2e4bd2b37	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcupdate: fix bug of rcu_barrier*() profiling: fix !procfs build Fixed trivial conflicts in 'include/linux/profile.h'	2008-10-23 09:38:55 -07:00
Linus Torvalds	133e887f90	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: disable the hrtick for now sched: revert back to per-rq vruntime sched: fair scheduler should not resched rt tasks sched: optimize group load balancer sched: minor fast-path overhead reduction sched: fix the wrong mask_len, cleanup sched: kill unused scheduler decl. sched: fix the wrong mask_len sched: only update rq->clock while holding rq->lock	2008-10-23 09:37:16 -07:00
Linus Torvalds	e82cff752f	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: NULL struct irq_desc's member 'name' in dynamic_irq_cleanup() genirq: fix off by one and coding style genirq: fix set_irq_type() when recording trigger type	2008-10-23 09:36:55 -07:00
Ingo Molnar	66b0de3569	ftrace: fix build failure fix: kernel/trace/ftrace.c: In function 'ftrace_release': kernel/trace/ftrace.c:271: error: implicit declaration of function 'ftrace_release_hash' release_hash is not needed without dftraced. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-23 16:11:03 +02:00
Alexey Dobriyan	b5aadf7f14	proc: move /proc/schedstat boilerplate to kernel/sched_stats.h Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-23 18:06:12 +04:00
Alexey Dobriyan	3b5d5c6b0c	proc: move /proc/modules boilerplate to kernel/module.c Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-23 18:03:13 +04:00
Steven Rostedt	08f5ac906d	ftrace: remove ftrace hash The ftrace hash was used by the ftrace_daemon code. The record ip function would place the calling address (ip) into the hash. The daemon would later read the hash and modify that code. The hash complicates the code. This patch removes it. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-23 16:00:24 +02:00
Steven Rostedt	4d296c2432	ftrace: remove mcount set The arch dependent function ftrace_mcount_set was only used by the daemon start up code. This patch removes it. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-23 16:00:23 +02:00
Steven Rostedt	cb7be3b2fc	ftrace: remove daemon The ftrace daemon is complex and error prone. This patch strips it out of the code. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-23 16:00:22 +02:00
Steven Rostedt	6912896e99	ftrace: add ftrace warn on to disable ftrace Add ftrace warn on to disable ftrace as well as report a warning. [ Thanks to Andrew Morton for suggesting using the WARN_ON return value ] Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-23 16:00:20 +02:00
Steven Rostedt	81adbdc029	ftrace: only have ftrace_kill atomic When an anomaly is detected, we need a way to completely disable ftrace. Right now we have two functions: ftrace_kill and ftrace_kill_atomic. The ftrace_kill tries to do it in a "nice" way by converting everything back to a nop. The "nice" way is dangerous itself, so this patch removes it and only has the "atomic" version, which is all that is needed. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-23 16:00:19 +02:00
Steven Rostedt	593eb8a2d6	ftrace: return error on failed modified text. Have the ftrace_modify_code return error values: -EFAULT on error of reading the address -EINVAL if what is read does not match what it expected -EPERM if the write fails to update after a successful match. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-23 16:00:13 +02:00
Alexey Dobriyan	6e62775ece	proc: move /proc/execdomains to kernel/exec_domain.c Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-23 14:30:41 +04:00
Al Viro	98bc993f99	[PATCH] get rid of nameidata in audit_tree Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-23 05:12:53 -04:00
Steven Rostedt	6ae2a0765a	ring-buffer: fix free page The pages of a buffer was originally pointing to the page struct, it now points to the page address. The freeing of the page still uses the page frame free "__free_page" instead of the correct free_page to the address. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-22 17:02:09 +02:00
Li Zefan	4ce72a2c06	sched: add CONFIG_SMP consistency a patch from Henrik Austad did this: >> Do not declare select_task_rq as part of sched_class when CONFIG_SMP is >> not set. Peter observed: > While a proper cleanup, could you do it by re-arranging the methods so > as to not create an additional ifdef? Do not declare select_task_rq and some other methods as part of sched_class when CONFIG_SMP is not set. Also gather those methods to avoid CONFIG_SMP mess. Idea-by: Henrik Austad <henrik.austad@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Henrik Austad <henrik@austad.us> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-22 10:01:52 +02:00
Thomas Gleixner	268a3dcfea	Merge branch 'timers/range-hrtimers' into v28-range-hrtimers-for-linus-v2 Conflicts: kernel/time/tick-sched.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-22 09:48:06 +02:00
Peter Zijlstra	17d80fd07d	tracing: create tracers menu We seem to have plenty tracers, lets create a menu and not clutter the already cluttered debug menu more. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frédéric Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-22 09:08:56 +02:00
Ingo Molnar	debfcaf93e	Merge branch 'tracing/ftrace' into tracing/urgent	2008-10-22 09:08:14 +02:00
roel kluin	3786fc710c	irq: make variable static This variable is only used in the source file, so make it static. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-22 07:37:17 +02:00
Heiko Carstens	8163bcac77	stop_machine: fix error code handling on multiple cpus Using \|= for updating a value which might be updated on several cpus concurrently will not always work since we need to make sure that the update happens atomically. To fix this just use a write if the called function returns an error code on a cpu. We end up writing the error code of an arbitrary cpu if multiple ones fail but that should be sufficient. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:26 +11:00
Heiko Carstens	c9583e55fa	stop_machine: use workqueues instead of kernel threads Convert stop_machine to a workqueue based approach. Instead of using kernel threads for stop_machine we now use a an rt workqueue to synchronize all cpus. This has the advantage that all needed per cpu threads are already created when stop_machine gets called. And therefore a call to stop_machine won't fail anymore. This is needed for s390 which needs a mechanism to synchronize all cpus without allocating any memory. As Rusty pointed out free_module() needs a non-failing stop_machine interface as well. As a side effect the stop_machine code gets simplified. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:26 +11:00
Heiko Carstens	0d557dc97f	workqueue: introduce create_rt_workqueue create_rt_workqueue will create a real time prioritized workqueue. This is needed for the conversion of stop_machine to a workqueue based implementation. This patch adds yet another parameter to __create_workqueue_key to tell it that we want an rt workqueue. However it looks like we rather should have something like "int type" instead of singlethread, freezable and rt. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu>	2008-10-22 10:00:25 +11:00
Rusty Russell	f44dd164f3	Make panic= and panic_on_oops into core_params This allows them to be examined and set after boot, plus means they actually give errors if they are misused (eg. panic=yes). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:25 +11:00
Rusty Russell	67e67ceaac	core_param() for genuinely core kernel parameters There are a lot of one-liner uses of __setup() in the kernel: they're cumbersome and not queryable (definitely not settable) via /sys. Yet it's ugly to simplify them to module_param(), because by default that inserts a prefix of the module name (usually filename). So, introduce a "core_param". The parameter gets no prefix, but appears in /sys/module/kernel/parameters/ (if non-zero perms arg). I thought about using the name "core", but that's more common than "kernel". And if you create a module called "kernel", you will die a horrible death. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:23 +11:00
Rusty Russell	9b473de872	param: Fix duplicate module prefixes Instead of insisting each new module_param sysfs entry is unique, handle the case where it already exists (for builtin modules). The current code assumes that all identical prefixes are together in the section: true for normal uses, but not necessarily so if someone overrides MODULE_PARAM_PREFIX. More importantly, it's not true with the new "core_param()" code which uses "kernel" as a prefix. This simplifies the caller for the builtin case, at a slight loss of efficiency (we do the lookup every time to see if the directory exists). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-22 10:00:23 +11:00
Rusty Russell	730b69d225	module: check kernel param length at compile time, not runtime The kparam code tries to handle over-length parameter prefixes at runtime. Not only would I bet this has never been tested, it's not clear that truncating names is a good idea either. So let's check at compile time. We need to move the #define to moduleparam.h to do this, though. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:22 +11:00
Andi Kleen	d72b37513c	Remove stop_machine during module load v2 Remove stop_machine during module load v2 module loading currently does a stop_machine on each module load to insert the module into the global module lists. Especially on larger systems this can be quite expensive. It does that to handle concurrent lock lessmodule list readers like kallsyms. I don't think stop_machine() is actually needed to insert something into a list though. There are no concurrent writers because the module mutex is taken. And the RCU list functions know how to insert a node into a list with the right memory ordering so that concurrent readers don't go off into the wood. So remove the stop_machine for the module list insert and just do a list_add_rcu() instead. Module removal will still do a stop_machine of course, it needs that for other reasons. v2: Revised readers based on Paul's comments. All readers that only rely on disabled preemption need to be changed to list_for_each_rcu(). Done that. The others are ok because they have the modules mutex. Also added a possible missing preempt disable for print_modules(). [cc Paul McKenney for review. It's not RCU, but quite similar.] Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:22 +11:00
Rusty Russell	5e458cc0f4	module: simplify load_module. Linus' recent catch of stack overflow in load_module lead me to look at the code. A couple of helpers to get a section address and get objects from a section can help clean things up a little. (And in case you're wondering, the stack size also dropped from 328 to 284 bytes). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-22 10:00:15 +11:00
Thomas Gleixner	c4bd822e7b	NOHZ: fix thinko in the timer restart code path commit `fb02fbc14d` (NOHZ: restart tick device from irq_enter()) solves the problem of stale jiffies when long running softirqs happen in a long idle sleep period, but it has a major thinko in it: When the interrupt which came in _is_ the timer interrupt which should expire ts->sched_timer then we cancel and rearm the timer _before_ it gets expired in hrtimer_interrupt() to the next period. That means the call back function is not called. This game can go on for ever :( Prevent this by making sure to only rearm the timer when the expiry time is more than one tick_period away. Otherwise keep it running as it is either already expired or will expiry at the right point to update jiffies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Venkatesch Pallipadi <venkatesh.pallipadi@intel.com>	2008-10-21 20:53:24 +02:00
Lai Jiangshan	5f86515158	rcupdate: fix bug of rcu_barrier() current rcu_barrier_bh() is like this: void rcu_barrier_bh(void) { BUG_ON(in_interrupt()); / Take cpucontrol mutex to protect against CPU hotplug / mutex_lock(&rcu_barrier_mutex); init_completion(&rcu_barrier_completion); atomic_set(&rcu_barrier_cpu_count, 0); / * The queueing of callbacks in all CPUs must be atomic with * respect to RCU, otherwise one CPU may queue a callback, * wait for a grace period, decrement barrier count and call * complete(), while other CPUs have not yet queued anything. * So, we need to make sure that grace periods cannot complete * until all the callbacks are queued. / rcu_read_lock(); on_each_cpu(rcu_barrier_func, (void )RCU_BARRIER_BH, 1); rcu_read_unlock(); wait_for_completion(&rcu_barrier_completion); mutex_unlock(&rcu_barrier_mutex); } The inconsistency of the code and the comments show a bug here. rcu_read_lock() cannot make sure that "grace periods for RCU_BH cannot complete until all the callbacks are queued". it only make sure that race periods for RCU cannot complete until all the callbacks are queued. so we must use rcu_read_lock_bh() for rcu_barrier_bh(). like this: void rcu_barrier_bh(void) { ...... rcu_read_lock_bh(); on_each_cpu(rcu_barrier_func, (void *)RCU_BARRIER_BH, 1); rcu_read_unlock_bh(); ...... } and also rcu_barrier() rcu_barrier_sched() are implemented like this. it will bring a lot of duplicate code. My patch uses another way to fix this bug, please see the comment of my patch. Thank Paul E. McKenney for he rewrote the comment. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-21 15:59:53 +02:00
Dean Nelson	b6f3b7803a	genirq: NULL struct irq_desc's member 'name' in dynamic_irq_cleanup() If the member 'name' of the irq_desc structure happens to point to a character string that is resident within a kernel module, problems ensue if that module is rmmod'd (at which time dynamic_irq_cleanup() is called) and then later show_interrupts() is called by someone. It is also not a good thing if the character string resided in kmalloc'd space that has been kfree'd (after having called dynamic_irq_cleanup()). dynamic_irq_cleanup() fails to NULL the 'name' member and show_interrupts() references it on a few architectures (like h8300, sh and x86). Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-21 15:59:21 +02:00
Al Viro	572c489215	[PATCH] sanitize blkdev_get() and friends * get rid of fake struct file/struct dentry in __blkdev_get() * merge __blkdev_get() and do_open() * get rid of flags argument of blkdev_get() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:49:06 -04:00
Al Viro	c2dd0dae18	[PATCH] propagate mode through swsusp_close() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:49:02 -04:00
Al Viro	9a1c354276	[PATCH] pass fmode_t to blkdev_put() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-10-21 07:48:58 -04:00
Chris Friesen	0b3682ba33	genirq: fix set_irq_type() when recording trigger type Impact: fix boot hang on a G5 In set_irq_type() we want to pass the type rather than the current interrupt state. Signed-off-by: Chris Friesen <cfriesen@nortel.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-21 10:10:08 +02:00
Luck, Tony	5f41b8cdc6	kexec: fix crash_save_vmcoreinfo_init build problem This fixes kernel/kexec.c: In function 'crash_save_vmcoreinfo_init': kernel/kexec.c:1374: error: 'vmlist' undeclared (first use in this function) kernel/kexec.c:1374: error: (Each undeclared identifier is reported only once kernel/kexec.c:1374: error: for each function it appears in.) kernel/kexec.c:1410: error: invalid use of undefined type 'struct vm_struct' make[1]: *** [kernel/kexec.o] Error 1 Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 15:28:50 -07:00
Linus Torvalds	92b29b86fe	Merge branch 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (131 commits) tracing/fastboot: improve help text tracing/stacktrace: improve help text tracing/fastboot: fix initcalls disposition in bootgraph.pl tracing/fastboot: fix bootgraph.pl initcall name regexp tracing/fastboot: fix issues and improve output of bootgraph.pl tracepoints: synchronize unregister static inline tracepoints: tracepoint_synchronize_unregister() ftrace: make ftrace_test_p6nop disassembler-friendly markers: fix synchronize marker unregister static inline tracing/fastboot: add better resolution to initcall debug/tracing trace: add build-time check to avoid overrunning hex buffer ftrace: fix hex output mode of ftrace tracing/fastboot: fix initcalls disposition in bootgraph.pl tracing/fastboot: fix printk format typo in boot tracer ftrace: return an error when setting a nonexistent tracer ftrace: make some tracers reentrant ring-buffer: make reentrant ring-buffer: move page indexes into page headers tracing/fastboot: only trace non-module initcalls ftrace: move pc counter in irqtrace ... Manually fix conflicts: - init/main.c: initcall tracing - kernel/module.c: verbose level vs tracepoints - scripts/bootgraph.pl: fallout from cherry-picking commits.	2008-10-20 13:35:07 -07:00
Linus Torvalds	9301975ec2	Merge branch 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu and x86/uv. The sparseirq branch is just preliminary groundwork: no sparse IRQs are actually implemented by this tree anymore - just the new APIs are added while keeping the old way intact as well (the new APIs map 1:1 to irq_desc[]). The 'real' sparse IRQ support will then be a relatively small patch ontop of this - with a v2.6.29 merge target. * 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits) genirq: improve include files intr_remapping: fix typo io_apic: make irq_mis_count available on 64-bit too genirq: fix name space collisions of nr_irqs in arch/* genirq: fix name space collision of nr_irqs in autoprobe.c genirq: use iterators for irq_desc loops proc: fixup irq iterator genirq: add reverse iterator for irq_desc x86: move ack_bad_irq() to irq.c x86: unify show_interrupts() and proc helpers x86: cleanup show_interrupts genirq: cleanup the sparseirq modifications genirq: remove artifacts from sparseirq removal genirq: revert dynarray genirq: remove irq_to_desc_alloc genirq: remove sparse irq code genirq: use inline function for irq_to_desc genirq: consolidate nr_irqs and for_each_irq_desc() x86: remove sparse irq from Kconfig genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n ...	2008-10-20 13:23:01 -07:00
Linus Torvalds	99ebcf8285	Merge branch 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'v28-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits) fix documentation of sysrq-q really Fix documentation of sysrq-q timer_list: add base address to clock base timer_list: print cpu number of clockevents device timer_list: print real timer address NOHZ: restart tick device from irq_enter() NOHZ: split tick_nohz_restart_sched_tick() NOHZ: unify the nohz function calls in irq_enter() timers: fix itimer/many thread hang, fix timers: fix itimer/many thread hang, v3 ntp: improve adjtimex frequency rounding timekeeping: fix rounding problem during clock update ntp: let update_persistent_clock() sleep hrtimer: reorder struct hrtimer to save 8 bytes on 64bit builds posix-timers: lock_timer: make it readable posix-timers: lock_timer: kill the bogus ->it_id check posix-timers: kill ->it_sigev_signo and ->it_sigev_value posix-timers: sys_timer_create: cleanup the error handling posix-timers: move the initialization of timer->sigq from send to create path posix-timers: sys_timer_create: simplify and s/tasklist/rcu/ ... Fix trivial conflicts due to sysrq-q description clahes in Documentation/sysrq.txt and drivers/char/sysrq.c	2008-10-20 13:19:56 -07:00
Harvey Harrison	f07767fd0f	byteorder: remove direct includes of linux/byteorder/swab[b].h A consolidated implementation will provide this generically through asm/byteorder, remove direct includes to avoid breakage when the changeover to the new implementation occurs. This hunk was lost from commit `1d8cca44b6` ("byteorder: provide swabb.h generically in asm/byteorder.h") Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 12:51:53 -07:00
Steven Rostedt	81520a1b06	ftrace: stack tracer only record when on stack The stack trace API does not record if the stack is not on the current task's stack. That is, if the stack is the interrupt stack or NMI stack, the output does not show. Also, the size of those stacks are not consistent with the size of the thread stack, this makes the calculation of the stack size usually bogus. This all confuses the stack tracer. I unfortunately do not have time to fix all these problems, but this patch does record the worst stack when the stack pointer is on the tasks stack (instead of bogus numbers). The patch simply returns if the stack pointer is not on the task's stack. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:31:37 +02:00
Steven Rostedt	3ce83aea86	ftrace: rename the ftrace tracer to function To avoid further confusion between the ftrace infrastructure and the function tracer. This patch renames the "ftrace" function tracer to "function". Now in available_tracers, instead of "ftrace" there will be "function". This makes more sense, since people will not know exactly what the "ftrace" tracer does. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:27:04 +02:00
Steven Rostedt	606576ce81	ftrace: rename FTRACE to FUNCTION_TRACER Due to confusion between the ftrace infrastructure and the gcc profiling tracer "ftrace", this patch renames the config options from FTRACE to FUNCTION_TRACER. The other two names that are offspring from FTRACE DYNAMIC_FTRACE and FTRACE_MCOUNT_RECORD will stay the same. This patch was generated mostly by script, and partially by hand. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:27:03 +02:00
Steven Rostedt	c2db8054c1	ftrace: fix depends A lot of tracers have HAVE_FTRACE as a dependent config where it really should not. The HAVE_FTRACE is a misnomer (soon to be fixed) and describes if the architecture has the function tracer (mcount) implemented. The ftrace infrastructure is implemented in all archs. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:27:02 +02:00
Steven Rostedt	bd95b88d9e	ftrace: release functions from hash The x86 architecture uses a static recording of mcount caller locations and is not affected by this patch. For architectures still using the dynamic ftrace daemon, this patch is critical. It removes the race between the recording of a function that calls mcount, the unloading of a module, and the ftrace daemon updating the call sites. This patch adds the releasing of the hash functions that the daemon uses to update the mcount call sites. When a module is unloaded, not only are the replaced call site table update, but now so is the hash recorded functions that the ftrace daemon will use. Again, architectures that implement MCOUNT_RECORD are not affected by this (which currently only x86 has). Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 18:27:01 +02:00
Harvey Harrison	1a651a00e2	byteorder: remove direct includes of linux/byteorder/swab[b].h A consolidated implementation will provide this generically through asm/byteorder, remove direct includes to avoid breakage when the changeover to the new implementation occurs. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:40 -07:00
Ken'ichi Ohmichi	acd99dbf54	kdump: add vmlist.addr to vmcoreinfo for x86 vmalloc translation. Add the symbols 'vmlist' and offset 'vm_struct.addr' to the vmcoreinfo[1] data for i386 vmalloc translation. makedumpfile[2] needs VMALLOC_START value for distinguishing a vmalloc address or not, because it should choose suitable translation method. If applying this patch, makedumpfile will be able to take VMALLOC_START value from 'vmlist.addr'. vmcoreinfo[1]: The vmcoreinfo data has the minimum debugging information only for dump filtering. makedumpfile[2] uses it to distinguish unnecessary pages and creates a small dumpfile. makedumpfile[2]: dump filtering command https://sourceforge.net/projects/makedumpfile/ Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:40 -07:00
Oleg Nesterov	293adee601	kthread_bind: use wait_task_inactive(TASK_UNINTERRUPTIBLE) Now that wait_task_inactive(task, state) checks task->state == state, we can simplify the code and make this debugging check more robust. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:39 -07:00
Adrian Bunk	b747c8c102	make ptrace_untrace() static ptrace_untrace() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:39 -07:00
Lai Jiangshan	30e8e13603	cpuset: use seq_mask_ to print masks 1) seq_file excepts that m->count == m->size when it's buf is full, so current code will causes bugs when buf is overflow. 2) There is not too good that cpuset accesses struct seq_file's fields directly. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:39 -07:00
Rakib Mullick	40b6a76237	cpuset.c: remove extra variable Remove the use of int cpus_nonempty variable from 'update_flag' function. Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:39 -07:00
Paul Menage	cc31edceee	cgroups: convert tasks file to use a seq_file with shared pid array Rather than pre-generating the entire text for the "tasks" file each time the file is opened, we instead just generate/update the array of process ids and use a seq_file to report these to userspace. All open file handles on the same "tasks" file can share a pid array, which may be updated any time that no thread is actively reading the array. By sharing the array, the potential for userspace to DoS the system by opening many handles on the same "tasks" file is removed. [Based on a patch by Lai Jiangshan, extended to use seq_file] Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:38 -07:00
Lai Jiangshan	146aa1bd05	cgroups: fix probable race with put_css_set[_taskexit] and find_css_set put_css_set_taskexit may be called when find_css_set is called on other cpu. And the race will occur: put_css_set_taskexit side find_css_set side \| atomic_dec_and_test(&kref->refcount) \| /* kref->refcount = 0 / \| .................................................................... \| read_lock(&css_set_lock) \| find_existing_css_set \| get_css_set \| read_unlock(&css_set_lock); .................................................................... __release_css_set \| .................................................................... \| / use a released css_set */ \| [put_css_set is the same. But in the current code, all put_css_set are put into cgroup mutex critical region as the same as find_css_set.] [akpm@linux-foundation.org: repair comments] [menage@google.com: eliminate race in css_set refcounting] Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:38 -07:00
WANG Cong	c3b9f5afc7	kernel/configs.c: remove useless comments These comments are useless, remove them. Signed-off-by: WANG Cong <wangcong@zeuux.org> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	1aece34833	container freezer: rename check_if_frozen() check_if_frozen() sounds like it should return something when in fact it's just updating the freezer state. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	81dcf33c2a	container freezer: make freezer state names less generic Rename cgroup freezer states to be less generic to avoid any name collisions while also better describing what each state is. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	957a4eeaf4	container freezer: prevent frozen tasks or cgroups from changing Don't let frozen tasks or cgroups change. This means frozen tasks can't leave their current cgroup for another cgroup. It also means that tasks cannot be added to or removed from a cgroup in the FROZEN state. We enforce these rules by checking for frozen tasks and cgroups in the can_attach() function. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	5a06915c6d	container freezer: skip frozen cgroups during power management resume When a system is resumed after a suspend, it will also unfreeze frozen cgroups. This patchs modifies the resume sequence to skip the tasks which are part of a frozen control group. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	dc52ddc0e6	container freezer: implement freezer cgroup subsystem This patch implements a new freezer subsystem in the control groups framework. It provides a way to stop and resume execution of all tasks in a cgroup by writing in the cgroup filesystem. The freezer subsystem in the container filesystem defines a file named freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in the cgroup. Reading will return the current state. * Examples of usage : # mkdir /containers/freezer # mount -t cgroup -ofreezer freezer /containers # mkdir /containers/0 # echo $some_pid > /containers/0/tasks to get status of the freezer subsystem : # cat /containers/0/freezer.state RUNNING to freeze all tasks in the container : # echo FROZEN > /containers/0/freezer.state # cat /containers/0/freezer.state FREEZING # cat /containers/0/freezer.state FROZEN to unfreeze all tasks in the container : # echo RUNNING > /containers/0/freezer.state # cat /containers/0/freezer.state RUNNING This is the basic mechanism which should do the right thing for user space task in a simple scenario. It's important to note that freezing can be incomplete. In that case we return EBUSY. This means that some tasks in the cgroup are busy doing something that prevents us from completely freezing the cgroup at this time. After EBUSY, the cgroup will remain partially frozen -- reflected by freezer.state reporting "FREEZING" when read. The state will remain "FREEZING" until one of these things happens: 1) Userspace cancels the freezing operation by writing "RUNNING" to the freezer.state file 2) Userspace retries the freezing operation by writing "FROZEN" to the freezer.state file (writing "FREEZING" is not legal and returns EIO) 3) The tasks that blocked the cgroup from entering the "FROZEN" state disappear from the cgroup's set of tasks. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: export thaw_process] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:34 -07:00
Matt Helsley	8174f1503f	container freezer: make refrigerator always available Now that the TIF_FREEZE flag is available in all architectures, extract the refrigerator() and freeze_task() from kernel/power/process.c and make it available to all. The refrigerator() can now be used in a control group subsystem implementing a control group freezer. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Matt Helsley <matthltc@us.ibm.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:33 -07:00
Lee Schermerhorn	af936a1606	vmscan: unevictable LRU scan sysctl This patch adds a function to scan individual or all zones' unevictable lists and move any pages that have become evictable onto the respective zone's inactive list, where shrink_inactive_list() will deal with them. Adds sysctl to scan all nodes, and per node attributes to individual nodes' zones. Kosaki: If evictable page found in unevictable lru when write /proc/sys/vm/scan_unevictable_pages, print filename and file offset of these pages. [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error] [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-20 08:52:31 -07:00
Peter Zijlstra	c7e78cff6b	lockstat: contend with points We currently only provide points that have to wait on contention, also lists the points we have to wait for. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 15:43:10 +02:00
Ingo Molnar	0c4b83da58	sched: disable the hrtick for now David Miller reported that hrtick update overhead has tripled the wakeup overhead on Sparc64. That is too much - disable the HRTICK feature for now by default, until a faster implementation is found. Reported-by: David Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 14:27:43 +02:00
Peter Zijlstra	f9c0b0950d	sched: revert back to per-rq vruntime Vatsa rightly points out that having the runqueue weight in the vruntime calculations can cause unfairness in the face of task joins/leaves. Suppose: dv = dt * rw / w Then take 10 tasks t_n, each of similar weight. If the first will run 1 then its vruntime will increase by 10. Now, if the next 8 tasks leave after having run their 1, then the last task will get a vruntime increase of 2 after having run 1. Which will leave us with 2 tasks of equal weight and equal runtime, of which one will not be scheduled for 8/2=4 units of time. Ergo, we cannot do that and must use: dv = dt / w. This means we cannot have a global vruntime based on effective priority, but must instead go back to the vruntime per rq model we started out with. This patch was lightly tested by doing starting while loops on each nice level and observing their execution time, and a simple group scenario of 1:2:3 pinned to a single cpu. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 14:05:04 +02:00
Peter Zijlstra	a4c2f00f5c	sched: fair scheduler should not resched rt tasks With use of ftrace Steven noticed that some RT tasks got rescheduled due to sched_fair interaction. What happens is that we reprogram the hrtick from enqueue/dequeue_fair_task() because that can change nr_running, and thus a current tasks ideal runtime. However, its possible the current task isn't a fair_sched_class task, and thus doesn't have a hrtick set to change. Fix this by wrapping those hrtick_start_fair() calls in a hrtick_update() function, which will check for the right conditions. Reported-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 14:05:03 +02:00
Peter Zijlstra	ffda12a17a	sched: optimize group load balancer I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus in the sched_domain. This hurts. We need the rq-locks whenever we change the weight of the per-cpu group sched entities. To allevate this a little, only change the weight when the new weight is at least shares_thresh away from the old value. This avoids the rq-lock for the top level entries, since those will never be re-weighted, and fuzzes the lower level entries a little to gain performance in semi-stable situations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-20 14:05:02 +02:00
Thomas Gleixner	643bdf68f9	hrtimers: simplify hrtimer_peek_ahead_timers() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 13:38:11 +02:00
Thomas Gleixner	e1dd7bc585	hrtimers: fix docbook comments hrtimer_start() and hrtimer_start_range_ns() handle relative and absolute timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 13:33:36 +02:00
Thomas Gleixner	c465a76af6	Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus	2008-10-20 13:14:06 +02:00
Thomas Gleixner	870e2a2845	timer_list: add base address to clock base The base address of a (per cpu) clock base is a useful debug info. Add it and bump the version number of timer_lists. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 11:51:30 +02:00
Thomas Gleixner	c5b77a3d3a	timer_list: print cpu number of clockevents device The per cpu clock events device output of timer_list lacks an association of the device to the cpu which is annoying when looking at the output of /proc/timer_list from a 128 way system. Add the CPU number info and mark the broadcast device in the device list printout. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 11:51:30 +02:00
Thomas Gleixner	e67ef25a35	timer_list: print real timer address The current timer_list output prints the address of the on stack copy of the active hrtimer instead of the hrtimer itself. Print the address of the real timer instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-20 11:51:30 +02:00
Ingo Molnar	3e10e879a8	Merge branch 'linus' into tracing-v28-for-linus-v3 Conflicts: init/main.c kernel/module.c scripts/bootgraph.pl	2008-10-19 19:04:47 +02:00
Linus Torvalds	26e9a39777	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (25 commits) staging: at76_usb wireless driver Staging: workaround build system bug Staging: Lindent sxg.c Staging: SLICOSS: Call pci_release_regions at driver exit Staging: SLICOSS: Fix remaining type names Staging: SLICOSS: Fix warnings due to static usage Staging: SLICOSS: lots of checkpatch fixes Staging: go7007 v4l fixes Staging: Fix gcc warnings in sxg Staging: add echo cancelation module Staging: add wlan-ng prism2 usb driver Staging: add w35und wifi driver Staging: USB/IP: add host driver Staging: USB/IP: add client driver Staging: USB/IP: add common functions needed Staging: add the go7007 video driver Staging: add me4000 pci data collection driver Staging: add me4000 firmware files Staging: add sxg network driver Staging: add Alacritech slicoss network driver ... Fixed up conflicts due to taint flags changes and MAINTAINERS cleanup in MAINTAINERS, include/linux/kernel.h and kernel/panic.c.	2008-10-17 09:50:12 -07:00
Arjan van de Ven	651dab4264	Merge commit 'linus/master' into merge-linus Conflicts: arch/x86/kvm/i8254.c	2008-10-17 09:20:26 -07:00
Thomas Gleixner	fb02fbc14d	NOHZ: restart tick device from irq_enter() We did not restart the tick device from irq_enter() to avoid double reprogramming and extra events in the return immediate to idle case. But long lasting softirqs can lead to a situation where jiffies become stale: idle() tick stopped (reprogrammed to next pending timer) halt() interrupt jiffies updated from irq_enter() interrupt handler softirq function 1 runs 20ms softirq function 2 arms a 10ms timer with a stale jiffies value jiffies updated from irq_exit() timer wheel has now an already expired timer (the one added in function 2) timer fires and timer softirq runs This was discovered when debugging a timer problem which happend only when the ath5k driver is active. The debugging proved that there is a softirq function running for more than 20ms, which is a bug by itself. To solve this we restart the tick timer right from irq_enter(), but do not go through the other functions which are necessary to return from idle when need_resched() is set. Reported-by: Elias Oltmanns <eo@nebensachen.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Elias Oltmanns <eo@nebensachen.de>	2008-10-17 18:13:38 +02:00
Thomas Gleixner	c34bec5a44	NOHZ: split tick_nohz_restart_sched_tick() Split out the clock event device reprogramming. Preparatory patch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-17 18:13:38 +02:00
Thomas Gleixner	719254faa1	NOHZ: unify the nohz function calls in irq_enter() We have two separate nohz function calls in irq_enter() for no good reason. Just call a single NOHZ function from irq_enter() and call the bits in the tick code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-17 18:13:38 +02:00
Mike Galbraith	b0aa51b999	sched: minor fast-path overhead reduction Greetings, 103638d added a bit of avoidable overhead to the fast-path. Use sysctl_sched_min_granularity instead of sched_slice() to restrict buddy wakeups. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-17 15:36:58 +02:00
Peter Zijlstra	b968905292	sched: fix the wrong mask_len, cleanup Clean up the division in show_schedstat(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-17 13:05:22 +02:00
Miao Xie	c851c8676b	sched: fix the wrong mask_len If NR_CPUS isn't a multiple of 32, we get a truncated string of sched domains by catting /proc/schedstat. This is caused by the wrong mask_len. This patch fixes it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-17 12:26:33 +02:00
Ingo Molnar	0f1f6dec95	Merge branch 'linus' into sched/urgent	2008-10-17 12:25:43 +02:00
David S. Miller	54514a70ad	softirq: Add support for triggering softirq work on softirqs. This is basically a genericization of Jens Axboe's block layer remote softirq changes. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2008-10-17 08:46:56 +02:00
Linus Torvalds	8cde1ad668	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched_clock: prevent scd->clock from moving backwards	2008-10-16 15:38:48 -07:00
Linus Torvalds	1c95e1b690	Fix kernel/softirq.c printk format warning properly This fixes the broken `77af7e3403` ("softirq, warning fix: correct a format to avoid a warning") fix correctly. The type of a pointer subtraction is not "int", nor is it "long". It can be either (or something else). It's "ptrdiff_t", and the printk format for it is "%td". Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 15:32:46 -07:00
Linus Torvalds	e533b22705	Merge branch 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: do_generic_file_read: s/EINTR/EIO/ if lock_page_killable() fails softirq, warning fix: correct a format to avoid a warning softirqs, debug: preemption check x86, pci-hotplug, calgary / rio: fix EBDA ioremap() IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes softlockup: Documentation/sysctl/kernel.txt: fix softlockup_thresh description dmi scan: warn about too early calls to dmi_check_system() generic: redefine resource_size_t as phys_addr_t generic: make PFN_PHYS explicitly return phys_addr_t generic: add phys_addr_t for holding physical addresses softirq: allocate less vectors IO resources: fix/remove printk printk: robustify printk, update comment printk: robustify printk, fix #2 printk: robustify printk, fix printk: robustify printk Fixed up conflicts in: arch/powerpc/include/asm/types.h arch/powerpc/platforms/Kconfig.cputype manually.	2008-10-16 15:17:40 -07:00
Linus Torvalds	c813b4e16e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (46 commits) UIO: Fix mapping of logical and virtual memory UIO: add automata sercos3 pci card support UIO: Change driver name of uio_pdrv UIO: Add alignment warnings for uio-mem Driver core: add bus_sort_breadthfirst() function NET: convert the phy_device file to use bus_find_device_by_name kobject: Cleanup kobject_rename and !CONFIG_SYSFS kobject: Fix kobject_rename and !CONFIG_SYSFS sysfs: Make dir and name args to sysfs_notify() const platform: add new device registration helper sysfs: use ilookup5() instead of ilookup5_nowait() PNP: create device attributes via default device attributes Driver core: make bus_find_device_by_name() more robust usb: turn dev_warn+WARN_ON combos into dev_WARN debug: use dev_WARN() rather than WARN_ON() in device_pm_add() debug: Introduce a dev_WARN() function sysfs: fix deadlock device model: Do a quickcheck for driver binding before doing an expensive check Driver core: Fix cleanup in device_create_vargs(). Driver core: Clarify device cleanup. ...	2008-10-16 12:40:26 -07:00
Linus Torvalds	c8d8a2321f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: remove CONFIG_KMOD in comment after #endif remove CONFIG_KMOD from fs remove CONFIG_KMOD from drivers Manually fix conflict due to include cleanups in drivers/md/md.c	2008-10-16 12:38:34 -07:00
Adrian Bunk	2b252c5411	make kprobes.c:kretprobe_table_lock() static Make the needlessly global kretprobe_table_lock() static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:52 -07:00
Bjorn Helgaas	c26ec88ea8	resources: tidy __request_region() No functional change. Just return NULL for kzalloc failure immediately, rather than wrapping the whole function body in the body of an "if". Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:51 -07:00
Thomas Petazzoni	ebf3f09c63	Configure out AIO support This patchs adds the CONFIG_AIO option which allows to remove support for asynchronous I/O operations, that are not necessarly used by applications, particularly on embedded devices. As this is a size-reduction option, it depends on CONFIG_EMBEDDED. It allows to save ~7 kilobytes of kernel code/data: text data bss dec hex filename 1115067 119180 217088 1451335 162547 vmlinux 1108025 119048 217088 1444161 160941 vmlinux.new -7042 -132 0 -7174 -1C06 +/- This patch has been originally written by Matt Mackall <mpm@selenic.com>, and is part of the Linux Tiny project. [randy.dunlap@oracle.com: build fix] Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:51 -07:00
Alexey Dobriyan	f221e726bf	sysctl: simplify ->strategy name and nlen parameters passed to ->strategy hook are unused, remove them. In general ->strategy hook should know what it's doing, and don't do something tricky for which, say, pointer to original userspace array may be needed (name). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> [ networking bits ] Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Howells <dhowells@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:47 -07:00
Christoph Hellwig	b418da16dd	compat: generic compat get/settimeofday Nothing arch specific in get/settimeofday. The details of the timeval conversion varied a little from arch to arch, but all with the same results. Also add an extern declaration for sys_tz to linux/time.h because externs in .c files are fowned upon. I'll kill the externs in various other files in a sparate patch. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> [ sparc bits ] Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:33 -07:00
Andi Kleen	20036fdcaf	Add kerneldoc documentation for new printk format extensions Add documentation in kerneldoc for new printk format extensions This patch documents the new %pS/%pF options in printk in kernel doc. Hope I didn't miss any other extension. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Randy Dunlap	6d5cd6effe	taint: fix kernel-doc Move print_tainted() kernel-doc to avoid the following error: Error(/var/linsrc/mmotm-2008-1002-1617//kernel/panic.c:155): cannot understand prototype: 'struct tnt ' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Francois Cami	e1f8e87449	Remove Andrew Morton's old email accounts People can use the real name an an index into MAINTAINERS to find the current email address. Signed-off-by: Francois Cami <francois.cami@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
WANG Cong	7968b3d9a6	kernel/kallsyms.c: fix double return Commit `6dd06c9fbe` ("module: make module_address_lookup safe") introduced double returns in the function kallsyms_lookup(), it's weird. The second one should be removed. Signed-off-by: WANG Cong <wangcong@zeuux.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:32 -07:00
Andrew Morton	9679e4dd62	kernel/sys.c: improve code generation utsname() is quite expensive to calculate. Cache it in a local. text data bss dec hex filename before: 11136 720 16 11872 2e60 kernel/sys.o after: 11096 720 16 11832 2e38 kernel/sys.o Acked-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Vegard Nossum	8798881507	utsname: completely overwrite prior information On sethostname() and setdomainname(), previous information may be retained if it was longer than than the new hostname/domainname. This can be demonstrated trivially by calling sethostname() first with a long name, then with a short name, and then calling uname() to retrieve the full buffer that contains the hostname (and possibly parts of the old hostname), one just has to look past the terminating zero. I don't know if we should really care that much (hence the RFC); the only scenarios I can possibly think of is administrator putting something sensitive in the hostname (or domain name) by accident, and changing it back will not undo the mistake entirely, though it's not like we can recover gracefully from "rm -rf /" either... The other scenario is namespaces (CLONE_NEWUTS) where some information may be unintentionally "inherited" from the previous namespace (a program wants to hide the original name and does clone + sethostname, but some information is still left). I think the patch may be defended on grounds of the principle of least surprise. But I am not adamant :-) (I guess the question now is whether userspace should be able to write embedded NULs into the buffer or not...) At least the observation has been made and the patch has been presented. Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Dave Hansen	22b8ce9470	profiling: dynamically enable readprofile at runtime Way too often, I have a machine that exhibits some kind of crappy behavior. The CPU looks wedged in the kernel or it is spending way too much system time and I wonder what is responsible. I try to run readprofile. But, of course, Ubuntu doesn't enable it by default. Dang! The reason we boot-time enable it is that it takes a big bufffer that we generally can only bootmem alloc. But, does it hurt to at least try and runtime-alloc it? To use: echo 2 > /sys/kernel/profile Then run readprofile like normal. This should fix the compile issue with allmodconfig. I've compile-tested on a bunch more configs now including a few more architectures. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Adam Tkac	0c2d64fb6c	rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY When a process wants to set the limit of open files to RLIM_INFINITY it gets EPERM even if it has CAP_SYS_RESOURCE capability. For example, BIND does: ... #elif defined(NR_OPEN) && defined(__linux__) /* * Some Linux kernels don't accept RLIM_INFINIT; the maximum * possible value is the NR_OPEN defined in linux/fs.h. */ if (resource == isc_resource_openfiles && rlim_value == RLIM_INFINITY) { rl.rlim_cur = rl.rlim_max = NR_OPEN; unixresult = setrlimit(unixresource, &rl); if (unixresult == 0) return (ISC_R_SUCCESS); } #elif ... If we allow setting RLIMIT_NOFILE to RLIM_INFINITY we increase portability - you don't have to check if OS is linux and then use different schema for limits. The spec says "Specifying RLIM_INFINITY as any resource limit value on a successful call to setrlimit() shall inhibit enforcement of that resource limit." and we're presently not doing that. Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Andi Kleen	25ddbb18aa	Make the taint flags reliable It's somewhat unlikely that it happens, but right now a race window between interrupts or machine checks or oopses could corrupt the tainted bitmap because it is modified in a non atomic fashion. Convert the taint variable to an unsigned long and use only atomic bit operations on it. Unfortunately this means the intvec sysctl functions cannot be used on it anymore. It turned out the taint sysctl handler could actually be simplified a bit (since it only increases capabilities) so this patch actually removes code. [akpm@linux-foundation.org: remove unneeded include] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Jan Beulich	9ba16087d9	Kconfig: eliminate "def_bool n" constructs Using "def_bool n" is pointless, simply using bool here appears more appropriate. Further, retaining such options that don't have a prompt and aren't selected by anything seems also at least questionable. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Tejun Heo	a25d644fc0	wait: kill is_sync_wait() is_sync_wait() is used to distinguish between sync and async waits. Basically sync waits are the ones initialized with init_waitqueue_entry() and async ones with init_waitqueue_func_entry(). The sync/async distinction is used only in prepare_to_wait[_exclusive]() and its only function is to skip setting the current task state if the wait is async. This has a few problems. * No one uses it. None of func_entry users use prepare_to_wait() functions, so the code path never gets executed. * The distinction is bogus. Maybe back when func_entry is used only by aio but it's now also used by epoll and in future possibly by 9p and poll/select. * Taking @state as argument and ignoring it silenly depending on how @wait is initialized is just a bad error-prone API. * It prevents func_entry waits from using wait->private for no good reason. This patch kills is_sync_wait() and the associated code paths from prepare_to_wait[_exclusive](). As there was no user of these code paths, this patch doesn't cause any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:31 -07:00
Adrian Bunk	d9f3216b47	kernel/dma.c: remove a CVS keyword Remove a CVS keyword that wasn't updated for a long time from a comment. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:30 -07:00
Rafael J. Wysocki	1bfcf1304e	pm: rework disabling of user mode helpers during suspend/hibernation We currently use a PM notifier to disable user mode helpers before suspend and hibernation and to re-enable them during resume. However, this is not an ideal solution, because if any drivers want to upload firmware into memory before suspend, they have to use a PM notifier for this purpose and there is no guarantee that the ordering of PM notifiers will be as expected (ie. the notifier that disables user mode helpers has to be run after the driver's notifier used for uploading the firmware). For this reason, it seems better to move the disabling and enabling of user mode helpers to separate functions that will be called by the PM core as necessary. [akpm@linux-foundation.org: remove unneeded ifdefs] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Alan Stern <stern@rowland.harvard.edu> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:29 -07:00
Balbir Singh	9363b9f23c	memrlimit: cgroup mm owner callback changes to add task info This patch adds an additional field to the mm_owner callbacks. This field is required to get to the mm that changed. Hold mmap_sem in write mode before calling the mm_owner_changed callback [hugh@veritas.com: fix mmap_sem deadlock] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-16 11:21:28 -07:00
Jason Baron	346e15beb5	driver core: basic infrastructure for per-module dynamic debug messages Base infrastructure to enable per-module debug messages. I've introduced CONFIG_DYNAMIC_PRINTK_DEBUG, which when enabled centralizes control of debugging statements on a per-module basis in one /proc file, currently, <debugfs>/dynamic_printk/modules. When, CONFIG_DYNAMIC_PRINTK_DEBUG, is not set, debugging statements can still be enabled as before, often by defining 'DEBUG' for the proper compilation unit. Thus, this patch set has no affect when CONFIG_DYNAMIC_PRINTK_DEBUG is not set. The infrastructure currently ties into all pr_debug() and dev_dbg() calls. That is, if CONFIG_DYNAMIC_PRINTK_DEBUG is set, all pr_debug() and dev_dbg() calls can be dynamically enabled/disabled on a per-module basis. Future plans include extending this functionality to subsystems, that define their own debug levels and flags. Usage: Dynamic debugging is controlled by the debugfs file, <debugfs>/dynamic_printk/modules. This file contains a list of the modules that can be enabled. The format of the file is as follows: <module_name> <enabled=0/1> . . . <module_name> : Name of the module in which the debug call resides <enabled=0/1> : whether the messages are enabled or not For example: snd_hda_intel enabled=0 fixup enabled=1 driver enabled=0 Enable a module: $echo "set enabled=1 <module_name>" > dynamic_printk/modules Disable a module: $echo "set enabled=0 <module_name>" > dynamic_printk/modules Enable all modules: $echo "set enabled=1 all" > dynamic_printk/modules Disable all modules: $echo "set enabled=0 all" > dynamic_printk/modules Finally, passing "dynamic_printk" at the command line enables debugging for all modules. This mode can be turned off via the above disable command. [gkh: minor cleanups and tweaks to make the build work quietly] Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:47 -07:00
Alexey Dobriyan	e94320939f	modules: fix module "notes" kobject leak Fix "notes" kobject leak It happens every rmmod if KALLSYMS=y and SYSFS=y. # modprobe foo kobject: 'foo' (ffffffffa00743d0): kobject_add_internal: parent: 'module', set: 'module' kobject: 'holders' (ffff88017e7c5770): kobject_add_internal: parent: 'foo', set: '<NULL>' kobject: 'foo' (ffffffffa00743d0): kobject_uevent_env kobject: 'foo' (ffffffffa00743d0): fill_kobj_path: path = '/module/foo' kobject: 'notes' (ffff88017fa9b668): kobject_add_internal: parent: 'foo', set: '<NULL>' ^^^^^ # rmmod foo kobject: 'holders' (ffff88017e7c5770): kobject_cleanup kobject: 'holders' (ffff88017e7c5770): auto cleanup kobject_del kobject: 'holders' (ffff88017e7c5770): calling ktype release kobject: (ffff88017e7c5770): dynamic_kobj_release kobject: 'holders': free name kobject: 'foo' (ffffffffa00743d0): kobject_cleanup kobject: 'foo' (ffffffffa00743d0): does not have a release() function, it is broken and must be fixed. kobject: 'foo' (ffffffffa00743d0): auto cleanup 'remove' event kobject: 'foo' (ffffffffa00743d0): kobject_uevent_env kobject: 'foo' (ffffffffa00743d0): fill_kobj_path: path = '/module/foo' kobject: 'foo' (ffffffffa00743d0): auto cleanup kobject_del kobject: 'foo': free name [whooops] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-16 09:24:41 -07:00
Rusty Russell	118a9069f0	module: remove CONFIG_KMOD in comment after #endif Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-10-17 02:38:37 +11:00
Thomas Gleixner	63d659d556	genirq: fix name space collision of nr_irqs in autoprobe.c probe_irq_off() is disfunctional as the local nr_irqs is referenced instead of the global one for the for_each_irq_desc() iterator. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:30 +02:00
Thomas Gleixner	10e580842e	genirq: use iterators for irq_desc loops Use for_each_irq_desc[_reverse] for all the iteration loops. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:30 +02:00
Thomas Gleixner	d3c60047bd	genirq: cleanup the sparseirq modifications Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:29 +02:00
Thomas Gleixner	d6c88a507e	genirq: revert dynarray Revert the dynarray changes. They need more thought and polishing. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:15 +02:00
Thomas Gleixner	ee32c97322	genirq: remove irq_to_desc_alloc Remove the leftover of sparseirqs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:15 +02:00
Thomas Gleixner	2cc21ef843	genirq: remove sparse irq code This code is not ready, but we need to rip it out instead of rebasing as we would lose the APIC/IO_APIC unification otherwise. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:15 +02:00
Thomas Gleixner	c6b7674f32	genirq: use inline function for irq_to_desc For the non sparse irq case an inline function is perfectly fine. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-16 16:53:14 +02:00
Yinghai Lu	aac3f2b6f6	x86: fix typo in irq_desc array when SPARSE_IRQ is not used, should still use irq_desc->lock Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:11 +02:00
Andrew Morton	2976fe2012	fix warning: "x86: sparse_irq needs spin_lock in allocations" caused by commit a532e19680ada3b8579b81e67e76d3ebd19c340f Author: Yinghai Lu <yhlu.kernel@gmail.com> Date: Wed Aug 20 20:46:25 2008 -0700 x86: sparse_irq needs spin_lock in allocations Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:11 +02:00
Yinghai Lu	9d98598d2f	sparseirq: remove some debug print out Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:11 +02:00
Yinghai Lu	e00585bb7f	irq: fix irqpoll && sparseirq Steven Noonan reported a boot hang when using irqpoll and CONFIG_HAVE_SPARSE_IRQ=y. The irqpoll loop needs to be updated to not iterate from 1 to nr_irqs but to iterate via for_each_irq_desc(). (in the former case desc can be NULL which crashes the box) Reported-by: Steven Noonan <steven@uplinklabs.net> Tested-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:10 +02:00
venkatesh.pallipadi@intel.com	932775a4ab	x86: HPET_MSI change IRQ affinity in process context when it is disabled Change the IRQ affinity in the process context when the IRQ is disabled. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:07 +02:00
Dean Nelson	21056830c4	irq: set_irq_chip() has redundant call to irq_to_desc() Extraneous call to irq_to_desc(). Signed-off-by: Dean Nelson <dcn@sgi.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:07 +02:00
Yinghai Lu	8c464a4b23	sparseirq: move kstat_irqs from kstat to irq_desc - fix fix non-sparseirq architectures. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:53:04 +02:00
Yinghai Lu	e89eb43863	x86: sparse_irq needs spin_lock in allocations Suresh Siddha noticed that we should have a spinlock around it. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:59 +02:00
Ingo Molnar	e955b5398b	sparseirq: fix lockdep -tip testing found this lockdep splat: [ 0.000000] Initializing CPU#0 [ 0.000000] found new irq_desc for irq 0 [ 0.000000] INFO: trying to register non-static key. [ 0.000000] the code is fine but needs lockdep annotation. [ 0.000000] turning off the locking correctness validator. [ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.27-rc3-tip-00191-g98ccb89-dirty #1 [ 0.000000] [<c0153c22>] register_lock_class+0x3d2/0x400 [ 0.000000] [<c0104d87>] ? mcount_call+0x5/0xa [ 0.000000] [<c0154f3a>] __lock_acquire+0x22a/0x5d0 [ 0.000000] [<c0104d87>] ? mcount_call+0x5/0xa [ 0.000000] [<c0155351>] lock_acquire+0x71/0xa0 [ 0.000000] [<c016d61f>] ? set_irq_chip+0x3f/0x90 [ 0.000000] [<c070f148>] _spin_lock_irqsave+0x58/0x90 [ 0.000000] [<c016d61f>] ? set_irq_chip+0x3f/0x90 [ 0.000000] [<c016d61f>] set_irq_chip+0x3f/0x90 [ 0.000000] [<c016d7e0>] ? handle_level_irq+0x0/0xe0 [ 0.000000] [<c016da1a>] set_irq_chip_and_handler_name+0x1a/0x40 [ 0.000000] [<c0a396c1>] init_ISA_irqs+0x51/0xa0 [ 0.000000] [<c0a4a365>] pre_intr_init_hook+0x25/0x30 [ 0.000000] [<c0a39723>] native_init_IRQ+0x13/0x370 [ 0.000000] [<c015569c>] ? lock_release+0xcc/0x1d0 [ 0.000000] [<c0104d87>] ? mcount_call+0x5/0xa [ 0.000000] [<c070dc22>] ? __mutex_unlock_slowpath+0x92/0x110 [ 0.000000] [<c070dcad>] ? mutex_unlock+0xd/0x10 [ 0.000000] [<c0135f62>] ? cpu_maps_update_done+0x12/0x20 [ 0.000000] [<c06c6743>] ? register_cpu_notifier+0x23/0x30 [ 0.000000] [<c011e8ae>] init_IRQ+0xe/0x10 [ 0.000000] [<c0a357a5>] start_kernel+0x1c5/0x340 [ 0.000000] [<c0a35280>] ? unknown_bootoption+0x0/0x210 [ 0.000000] [<c0a3506b>] i386_start_kernel+0x6b/0x80 [ 0.000000] ======================= [ 0.000000] found new irq_desc for irq 1 [ 0.000000] found new irq_desc for irq 2 [ 0.000000] found new irq_desc for irq 3 this: static void init_one_irq_desc(struct irq_desc *desc) { memcpy(desc, &irq_desc_init, sizeof(struct irq_desc)); #ifdef CONFIG_TRACE_IRQFLAGS lockdep_set_class(&desc->lock, &irq_desc_lock_class); #endif } should be unconditional. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:54 +02:00
Yinghai Lu	8b8e8c1bf7	x86: remove irqbalance in kernel for 32 bit This has been deprecated for years, the user space irqbalanced utility works better with numa, has configurable policies, etc... Signed-off-by: Yinghai Lu <yhlu.kernel@gmai.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:52 +02:00
Yinghai Lu	67fb283e14	irq: separate sparse_irqs from sparse_irqs_free so later don't need compare with -1U Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:51 +02:00
Yinghai Lu	cb5bc83225	x86_64: rename irq_desc/irq_desc_alloc change names: irq_desc() ==> irq_desc_alloc __irq_desc() ==> irq_desc Also split a few of the uses in lowlevel x86 code. v2: need to check if desc is null in smp_irq_move_cleanup Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:51 +02:00
Yinghai Lu	46926b67fc	generic: add irq_desc in function in parameter So we could remove some duplicated calling to irq_desc v2: make sure irq_desc in init/main.c is not used without generic_hardirqs Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:50 +02:00
Yinghai Lu	7d94f7ca40	irq: remove >= nr_irqs checking with config_have_sparse_irq remove irq limit checks - nr_irqs is dynamic and we expand anytime. v2: fix checking about result irq_cfg_without_new, so could use msi again v3: use irq_desc_without_new to check irq is valid Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:50 +02:00
Yinghai Lu	2c6927a38f	irq: replace loop with nr_irqs with for_each_irq_desc There are a handful of loops that go from 0 to nr_irqs and use get_irq_desc() on them. These would allocate all the irq_desc entries, regardless of the need for them. Use the smarter for_each_irq_desc() iterator that will only iterate over the present ones. v2: make sure arch without GENERIC_HARDIRQS work too Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:33 +02:00
Yinghai Lu	9059d8fa4a	irq: add irq_desc_without_new add an irq_desc accessor that will not allocate any sparse entry but returns failure if there's no entry present. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:32 +02:00
Yinghai Lu	7f95ec9e4c	x86: move kstat_irqs from kstat to irq_desc based on Eric's patch ... together mold it with dyn_array for irq_desc, will allcate kstat_irqs for nr_irq_desc alltogether if needed. -- at that point nr_cpus is known already. v2: make sure system without generic_hardirqs works they don't have irq_desc v3: fix merging v4: [mingo@elte.hu] fix typo [ mingo@elte.hu ] irq: build fix fix: arch/x86/xen/spinlock.c: In function 'xen_spin_lock_slow': arch/x86/xen/spinlock.c:90: error: 'struct kernel_stat' has no member named 'irqs' Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:32 +02:00
Ingo Molnar	3bf52a4df3	irq: sparse irqs, fix IRQ auto-probe crash fix: [ 10.631533] calling yenta_socket_init+0x0/0x20 [ 10.631533] Yenta: CardBus bridge found at 0000:15:00.0 [17aa:2012] [ 10.631533] Yenta: Using INTVAL to route CSC interrupts to PCI [ 10.631533] Yenta: Routing CardBus interrupts to PCI [ 10.631533] Yenta TI: socket 0000:15:00.0, mfunc 0x01d01002, devctl 0x64 [ 10.731599] BUG: unable to handle kernel NULL pointer dereference at 00000040 [ 10.731838] IP: [<c0c95b5f>] _spin_lock_irq+0xf/0x20 [ 10.732221] *pde = 00000000 [ 10.732741] Oops: 0002 [#1] SMP [ 10.733453] [ 10.734253] Pid: 1, comm: swapper Tainted: G W (2.6.27-rc3-tip-00173-gd7eaa4f-dirty #1) [ 10.735188] EIP: 0060:[<c0c95b5f>] EFLAGS: 00010002 CPU: 0 [ 10.735523] EIP is at _spin_lock_irq+0xf/0x20 [ 10.735523] EAX: 00000040 EBX: 00000000 ECX: f6e04c90 EDX: 00000100 [ 10.735523] ESI: 000000df EDI: f6e04c90 EBP: f7867df0 ESP: f7867df0 [ 10.735523] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 [ 10.735523] Process swapper (pid: 1, ti=f7867000 task=f7870000 task.ti=f7867000) [ 10.735523] Stack: f7867e04 c0155fbd 00000000 00000000 f6e04c90 f7867e5c c0c6e319 c0f6a074 [ 10.735523] f6e04c90 000017aa 00002012 c112b648 f791f240 c112b5e0 f7867e44 c010440b [ 10.735523] f791f240 f791f29c c112b8ec f791f240 00000000 f7867e5c c048f893 03c0b648 [ 10.735523] Call Trace: [ 10.735523] [<c0155fbd>] ? probe_irq_on+0x3d/0x140 [ 10.735523] [<c0c6e319>] ? yenta_probe+0x529/0x640 [ 10.735523] [<c010440b>] ? mcount_call+0x5/0xa [ 10.735523] [<c048f893>] ? pci_match_device+0xa3/0xb0 [ 10.735523] [<c048fc1e>] ? pci_device_probe+0x5e/0x80 [ 10.735523] [<c0515423>] ? driver_probe_device+0x83/0x180 [ 10.735523] [<c0515594>] ? __driver_attach+0x74/0x80 [ 10.735523] [<c0514b69>] ? bus_for_each_dev+0x49/0x70 [ 10.735523] [<c051528e>] ? driver_attach+0x1e/0x20 [ 10.735523] [<c0515520>] ? __driver_attach+0x0/0x80 [ 10.735523] [<c05150d3>] ? bus_add_driver+0x1a3/0x220 [ 10.735523] [<c048fb60>] ? pci_device_remove+0x0/0x40 [ 10.735523] [<c05157f4>] ? driver_register+0x54/0x130 [ 10.735523] [<c048fe2f>] ? __pci_register_driver+0x4f/0x90 [ 10.735523] [<c11e9419>] ? yenta_socket_init+0x19/0x20 [ 10.735523] [<c0101125>] ? do_one_initcall+0x35/0x160 [ 10.735523] [<c11e9400>] ? yenta_socket_init+0x0/0x20 [ 10.735523] [<c01391a6>] ? __queue_work+0x36/0x50 [ 10.735523] [<c013922d>] ? queue_work_on+0x3d/0x50 [ 10.735523] [<c11a2758>] ? kernel_init+0x148/0x210 [ 10.735523] [<c11a2610>] ? kernel_init+0x0/0x210 [ 10.735523] [<c01043f3>] ? kernel_thread_helper+0x7/0x10 [ 10.735523] ======================= [ 10.735523] Code: 10 38 f2 74 06 f3 90 8a 10 eb f6 5d 89 c8 c3 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 89 e5 e8 a4 e8 46 ff fa ba 00 01 00 00 90 <66> 0f c1 10 38 f2 74 06 f3 90 8a 10 eb f6 5d c3 90 55 89 e5 53 as auto-probing wants to iterate over existing irqs. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:30 +02:00
Yinghai Lu	08678b0841	generic: sparse irqs: use irq_desc() together with dyn_array, instead of irq_desc[] add CONFIG_HAVE_SPARSE_IRQ to for use condensed array. Get rid of irq_desc[] array assumptions. Preallocate 32 irq_desc, and irq_desc() will try to get more. ( No change in functionality is expected anywhere, except the odd build failure where we missed a code site or where a crossing commit itroduces new irq_desc[] usage. ) v2: according to Eric, change get_irq_desc() to irq_desc() Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:29 +02:00
Yinghai Lu	d17a55ded3	irq: make irqs in kernel stat use per_cpu_dyn_array Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:08 +02:00
Ingo Molnar	fa42d10dd5	irq: sparse irqs, export nr_irqs fix: Building modules, stage 2. MODPOST 458 modules ERROR: "nr_irqs" [drivers/serial/8250.ko] undefined! Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:08 +02:00
Yinghai Lu	d60458b224	irq: make irq_desc to use dyn_array Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:07 +02:00
Yinghai Lu	85c0f90978	irq: introduce nr_irqs at this point nr_irqs is equal NR_IRQS convert a few easy users from NR_IRQS to dynamic nr_irqs. v2: according to Eric, we need to take care of arch without generic_hardirqs Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-16 16:52:05 +02:00
Ingo Molnar	5fef06e8c8	Merge branch 'linus' into genirq	2008-10-16 16:51:32 +02:00
Peter Zijlstra	8cd162ce23	sched: only update rq->clock while holding rq->lock Vatsa noticed rq->clock going funny and tracked it down to an update_rq_clock() outside a rq->lock section. This is a problem because things like double_rq_lock() update the rq->clock value for both rqs. Therefore disabling interrupts isn't strong enough. Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-15 20:43:27 +02:00
Ingo Molnar	6b2ada8210	Merge branches 'core/softlockup', 'core/softirq', 'core/resources', 'core/printk' and 'core/misc' into core-v28-for-linus	2008-10-15 12:48:44 +02:00
Linus Torvalds	8acd3a60bc	Merge branch 'for-2.6.28' of git://linux-nfs.org/~bfields/linux * 'for-2.6.28' of git://linux-nfs.org/~bfields/linux: (59 commits) svcrdma: Fix IRD/ORD polarity svcrdma: Update svc_rdma_send_error to use DMA LKEY svcrdma: Modify the RPC reply path to use FRMR when available svcrdma: Modify the RPC recv path to use FRMR when available svcrdma: Add support to svc_rdma_send to handle chained WR svcrdma: Modify post recv path to use local dma key svcrdma: Add a service to register a Fast Reg MR with the device svcrdma: Query device for Fast Reg support during connection setup svcrdma: Add FRMR get/put services NLM: Remove unused argument from svc_addsock() function NLM: Remove "proto" argument from lockd_up() NLM: Always start both UDP and TCP listeners lockd: Remove unused fields in the nlm_reboot structure lockd: Add helper to sanity check incoming NOTIFY requests lockd: change nlmclnt_grant() to take a "struct sockaddr *" lockd: Adjust nlmsvc_lookup_host() to accomodate AF_INET6 addresses lockd: Adjust nlmclnt_lookup_host() signature to accomodate non-AF_INET lockd: Support non-AF_INET addresses in nlm_lookup_host() NLM: Convert nlm_lookup_host() to use a single argument svcrdma: Add Fast Reg MR Data Types ...	2008-10-14 12:31:14 -07:00
Ingo Molnar	98d9c66ab0	tracing/fastboot: improve help text Improve the help text of the boot tracer. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 14:27:20 +02:00
Ingo Molnar	4519d9e54d	tracing/stacktrace: improve help text Improve the help text that is displayed for CONFIG_STACK_TRACER. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 14:15:43 +02:00
Harvey Harrison	ad0a3b6811	trace: add build-time check to avoid overrunning hex buffer Remove the runtime BUG_ON and change to a compile-time check in the macro that calls the hex format routine [Noticed by Joe Perches] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:26 +02:00
Harvey Harrison	2fbc474901	ftrace: fix hex output mode of ftrace Fix the output of ftrace in hex mode as the hi/lo nibbles are output in reverse order. Without this patch, the output of ftrace is: raw mode : 6474 0 141531612444 0 140 + 6402 120 S hex mode : 000091a4 00000000 000000023f1f50c1 00000000 c8 000000b2 00009120 87 ffff00c8 00000035 There is an inversion on ouput hex(6474) is 194a [based on a patch by Philippe Reynes <tremyfr@yahoo.fr>] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:25 +02:00
Arjan van de Ven	8a5d900cca	tracing/fastboot: fix printk format typo in boot tracer When printing nanoseconds, the right printk format string is %09 not %06... Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Frédéric Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:23 +02:00
Frederic Weisbecker	c2931e05ec	ftrace: return an error when setting a nonexistent tracer When one try to set a nonexistent tracer, no error is returned as if the name of the tracer was correct. We should return -EINVAL. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:22 +02:00
Steven Rostedt	3ea2e6d71a	ftrace: make some tracers reentrant Now that the ring buffer is reentrant, some of the ftrace tracers (sched_swich, debugging traces) can also be reentrant. Note: Never make the function tracer reentrant, that can cause recursion problems all over the kernel. The function tracer must disable reentrancy. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:20 +02:00
Steven Rostedt	bf41a158ca	ring-buffer: make reentrant This patch replaces the local_irq_save/restore with preempt_disable/ enable. This allows for interrupts to enter while recording. To write to the ring buffer, you must reserve data, and then commit it. During this time, an interrupt may call a trace function that will also record into the buffer before the commit is made. The interrupt will reserve its entry after the first entry, even though the first entry did not finish yet. The time stamp delta of the interrupt entry will be zero, since in the view of the trace, the interrupt happened during the first field anyway. Locking still takes place when the tail/write moves from one page to the next. The reader always takes the locks. A new page pointer is added, called the commit. The write/tail will always point to the end of all entries. The commit field will point to the last committed entry. Only this commit entry may update the write time stamp. The reader can only go up to the commit. It cannot go past it. If a lot of interrupts come in during a commit that fills up the buffer, and it happens to make it all the way around the buffer back to the commit, then a warning is printed and new events will be dropped. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:19 +02:00
Steven Rostedt	6f807acd27	ring-buffer: move page indexes into page headers Remove the global head and tail indexes and move them into the page header. Each page will now keep track of where the last write and read was made. We also rename the head and tail to read and write for better clarification. This patch is needed for future enhancements to move the ring buffer to a lockless solution. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:18 +02:00
Frederic Weisbecker	097d036a2f	tracing/fastboot: only trace non-module initcalls At this time, only built-in initcalls interest us. We can't really produce a relevant graph if we include the modules initcall too. I had good results after this patch (see svg in attachment). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:17 +02:00
Steven Rostedt	6450c1d321	ftrace: move pc counter in irqtrace The assigning of the pc counter is in the wrong spot in the check_critical_timing function. The pc variable is used in the out jump. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:16 +02:00
Steven Rostedt	aa1e0e3bcf	ring_buffer: map to cpu not page My original patch had a compile bug when NUMA was configured. I referenced cpu when it should have been cpu_buffer->cpu. Ingo quickly fixed this bug by replacing cpu with 'i' because that was the loop counter. Unfortunately, the 'i' was the counter of pages, not CPUs. This caused a crash when the number of pages allocated for the buffers exceeded the number of pages, which would usually be the case. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:15 +02:00
Frederic Weisbecker	5601020feb	tracing/fastboot: get the initcall name before it disappears After some initcall traces, some initcall names may be inconsistent. That's because these functions will disappear from the .init section and also their name from the symbols table. So we have to copy the name of the function in a buffer large enough during the trace appending. It is not costly for the ring_buffer because the number of initcall entries is commonly not really large. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:12 +02:00
Frederic Weisbecker	cb5ab74204	tracing/fastboot: change the printing of boot tracer according to bootgraph.pl Change the boot tracer printing to make it parsable for the scripts/bootgraph.pl script. We have now to output two lines for each initcall, according to the printk in do_one_initcall() in init/main.c We need now the call's time and the return's time. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:11 +02:00
Ingo Molnar	77ae11f63b	ring-buffer: fix build error fix: kernel/trace/ring_buffer.c: In function ‘rb_allocate_pages’: kernel/trace/ring_buffer.c:235: error: ‘cpu’ undeclared (first use in this function) kernel/trace/ring_buffer.c:235: error: (Each undeclared identifier is reported only once kernel/trace/ring_buffer.c:235: error: for each function it appears in.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:10 +02:00
Steven Rostedt	38697053fa	ftrace: preempt disable over interrupt disable With the new ring buffer infrastructure in ftrace, I'm trying to make ftrace a little more light weight. This patch converts a lot of the local_irq_save/restore into preempt_disable/enable. The original preempt count in a lot of cases has to be sent in as a parameter so that it can be recorded correctly. Some places were recording it incorrectly before anyway. This is also laying the ground work to make ftrace a little bit more reentrant, and remove all locking. The function tracers must still protect from reentrancy. Note: All the function tracers must be careful when using preempt_disable. It must do the following: resched = need_resched(); preempt_disable_notrace(); [...] if (resched) preempt_enable_no_resched_notrace(); else preempt_enable_notrace(); The reason is that if this function traces schedule() itself, the preempt_enable_notrace() will cause a schedule, which will lead us into a recursive failure. If we needed to reschedule before calling preempt_disable, we should have already scheduled. Since we did not, this is most likely that we should not and are probably inside a schedule function. If resched was not set, we still need to catch the need resched flag being set when preemption was off and the if case at the end will catch that for us. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:09 +02:00
Steven Rostedt	e4c2ce82ca	ring_buffer: allocate buffer page pointer The current method of overlaying the page frame as the buffer page pointer can be very dangerous and limits our ability to do other things with a page from the buffer, like send it off to disk. This patch allocates the buffer_page instead of overlaying the page's page frame. The use of the buffer_page has hardly changed due to this. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:08 +02:00
Steven Rostedt	7104f300c5	ftrace: type cast filter+verifier The mmiotrace map had a bug that would typecast the entry from the trace to the wrong type. That is a known danger of C typecasts, there's absolutely zero checking done on them. Help that problem a bit by using a GCC extension to implement a type filter that restricts the types that a trace record can be cast into, and by adding a dynamic check (in debug mode) to verify the type of the entry. This patch adds a macro to assign all entries of ftrace using the type of the variable and checking the entry id. The typecasts are now done in the macro for only those types that it knows about, which should be all the types that are allowed to be read from the tracer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:07 +02:00
Frederic Weisbecker	797d3712a9	tracing/ftrace: adapt mmiotrace to the new type of print_line, fix Correct the value's type of trace_empty function Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:06 +02:00
Steven Rostedt	d769041f86	ring_buffer: implement new locking The old "lock always" scheme had issues with lockdep, and was not very efficient anyways. This patch does a new design to be partially lockless on writes. Writes will add new entries to the per cpu pages by simply disabling interrupts. When a write needs to go to another page than it will grab the lock. A new "read page" has been added so that the reader can pull out a page from the ring buffer to read without worrying about the writer writing over it. This allows us to not take the lock for all reads. The lock is now only taken when a read needs to go to a new page. This is far from lockless, and interrupts still need to be disabled, but it is a step towards a more lockless solution, and it also solves a lot of the issues that were noticed by the first conversion of ftrace to the ring buffers. Note: the ring_buffer_{un}lock API has been removed. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:05 +02:00
Steven Rostedt	70255b5e3f	ring_buffer: remove raw from local_irq_save The raw_local_irq_save causes issues with lockdep. We don't need it so replace them with local_irq_save. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:04 +02:00
Frederic Weisbecker	9e9efffb78	tracing/ftrace: adapt the boot tracer to the new print_line type This patch adapts the boot tracer to the new type of the print_line callback. It still relays entries it doesn't support to default output functions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:03 +02:00
Frederic Weisbecker	07f4e4f790	tracing/ftrace: adapt mmiotrace to the new type of print_line Adapt mmiotrace to the new print_line type. By default, it ignores (and consumes) types it doesn't support. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:02 +02:00
Pekka Paalanen	9ff4b9744c	tracing/ftrace: fix pipe breaking This patch fixes a bug which break the pipe when the seq is empty. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:01 +02:00
Frederic Weisbecker	2c4f035f6c	tracing/ftrace: change the type of the print_line callback We need a kind of disambiguation when a print_line callback returns 0. _There is not enough space to print all the entry. Please flush the seq and retry. _I can't handle this type of entry This patch changes the type of this callback for better information. Also some changes have been made in this V2. _ Only relay to default functions after the print_line callback fails. _ This patch doesn't fix the issue with the broken pipe (see patch 2/4 for that) Some things are still in discussion: _ Find better names for the enum print_line_t values _ Change the type of print_trace_line into boolean. Patches to change that can be sent later. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Pekka Paalanen <pq@iki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:39:00 +02:00
Steven Rostedt	777e208d40	ftrace: take advantage of variable length entries Now that the underlining ring buffer for ftrace now hold variable length entries, we can take advantage of this by only storing the size of the actual event into the buffer. This happens to increase the number of entries in the buffer dramatically. We can also get rid of the "trace_cont" operation, but I'm keeping that until we have no more users. Some of the ftrace tracers can now change their code to adapt to this new feature. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:59 +02:00
Steven Rostedt	3928a8a2d9	ftrace: make work with new ring buffer This patch ports ftrace over to the new ring buffer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:57 +02:00
Steven Rostedt	ed56829cb3	ring_buffer: reset buffer page when freeing Mathieu Desnoyers pointed out that the freeing of the page frame needs to be reset otherwise we might trigger BUG_ON in the page free code. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:56 +02:00
Steven Rostedt	a7b1374333	ring_buffer: add paranoid check for buffer page If for some strange reason the buffer_page gets bigger, or the page struct gets smaller, I want to know this ASAP. The best way is to not let the kernel compile. This patch adds code to test the size of the struct buffer_page against the page struct and will cause compile issues if the buffer_page ever gets bigger than the page struct. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:55 +02:00
Steven Rostedt	7a8e76a382	tracing: unified trace buffer This is a unified tracing buffer that implements a ring buffer that hopefully everyone will eventually be able to use. The events recorded into the buffer have the following structure: struct ring_buffer_event { u32 type:2, len:3, time_delta:27; u32 array[]; }; The minimum size of an event is 8 bytes. All events are 4 byte aligned inside the buffer. There are 4 types (all internal use for the ring buffer, only the data type is exported to the interface users). RINGBUF_TYPE_PADDING: this type is used to note extra space at the end of a buffer page. RINGBUF_TYPE_TIME_EXTENT: This type is used when the time between events is greater than the 27 bit delta can hold. We add another 32 bits, and record that in its own event (8 byte size). RINGBUF_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to help keep the buffer timestamps in sync. RINGBUF_TYPE_DATA: The event actually holds user data. The "len" field is only three bits. Since the data must be 4 byte aligned, this field is shifted left by 2, giving a max length of 28 bytes. If the data load is greater than 28 bytes, the first array field holds the full length of the data load and the len field is set to zero. Example, data size of 7 bytes: type = RINGBUF_TYPE_DATA len = 2 time_delta: <time-stamp> - <prev_event-time-stamp> array[0..1]: <7 bytes of data> <1 byte empty> This event is saved in 12 bytes of the buffer. An event with 82 bytes of data: type = RINGBUF_TYPE_DATA len = 0 time_delta: <time-stamp> - <prev_event-time-stamp> array[0]: 84 (Note the alignment) array[1..14]: <82 bytes of data> <2 bytes empty> The above event is saved in 92 bytes (if my math is correct). 82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length. Do not reference the above event struct directly. Use the following functions to gain access to the event table, since the ring_buffer_event structure may change in the future. ring_buffer_event_length(event): get the length of the event. This is the size of the memory used to record this event, and not the size of the data pay load. ring_buffer_time_delta(event): get the time delta of the event This returns the delta time stamp since the last event. Note: Even though this is in the header, there should be no reason to access this directly, accept for debugging. ring_buffer_event_data(event): get the data from the event This is the function to use to get the actual data from the event. Note, it is only a pointer to the data inside the buffer. This data must be copied to another location otherwise you risk it being written over in the buffer. ring_buffer_lock: A way to lock the entire buffer. ring_buffer_unlock: unlock the buffer. ring_buffer_alloc: create a new ring buffer. Can choose between overwrite or consumer/producer mode. Overwrite will overwrite old data, where as consumer producer will throw away new data if the consumer catches up with the producer. The consumer/producer is the default. ring_buffer_free: free the ring buffer. ring_buffer_resize: resize the buffer. Changes the size of each cpu buffer. Note, it is up to the caller to provide that the buffer is not being used while this is happening. This requirement may go away but do not count on it. ring_buffer_lock_reserve: locks the ring buffer and allocates an entry on the buffer to write to. ring_buffer_unlock_commit: unlocks the ring buffer and commits it to the buffer. ring_buffer_write: writes some data into the ring buffer. ring_buffer_peek: Look at a next item in the cpu buffer. ring_buffer_consume: get the next item in the cpu buffer and consume it. That is, this function increments the head pointer. ring_buffer_read_start: Start an iterator of a cpu buffer. For now, this disables the cpu buffer, until you issue a finish. This is just because we do not want the iterator to be overwritten. This restriction may change in the future. But note, this is used for static reading of a buffer which is usually done "after" a trace. Live readings would want to use the ring_buffer_consume above, which will not disable the ring buffer. ring_buffer_read_finish: Finishes the read iterator and reenables the ring buffer. ring_buffer_iter_peek: Look at the next item in the cpu iterator. ring_buffer_read: Read the iterator and increment it. ring_buffer_iter_reset: Reset the iterator to point to the beginning of the cpu buffer. ring_buffer_iter_empty: Returns true if the iterator is at the end of the cpu buffer. ring_buffer_size: returns the size in bytes of each cpu buffer. Note, the real size is this times the number of CPUs. ring_buffer_reset_cpu: Sets the cpu buffer to empty ring_buffer_reset: sets all cpu buffers to empty ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a cpu buffer of another buffer. This is handy when you want to take a snap shot of a running trace on just one cpu. Having a backup buffer, to swap with facilitates this. Ftrace max latencies use this. ring_buffer_empty: Returns true if the ring buffer is empty. ring_buffer_empty_cpu: Returns true if the cpu buffer is empty. ring_buffer_record_disable: disable all cpu buffers (read only) ring_buffer_record_disable_cpu: disable a single cpu buffer (read only) ring_buffer_record_enable: enable all cpu buffers. ring_buffer_record_enabl_cpu: enable a single cpu buffer. ring_buffer_entries: The number of entries in a ring buffer. ring_buffer_overruns: The number of entries removed due to writing wrap. ring_buffer_time_stamp: Get the time stamp used by the ring buffer ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp into nanosecs. I still need to implement the GTOD feature. But we need support from the cpu frequency infrastructure. But this can be done at a later time without affecting the ring buffer interface. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:54 +02:00
Steven Rostedt	5aa60c6073	ftrace: give time for wakeup test to run It is possible that the testing thread in the ftrace wakeup test does not run before we stop the trace. This will cause the trace to fail since nothing will be in the buffers. This patch adds a small wait in the wakeup test to allow for the woken task to run and be traced. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:53 +02:00
Frédéric Weisbecker	7c572ac0cf	tracing/ftrace: don't consume unhandled entries by boot tracer When the boot tracer can't handle an entry output, it returns 1. It should return 0 to relay on other output functions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:52 +02:00
Frédéric Weisbecker	3ce2b9200d	ftrace/fastboot: disable tracers self-tests when boot tracer is selected The tracing engine resets the ring buffer and the tracers touch it too during self-tests. These self-tests happen during tracers registering and work against boot tracing which is logging initcalls. We have to disable tracing self-tests if the boot-tracer is selected. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:51 +02:00
Frédéric Weisbecker	1f5c2abbde	tracing/ftrace: give an entry on the config for boot tracer Bring the entry to choose the boot tracer on the kernel config. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:49 +02:00
Frédéric Weisbecker	b5ad384e79	tracing/ftrace: make tracing suitable to run the boot tracer The tracing engine have now to be init in early_initcall to set the boot tracer. Only the debugfs settings will be initialized at fs_initcall time. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:48 +02:00
Frédéric Weisbecker	d13744cd6e	tracing/ftrace: add the boot tracer Add the boot/initcall tracer. It's primary purpose is to be able to trace the initcalls. It is intended to be used with scripts/bootgraph.pl after some small improvements. Note that it is not active after its init. To avoid tracing (and so crashing) before the whole tracing engine init, you have to explicitly call start_boot_trace() after do_pre_smp_initcalls() to enable it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:47 +02:00
Lai Jiangshan	1b7ae37c03	markers: bit-field is not thread-safe nor smp-safe bit-field is not thread-safe nor smp-safe. struct marker_entry.rcu_pending is not protected by any lock in rcu-callback free_old_closure(). so we must turn it into a safe type. detail: I suppose rcu_pending and ptype are store in struct marker_entry.tmp1 free_old_closure() side: change ptype side: \| load struct marker_entry.tmp1 --------------------------------\|-------------------------------- \| change ptype bit in tmp1 load struct marker_entry.tmp1 \| change rcu_pending bit in tmp1 \| store tmp1 \| --------------------------------\|-------------------------------- \| store tmp1 now this result equals that free_old_closure() do not change rcu_pending bit, bug! This bug will cause redundant rcu_barrier_sched() called. not too harmful. ----- corresponding: free_old_closure() side: change ptype side: load struct marker_entry.tmp1 \| --------------------------------\|-------------------------------- \| load struct marker_entry.tmp1 change rcu_pending bit in tmp1 \| \| change ptype bit in tmp1 \| store tmp1 --------------------------------\|-------------------------------- store tmp1 \| now this result equals that change ptype side do not change ptype bit, bug! this bug cause marker_probe_cb() access to invalid memory. oops! see also: http://en.wikipedia.org/wiki/Bit_field Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:45 +02:00
Lai Jiangshan	48043bcdf8	markers: fix unchecked format when the second, third... probe is registered, its format is not checked, this patch fixes it. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:42 +02:00
Mathieu Desnoyers	ed86a59071	markers: re-enable fast batch registration Lai Jiangshan discovered a reentrancy issue with markers and fixed it by adding synchronize_sched() calls at each registration/unregistraiton. It works, but it removes the ability to do batch registration/unregistration and can cause registration of ~100 markers to take about 30 seconds on a loaded machine (synchronize_sched() is much slower on such workloads). This patch implements a version of the fix which won't slow down marker batch registration/unregistration. It also go back to the original non-synchronized reg/unreg. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:38 +02:00
Mathieu Desnoyers	e2d3b75dbc	markers: fix unregister bug and reenter bug, cleanup Use the new rcu_read_lock_sched/unlock_sched() in marker code around the call site instead of preempt_disable/enable(). It helps reviewing the code more easily. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:28 +02:00
Mathieu Desnoyers	9a1e9693f5	tracepoints: fix reentrancy The tracepoints had the same problem markers did have wrt reentrancy. Apply a similar fix using a rcu_barrier after each tracepoint mutex lock. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:23 +02:00
Mathieu Desnoyers	ca2db6cf30	tracepoints: use rcu sched Make tracepoints use rcu sched. (cleanup) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:21 +02:00
Lai Jiangshan	d74185ed27	markers: fix unregister bug and reenter bug unregister bug: codes using makers are typically calling marker_probe_unregister() and then destroying the data that marker_probe_func needs(or unloading this module). This is bug when the corresponding marker_probe_func is still running(on other cpus), it is using the destroying/ed data. we should call synchronize_sched() after marker_update_probes(). reenter bug: marker_probe_register(), marker_probe_unregister() and marker_probe_unregister_private_data() are not reentrant safe functions. these 3 functions release markers_mutex and then require it again and do "entry->oldptr = old; ...", but entry->oldptr maybe is using now for these 3 functions may reenter when markers_mutex is released. we use synchronize_sched() instead of call_rcu_sched() to fix this bug. actually we can do: " if (entry->rcu_pending) rcu_barrier_sched(); " after require markers_mutex again. but synchronize_sched() is better and simpler. For these 3 functions are not critical path. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:19 +02:00
Steven Rostedt	05736a427f	ftrace: warn on failure to disable mcount callers With the recent updates to ftrace, there should not be any failures when modifying the code. If there is, then we need to warn about it. This patch has a cleaned up version of the code that I used to discover that the weak symbols were causing failures. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:11 +02:00
Frédéric Weisbecker	43a15386c4	tracing/ftrace: replace none tracer by nop tracer Replace "none" tracer by the recently created "nop" tracer. Both are pretty similar except that nop accepts TRACE_PRINT or TRACE_SPECIAL entries. And as a consequence, changing the size of the ring buffer now requires that tracing has already been disabled. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:09 +02:00
Frédéric Weisbecker	2a3a4f669d	tracing/ftrace: tracing engine depends on Nop Tracer Now that the nop tracer is used as the default tracer by replacing the "none" tracer, tracing engine depends on it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:06 +02:00
Frédéric Weisbecker	35cb5ed012	tracing/ftrace: make nop tracer reset previous entries If nop tracer is selected, some old entries from the previous tracer could still be enqueued. Tracing have to be reset. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:04 +02:00
Steven Noonan	8925b394ec	trace: remove pointless ifdefs The functions are already 'extern' anyway, so there's no problem with linkage. Removing these ifdefs also helps find any potential compiler errors. Suggested by Andrew Morton. Signed-off-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:38:01 +02:00
Steven Noonan	71c67d58b5	ftrace: mcount_addr defined but not used When CONFIG_DYNAMIC_FTRACE isn't used, neither is mcount_addr. This patch eliminates that warning. Signed-off-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:37:58 +02:00
Steven Noonan	fb1b6d8b51	ftrace: add nop tracer A no-op tracer which can serve two purposes: 1. A template for development of a new tracer. 2. A convenient way to see ftrace_printk() calls without an irrelevant trace making the output messy. [ mingo@elte.hu: resolved conflicts ] Signed-off-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:37:43 +02:00
Pekka Paalanen	5bf9a1ee35	ftrace: inject markers via trace_marker file Allow a user to inject a marker (TRACE_PRINT entry) into the trace ring buffer. The related file operations are derived from code by Frédéric Weisbecker <fweisbec@gmail.com>. Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:37:20 +02:00
Pekka Paalanen	fc5e27ae4b	mmiotrace: handle TRACE_PRINT entries Also make trace_seq_print_cont() non-static, and add a newline if the seq buffer can't hold all data. Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:37:14 +02:00
Pekka Paalanen	9e57fb35d7	x86 mmiotrace: implement mmiotrace_printk() Offer mmiotrace users a function to inject markers from inside the kernel. This depends on the trace_vprintk() patch. Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:37:11 +02:00
Pekka Paalanen	801fe40001	ftrace: add trace_vprintk() trace_vprintk() for easier implementation of tracer specific *_printk functions. Add check check for no_tracer, and implement __ftrace_printk() as a wrapper. Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:37:07 +02:00
Pekka Paalanen	45dcd8b8a8	ftrace: move mmiotrace functions out of trace.c Moves the mmiotrace specific functions from trace.c to trace_mmiotrace.c. Functions trace_wake_up(), tracing_get_trace_entry(), and tracing_generic_entry_update() are therefore made available outside trace.c. Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:37:04 +02:00
Steven Rostedt	644f991d4b	ftrace: fix unlocking of hash This must be brown paper bag week for Steven Rostedt! While working on ftrace for PPC, I discovered that the hash locking done when CONFIG_FTRACE_MCOUNT_RECORD is not set, is totally incorrect. With a cut and paste error, I had the hash lock macro to lock for both hash_lock _and_ hash_unlock! This bug did not affect x86 since this bug was introduced when CONFIG_FTRACE_MCOUNT_RECORD was added to x86. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:58 +02:00
Ingo Molnar	d3ee6d9928	ftrace: make it depend on DEBUG_KERNEL make most of the tracers depend on DEBUG_KERNEL - that's their intended purpose. (most distributions have DEBUG_KERNEL enabled anyway so this is not a practical limitation - but it simplifies the tracing menu in the normal case) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:51 +02:00
Peter Zijlstra	80b5e94005	ftrace: sched_switch: show the wakee's cpu While profiling the smp behaviour of the scheduler it was needed to know to which cpu a task got woken. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:48 +02:00
Peter Zijlstra	f09ce573f5	ftrace: make ftrace_printk usable with the other tracers Currently ftrace_printk only works with the ftrace tracer, switch it to an iter_ctrl setting so we can make us of them with other tracers too. [rostedt@redhat.com: tweak to the disable condition] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:45 +02:00
Steven Rostedt	5a90f577e5	ftrace: print continue index fix An item in the trace buffer that is bigger than one entry may be split up using the TRACE_CONT entry. This makes it a virtual single entry. The current code increments the iterator index even while traversing TRACE_CONT entries, making it look like the iterator is further than it actually is. This patch adds code to not increment the iterator index while skipping over TRACE_CONT entries. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:42 +02:00
Steven Rostedt	652567aa20	ftrace: binary and not logical for continue test Peter Zijlstra provided me with a nice brown paper bag while letting me know that I was doing a logical AND and not a binary one, making a condition true more often than it should be. Luckily, a false true is handled by the calling function and no harm is done. But this needs to be fixed regardless. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:39 +02:00
Michael Ellerman	a6168353d1	ftrace: make output nicely spaced for up to 999 cpus Currently some of the ftrace output goes skewiff if you have more than 9 cpus, and some if you have more than 99. Twiddle with the headers and format strings to make up to 999 cpus display without causing spacing problems. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:34 +02:00
Ingo Molnar	2ff01c6a17	stack tracer: depends on DEBUG_KERNEL Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:31 +02:00
Steven Rostedt	1b6cced6ec	ftrace: stack trace add indexes This patch adds indexes into the stack that the functions in the stack dump were found at. As an added bonus, I also added a diff to show which function is the most notorious consumer of the stack. The output now looks like this: # cat /debug/tracing/stack_trace Depth Size Location (48 entries) ----- ---- -------- 0) 2476 212 blk_recount_segments+0x39/0x59 1) 2264 12 bio_phys_segments+0x16/0x1d 2) 2252 20 blk_rq_bio_prep+0x23/0xaf 3) 2232 12 init_request_from_bio+0x74/0x77 4) 2220 56 __make_request+0x294/0x331 5) 2164 136 generic_make_request+0x34f/0x37d 6) 2028 56 submit_bio+0xe7/0xef 7) 1972 28 submit_bh+0xd1/0xf0 8) 1944 112 block_read_full_page+0x299/0x2a9 9) 1832 8 blkdev_readpage+0x14/0x16 10) 1824 28 read_cache_page_async+0x7e/0x109 11) 1796 16 read_cache_page+0x11/0x49 12) 1780 32 read_dev_sector+0x3c/0x72 13) 1748 48 read_lba+0x4d/0xaa 14) 1700 168 efi_partition+0x85/0x61b 15) 1532 72 rescan_partitions+0x10e/0x266 16) 1460 40 do_open+0x1c7/0x24e 17) 1420 292 __blkdev_get+0x79/0x84 18) 1128 12 blkdev_get+0x12/0x14 19) 1116 20 register_disk+0xd1/0x11e 20) 1096 28 add_disk+0x34/0x90 21) 1068 52 sd_probe+0x2b1/0x366 22) 1016 20 driver_probe_device+0xa5/0x120 23) 996 8 __device_attach+0xd/0xf 24) 988 32 bus_for_each_drv+0x3e/0x68 25) 956 24 device_attach+0x56/0x6c 26) 932 16 bus_attach_device+0x26/0x4d 27) 916 64 device_add+0x380/0x4b4 28) 852 28 scsi_sysfs_add_sdev+0xa1/0x1c9 29) 824 160 scsi_probe_and_add_lun+0x919/0xa2a 30) 664 36 __scsi_add_device+0x88/0xae 31) 628 44 ata_scsi_scan_host+0x9e/0x21c 32) 584 28 ata_host_register+0x1cb/0x1db 33) 556 24 ata_host_activate+0x98/0xb5 34) 532 192 ahci_init_one+0x9bd/0x9e9 35) 340 20 pci_device_probe+0x3e/0x5e 36) 320 20 driver_probe_device+0xa5/0x120 37) 300 20 __driver_attach+0x3f/0x5e 38) 280 36 bus_for_each_dev+0x40/0x62 39) 244 12 driver_attach+0x19/0x1b 40) 232 28 bus_add_driver+0x9c/0x1af 41) 204 28 driver_register+0x76/0xd2 42) 176 20 __pci_register_driver+0x44/0x71 43) 156 8 ahci_init+0x14/0x16 44) 148 100 _stext+0x42/0x122 45) 48 20 kernel_init+0x175/0x1dc 46) 28 28 kernel_thread_helper+0x7/0x10 The first column is simply an index starting from the inner most function and counting down to the outer most. The next column is the location that the function was found on the stack. The next column is the size of the stack for that function. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:28 +02:00
Steven Rostedt	3b47bfc1fc	ftrace: remove direct reference to mcount in trace code The mcount record method of ftrace scans objdump for references to mcount. Using mcount as the reference to test if the calls to mcount being replaced are indeed calls to mcount, this use of mcount was also caught as a location to change. Using a variable that points to the mcount address moves this reference into the data section that is not scanned, and we do not use a false location to try and modify. The warn on code was what was used to detect this bug. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:22 +02:00
Steven Rostedt	e5a81b629e	ftrace: add stack tracer This is another tracer using the ftrace infrastructure, that examines at each function call the size of the stack. If the stack use is greater than the previous max it is recorded. You can always see (and set) the max stack size seen. By setting it to zero will start the recording again. The backtrace is also available. For example: # cat /debug/tracing/stack_max_size 1856 # cat /debug/tracing/stack_trace [<c027764d>] stack_trace_call+0x8f/0x101 [<c021b966>] ftrace_call+0x5/0x8 [<c02553cc>] clocksource_get_next+0x12/0x48 [<c02542a5>] update_wall_time+0x538/0x6d1 [<c0245913>] do_timer+0x23/0xb0 [<c0257657>] tick_do_update_jiffies64+0xd9/0xf1 [<c02576b9>] tick_sched_timer+0x4a/0xad [<c0250fe6>] __run_hrtimer+0x3e/0x75 [<c02518ed>] hrtimer_interrupt+0xf1/0x154 [<c022c870>] smp_apic_timer_interrupt+0x71/0x84 [<c021b7e9>] apic_timer_interrupt+0x2d/0x34 [<c0238597>] finish_task_switch+0x29/0xa0 [<c05abd13>] schedule+0x765/0x7be [<c05abfca>] schedule_timeout+0x1b/0x90 [<c05ab4d4>] wait_for_common+0xab/0x101 [<c05ab5ac>] wait_for_completion+0x12/0x14 [<c033cfc3>] blk_execute_rq+0x84/0x99 [<c0402470>] scsi_execute+0xc2/0x105 [<c040250a>] scsi_execute_req+0x57/0x7f [<c043afe0>] sr_test_unit_ready+0x3e/0x97 [<c043bbd6>] sr_media_change+0x43/0x205 [<c046b59f>] media_changed+0x48/0x77 [<c046b5ff>] cdrom_media_changed+0x31/0x37 [<c043b091>] sr_block_media_changed+0x16/0x18 [<c02b9e69>] check_disk_change+0x1b/0x63 [<c046f4c3>] cdrom_open+0x7a1/0x806 [<c043b148>] sr_block_open+0x78/0x8d [<c02ba4c0>] do_open+0x90/0x257 [<c02ba869>] blkdev_open+0x2d/0x56 [<c0296a1f>] __dentry_open+0x14d/0x23c [<c0296b32>] nameidata_to_filp+0x24/0x38 [<c02a1c68>] do_filp_open+0x347/0x626 [<c02967ef>] do_sys_open+0x47/0xbc [<c02968b0>] sys_open+0x23/0x2b [<c021aadd>] sysenter_do_call+0x12/0x26 I've tested this on both x86_64 and i386. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:19 +02:00
Ingo Molnar	ac8825ec6d	ftrace: clean up macro usage enclose the argument in parenthesis. (especially since we cast it, which is a high prio operation) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:09 +02:00
Stephen Rothwell	2d7da80f71	ftrace: fix build failure After disabling FTRACE_MCOUNT_RECORD via a patch, a dormant build failure surfaced: kernel/trace/ftrace.c: In function 'ftrace_record_ip': kernel/trace/ftrace.c:416: error: incompatible type for argument 1 of '_spin_lock_irqsave' kernel/trace/ftrace.c:433: error: incompatible type for argument 1 of '_spin_lock_irqsave' Introduced by commit 6dad8e07f4c10b17b038e84d29f3ca41c2e55cd0 ("ftrace: add necessary locking for ftrace records"). Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:36:06 +02:00
Steven Rostedt	99ecdc43bc	ftrace: add necessary locking for ftrace records The new design of pre-recorded mcounts and updating the code outside of kstop_machine has changed the way the records themselves are protected. This patch uses the ftrace_lock to protect the records. Note, the lock still does not need to be taken within calls that are only called via kstop_machine, since the that code can not run while the spin lock is held. Also removed the hash_lock needed for the daemon when MCOUNT_RECORD is configured. Also did a slight cleanup of an unused variable. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:47 +02:00
Steven Rostedt	00fd61aee1	ftrace: do not init module on ftrace disabled If one of the self tests of ftrace has disabled the function tracer, do not run the code to convert the mcount calls in modules. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:43 +02:00
Frédéric Weisbecker	98a983aad2	ftrace: fix some mistakes in error messages This patch fixes some mistakes on the tracer in warning messages when debugfs fails to create tracing files. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: srostedt@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:40 +02:00
Steven Rostedt	3f5a54e371	ftrace: dump out ftrace buffers to console on panic At OLS I had a lot of interest to be able to have the ftrace buffers dumped on panic. Usually one would expect to uses kexec and examine the buffers after a new kernel is loaded. But sometimes the resources do not permit kdump and kexec, so having an option to still see the sequence of events up to the crash is very advantageous. This patch adds the option to have the ftrace buffers dumped to the console in the latency_trace format on a panic. When the option is set, the default entries per CPU buffer are lowered to 16384, since the writing to the serial (if that is the console) may take an awful long time otherwise. [ Changes since -v1: Got alpine to send correctly (as well as spell check working). Removed config option. Moved the static variables into ftrace_dump itself. Gave printk a log level. ] Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:26 +02:00
Steven Rostedt	2f2c99dba2	ftrace: ftrace_printk doc moved Based on Randy Dunlap's suggestion, the ftrace_printk kernel-doc belongs with the ftrace_printk macro that should be used. Not with the __ftrace_printk internal function. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:22 +02:00
Steven Rostedt	dd0e545f06	ftrace: printk formatting infrastructure This patch adds a feature that can help kernel developers debug their code using ftrace. int ftrace_printk(const char *fmt, ...); This records into the ftrace buffer using printf formatting. The entry size in the buffers are still a fixed length. A new type has been added that allows for more entries to be used for a single recording. The start of the print is still the same as the other entries. It returns the number of characters written to the ftrace buffer. For example: Having a module with the following code: static int __init ftrace_print_test(void) { ftrace_printk("jiffies are %ld\n", jiffies); return 0; } Gives me: insmod-5441 3...1 7569us : ftrace_print_test: jiffies are 4296626666 for the latency_trace file and: insmod-5441 [03] 1959.370498: ftrace_print_test jiffies are 4296626666 for the trace file. Note: Only the infrastructure should go into the kernel. It is to help facilitate debugging for other kernel developers. Calls to ftrace_printk is not intended to be left in the kernel, and should be frowned upon just like scattering printks around in the code. But having this easily at your fingertips helps the debugging go faster and bugs be solved quicker. Maybe later on, we can hook this with markers and have their printf format be sucked into ftrace output. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:19 +02:00
Steven Rostedt	2e2ca155cd	ftrace: new continue entry - separate out from trace_entry Some tracers will need to work with more than one entry. In order to do this the trace_entry structure was split into two fields. One for the start of all entries, and one to continue an existing entry. The trace_entry structure now has a "field" entry that consists of the previous content of the trace_entry, and a "cont" entry that is just a string buffer the size of the "field" entry. Thanks to Andrew Morton for suggesting this idea. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:15 +02:00
Steven Rostedt	fed1939c64	ftrace: remove old pointers to mcount When a mcount pointer is recorded into a table, it is used to add or remove calls to mcount (replacing them with nops). If the code is removed via removing a module, the pointers still exist. At modifying the code a check is always made to make sure the code being replaced is the code expected. In-other-words, the code being replaced is compared to what it is expected to be before being replaced. There is a very small chance that the code being replaced just happens to look like code that calls mcount (very small since the call to mcount is relative). To remove this chance, this patch adds ftrace_release to allow module unloading to remove the pointers to mcount within the module. Another change for init calls is made to not trace calls marked with __init. The tracing can not be started until after init is done anyway. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:12 +02:00
Steven Rostedt	a9fdda33cd	ftrace: do not show freed records in available_filter_functions Seems that freed records can appear in the available_filter_functions list. This patch fixes that. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:35:05 +02:00
Steven Rostedt	90d595fe5c	ftrace: enable mcount recording for modules This patch enables the loading of the __mcount_section of modules and changing all the callers of mcount into nops. The modification is done before the init_module function is called, so again, we do not need to use kstop_machine to make these changes. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:34:47 +02:00
Steven Rostedt	68bf21aa15	ftrace: mcount call site on boot nops core This is the infrastructure to the converting the mcount call sites recorded by the __mcount_loc section into nops on boot. It also allows for using these sites to enable tracing as normal. When the __mcount_loc section is used, the "ftraced" kernel thread is disabled. This uses the current infrastructure to record the mcount call sites as well as convert them to nops. The mcount function is kept as a stub on boot up and not converted to the ftrace_record_ip function. We use the ftrace_record_ip to only record from the table. This patch does not handle modules. That comes with a later patch. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:34:44 +02:00
Steven Rostedt	8da3821ba5	ftrace: create __mcount_loc section This patch creates a section in the kernel called "__mcount_loc". This will hold a list of pointers to the mcount relocation for each call site of mcount. For example: objdump -dr init/main.o [...] Disassembly of section .text: 0000000000000000 <do_one_initcall>: 0: 55 push %rbp [...] 000000000000017b <init_post>: 17b: 55 push %rbp 17c: 48 89 e5 mov %rsp,%rbp 17f: 53 push %rbx 180: 48 83 ec 08 sub $0x8,%rsp 184: e8 00 00 00 00 callq 189 <init_post+0xe> 185: R_X86_64_PC32 mcount+0xfffffffffffffffc [...] We will add a section to point to each function call. .section __mcount_loc,"a",@progbits [...] .quad .text + 0x185 [...] The offset to of the mcount call site in init_post is an offset from the start of the section, and not the start of the function init_post. The mcount relocation is at the call site 0x185 from the start of the .text section. .text + 0x185 == init_post + 0xa We need a way to add this __mcount_loc section in a way that we do not lose the relocations after final link. The .text section here will be attached to all other .text sections after final link and the offsets will be meaningless. We need to keep track of where these .text sections are. To do this, we use the start of the first function in the section. do_one_initcall. We can make a tmp.s file with this function as a reference to the start of the .text section. .section __mcount_loc,"a",@progbits [...] .quad do_one_initcall + 0x185 [...] Then we can compile the tmp.s into a tmp.o gcc -c tmp.s -o tmp.o And link it into back into main.o. ld -r main.o tmp.o -o tmp_main.o mv tmp_main.o main.o But we have a problem. What happens if the first function in a section is not exported, and is a static function. The linker will not let the tmp.o use it. This case exists in main.o as well. Disassembly of section .init.text: 0000000000000000 <set_reset_devices>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: e8 00 00 00 00 callq 9 <set_reset_devices+0x9> 5: R_X86_64_PC32 mcount+0xfffffffffffffffc The first function in .init.text is a static function. 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices The lowercase 't' means that set_reset_devices is local and is not exported. If we simply try to link the tmp.o with the set_reset_devices we end up with two symbols: one local and one global. .section __mcount_loc,"a",@progbits .quad set_reset_devices + 0x10 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices U set_reset_devices We still have an undefined reference to set_reset_devices, and if we try to compile the kernel, we will end up with an undefined reference to set_reset_devices, or even worst, it could be exported someplace else, and then we will have a reference to the wrong location. To handle this case, we make an intermediate step using objcopy. We convert set_reset_devices into a global exported symbol before linking it with tmp.o and set it back afterwards. 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 T set_reset_devices 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 T set_reset_devices 00000000000000a8 t __setup_set_reset_devices 000000000000105f t __setup_str_set_reset_devices 0000000000000000 t set_reset_devices Now we have a section in main.o called __mcount_loc that we can place somewhere in the kernel using vmlinux.ld.S and access it to convert all these locations that call mcount into nops before starting SMP and thus, eliminating the need to do this with kstop_machine. Note, A well documented perl script (scripts/recordmcount.pl) is used to do all this in one location. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:34:40 +02:00
Ingo Molnar	36dcd67ae9	ftrace: ignore functions that cannot be kprobe-ed kprobes already has an extensive list of annotations for functions that should not be instrumented. Add notrace annotations to these functions as well. This is particularly useful for functions called by the NMI path. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:34:22 +02:00
Mathieu Desnoyers	9795302acf	tracepoints: use TABLE_SIZE macro Steven Rostedt suggested: \| Wouldn't it look nicer to have: (TRACEPOINT_TABLE_SIZE - 1) ? Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:34:07 +02:00
Ingo Molnar	5f87f11218	tracing: clean up tracepoints kconfig structure do not expose users to CONFIG_TRACEPOINTS - tracers can select it just fine. update ftrace to select CONFIG_TRACEPOINTS. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:33:32 +02:00
Mathieu Desnoyers	b07c3f193a	ftrace: port to tracepoints Porting the trace_mark() used by ftrace to tracepoints. (cleanup) Changelog : - Change error messages : marker -> tracepoint [ mingo@elte.hu: conflict resolutions ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: 'Peter Zijlstra' <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:32:26 +02:00
Mathieu Desnoyers	0a16b60758	tracing, sched: LTTng instrumentation - scheduler Instrument the scheduler activity (sched_switch, migration, wakeups, wait for a task, signal delivery) and process/thread creation/destruction (fork, exit, kthread stop). Actually, kthread creation is not instrumented in this patch because it is architecture dependent. It allows to connect tracers such as ftrace which detects scheduling latencies, good/bad scheduler decisions. Tools like LTTng can export this scheduler information along with instrumentation of the rest of the kernel activity to perform post-mortem analysis on the scheduler activity. About the performance impact of tracepoints (which is comparable to markers), even without immediate values optimizations, tests done by Hideo Aoki on ia64 show no regression. His test case was using hackbench on a kernel where scheduler instrumentation (about 5 events in code scheduler code) was added. See the "Tracepoints" patch header for performance result detail. Changelog : - Change instrumentation location and parameter to match ftrace instrumentation, previously done with kernel markers. [ mingo@elte.hu: conflict resolutions ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: 'Peter Zijlstra' <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:30:52 +02:00
Mathieu Desnoyers	97e1c18e8d	tracing: Kernel Tracepoints Implementation of kernel tracepoints. Inspired from the Linux Kernel Markers. Allows complete typing verification by declaring both tracing statement inline functions and probe registration/unregistration static inline functions within the same macro "DEFINE_TRACE". No format string is required. See the tracepoint Documentation and Samples patches for usage examples. Taken from the documentation patch : "A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or "off" (no probe is attached). When a tracepoint is "off" it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a tracepoint is "on", the function you provide is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site). You can put tracepoints at important locations in the code. They are lightweight hooks that can pass an arbitrary number of parameters, which prototypes are described in a tracepoint declaration placed in a header file." Addition and removal of tracepoints is synchronized by RCU using the scheduler (and preempt_disable) as guarantees to find a quiescent state (this is really RCU "classic"). The update side uses rcu_barrier_sched() with call_rcu_sched() and the read/execute side uses "preempt_disable()/preempt_enable()". We make sure the previous array containing probes, which has been scheduled for deletion by the rcu callback, is indeed freed before we proceed to the next update. It therefore limits the rate of modification of a single tracepoint to one update per RCU period. The objective here is to permit fast batch add/removal of probes on _different_ tracepoints. Changelog : - Use #name ":" #proto as string to identify the tracepoint in the tracepoint table. This will make sure not type mismatch happens due to connexion of a probe with the wrong type to a tracepoint declared with the same name in a different header. - Add tracepoint_entry_free_old. - Change __TO_TRACE to get rid of the 'i' iterator. Masami Hiramatsu <mhiramat@redhat.com> : Tested on x86-64. Performance impact of a tracepoint : same as markers, except that it adds about 70 bytes of instructions in an unlikely branch of each instrumented function (the for loop, the stack setup and the function call). It currently adds a memory read, a test and a conditional branch at the instrumentation site (in the hot path). Immediate values will eventually change this into a load immediate, test and branch, which removes the memory read which will make the i-cache impact smaller (changing the memory read for a load immediate removes 3-4 bytes per site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it also saves the d-cache hit). About the performance impact of tracepoints (which is comparable to markers), even without immediate values optimizations, tests done by Hideo Aoki on ia64 show no regression. His test case was using hackbench on a kernel where scheduler instrumentation (about 5 events in code scheduler code) was added. Quoting Hideo Aoki about Markers : I evaluated overhead of kernel marker using linux-2.6-sched-fixes git tree, which includes several markers for LTTng, using an ia64 server. While the immediate trace mark feature isn't implemented on ia64, there is no major performance regression. So, I think that we don't have any issues to propose merging marker point patches into Linus's tree from the viewpoint of performance impact. I prepared two kernels to evaluate. The first one was compiled without CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS. I downloaded the original hackbench from the following URL: http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c I ran hackbench 5 times in each condition and calculated the average and difference between the kernels. The parameter of hackbench: every 50 from 50 to 800 The number of CPUs of the server: 2, 4, and 8 Below is the results. As you can see, major performance regression wasn't found in any case. Even if number of processes increases, differences between marker-enabled kernel and marker- disabled kernel doesn't increase. Moreover, if number of CPUs increases, the differences doesn't increase either. Curiously, marker-enabled kernel is better than marker-disabled kernel in more than half cases, although I guess it comes from the difference of memory access pattern. * 2 CPUs Number of \| without \| with \| diff \| diff \| processes \| Marker [Sec] \| Marker [Sec] \| [Sec] \| [%] \| -------------------------------------------------------------- 50 \| 4.811 \| 4.872 \| +0.061 \| +1.27 \| 100 \| 9.854 \| 10.309 \| +0.454 \| +4.61 \| 150 \| 15.602 \| 15.040 \| -0.562 \| -3.6 \| 200 \| 20.489 \| 20.380 \| -0.109 \| -0.53 \| 250 \| 25.798 \| 25.652 \| -0.146 \| -0.56 \| 300 \| 31.260 \| 30.797 \| -0.463 \| -1.48 \| 350 \| 36.121 \| 35.770 \| -0.351 \| -0.97 \| 400 \| 42.288 \| 42.102 \| -0.186 \| -0.44 \| 450 \| 47.778 \| 47.253 \| -0.526 \| -1.1 \| 500 \| 51.953 \| 52.278 \| +0.325 \| +0.63 \| 550 \| 58.401 \| 57.700 \| -0.701 \| -1.2 \| 600 \| 63.334 \| 63.222 \| -0.112 \| -0.18 \| 650 \| 68.816 \| 68.511 \| -0.306 \| -0.44 \| 700 \| 74.667 \| 74.088 \| -0.579 \| -0.78 \| 750 \| 78.612 \| 79.582 \| +0.970 \| +1.23 \| 800 \| 85.431 \| 85.263 \| -0.168 \| -0.2 \| -------------------------------------------------------------- * 4 CPUs Number of \| without \| with \| diff \| diff \| processes \| Marker [Sec] \| Marker [Sec] \| [Sec] \| [%] \| -------------------------------------------------------------- 50 \| 2.586 \| 2.584 \| -0.003 \| -0.1 \| 100 \| 5.254 \| 5.283 \| +0.030 \| +0.56 \| 150 \| 8.012 \| 8.074 \| +0.061 \| +0.76 \| 200 \| 11.172 \| 11.000 \| -0.172 \| -1.54 \| 250 \| 13.917 \| 14.036 \| +0.119 \| +0.86 \| 300 \| 16.905 \| 16.543 \| -0.362 \| -2.14 \| 350 \| 19.901 \| 20.036 \| +0.135 \| +0.68 \| 400 \| 22.908 \| 23.094 \| +0.186 \| +0.81 \| 450 \| 26.273 \| 26.101 \| -0.172 \| -0.66 \| 500 \| 29.554 \| 29.092 \| -0.461 \| -1.56 \| 550 \| 32.377 \| 32.274 \| -0.103 \| -0.32 \| 600 \| 35.855 \| 35.322 \| -0.533 \| -1.49 \| 650 \| 39.192 \| 38.388 \| -0.804 \| -2.05 \| 700 \| 41.744 \| 41.719 \| -0.025 \| -0.06 \| 750 \| 45.016 \| 44.496 \| -0.520 \| -1.16 \| 800 \| 48.212 \| 47.603 \| -0.609 \| -1.26 \| -------------------------------------------------------------- * 8 CPUs Number of \| without \| with \| diff \| diff \| processes \| Marker [Sec] \| Marker [Sec] \| [Sec] \| [%] \| -------------------------------------------------------------- 50 \| 2.094 \| 2.072 \| -0.022 \| -1.07 \| 100 \| 4.162 \| 4.273 \| +0.111 \| +2.66 \| 150 \| 6.485 \| 6.540 \| +0.055 \| +0.84 \| 200 \| 8.556 \| 8.478 \| -0.078 \| -0.91 \| 250 \| 10.458 \| 10.258 \| -0.200 \| -1.91 \| 300 \| 12.425 \| 12.750 \| +0.325 \| +2.62 \| 350 \| 14.807 \| 14.839 \| +0.032 \| +0.22 \| 400 \| 16.801 \| 16.959 \| +0.158 \| +0.94 \| 450 \| 19.478 \| 19.009 \| -0.470 \| -2.41 \| 500 \| 21.296 \| 21.504 \| +0.208 \| +0.98 \| 550 \| 23.842 \| 23.979 \| +0.137 \| +0.57 \| 600 \| 26.309 \| 26.111 \| -0.198 \| -0.75 \| 650 \| 28.705 \| 28.446 \| -0.259 \| -0.9 \| 700 \| 31.233 \| 31.394 \| +0.161 \| +0.52 \| 750 \| 34.064 \| 33.720 \| -0.344 \| -1.01 \| 800 \| 36.320 \| 36.114 \| -0.206 \| -0.57 \| -------------------------------------------------------------- Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: 'Peter Zijlstra' <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-14 10:28:28 +02:00
Linus Torvalds	20272c8994	Merge branch 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc * 'proc' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc: proc: remove kernel.maps_protect proc: remove now unneeded ADDBUF macro [PATCH] proc: show personality via /proc/pid/personality [PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock() proc: move PROC_PAGE_MONITOR to fs/proc/Kconfig proc: make grab_header() static proc: remove unused get_dma_list() proc: remove dummy vmcore_open() proc: proc_sys_root tweak proc: fix return value of proc_reg_open() in "too late" case Fixed up trivial conflict in removed file arch/sparc/include/asm/dma_32.h	2008-10-13 10:04:04 -07:00
Alan Cox	dbda4c0b97	tty: Fix abusers of current->sighand->tty Various people outside the tty layer still stick their noses in behind the scenes. We need to make sure they also obey the locking and referencing rules. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:42 -07:00
Alan Cox	95f9bfc6b7	tty: Move tty_write_message out of kernel/printk This is pure tty code so put it in the tty layer where it can be with the locking relevant material it uses Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:41 -07:00
Alan Cox	9c9f4ded90	tty: Add a kref count Introduce a kref to the tty structure and use it to protect the tty->signal tty references. For now we don't introduce it for anything else. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-13 09:51:40 -07:00
Arjan van de Ven	dc4304f7de	rangetimers: fix the bug reported by Ingo for real and please hand me a brown paper bag (thanks to Thomas for pointing out this very obvious bug) Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-10-13 11:08:34 -04:00
David S. Miller	56c5d900db	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: sound/core/memalloc.c	2008-10-11 12:39:35 -07:00
Arjan van de Ven	030aebd2e4	rangetimer: fix BUG_ON reported by Ingo There's a small race/chance that, while hrtimers are enabled globally, they're later not enabled when we're calling the hrtimer_interrupt() function, which then BUG_ON()'s for that. This patch closes that race/gap. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-10-11 12:25:45 -07:00
Linus Torvalds	ead9d23d80	Merge phase #4 (X2APIC, APIC unification, CPU identification unification) of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-v28-for-linus-phase4-D' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (186 commits) x86, debug: print more information about unknown CPUs x86 setup: handle more than 8 CPU flag words x86: cpuid, fix typo x86: move transmeta cap read to early_init_transmeta() x86: identify_cpu_without_cpuid v2 x86: extended "flags" to show virtualization HW feature in /proc/cpuinfo x86: move VMX MSRs to msr-index.h x86: centaur_64.c remove duplicated setting of CONSTANT_TSC x86: intel.c put workaround for old cpus together x86: let intel 64-bit use intel.c x86: make intel_64.c the same as intel.c x86: make intel.c have 64-bit support code x86: little clean up of intel.c/intel_64.c x86: make 64 bit to use amd.c x86: make amd_64 have 32 bit code x86: make amd.c have 64bit support code x86: merge header in amd_64.c x86: add srat_detect_node for amd64 x86: remove duplicated force_mwait x86: cpu make amd.c more like amd_64.c v2 ...	2008-10-11 11:51:16 -07:00
Ingo Molnar	0afe2db213	Merge branch 'x86/unify-cpu-detect' into x86-v28-for-linus-phase4-D Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/signal_64.c include/asm-x86/cpufeature.h	2008-10-11 20:23:20 +02:00
Ingo Molnar	d84705969f	Merge branch 'x86/apic' into x86-v28-for-linus-phase4-B Conflicts: arch/x86/kernel/apic_32.c arch/x86/kernel/apic_64.c arch/x86/kernel/setup.c drivers/pci/intel-iommu.c include/asm-x86/cpufeature.h include/asm-x86/dma-mapping.h	2008-10-11 20:17:36 +02:00
Linus Torvalds	bf6f51e3a4	Merge phase #3 (IOMMU) of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-v28-for-linus-phase3-B' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (74 commits) AMD IOMMU: use iommu_device_max_index, fix AMD IOMMU: use iommu_device_max_index x86: add PCI IDs for AMD Barcelona PCI devices x86/iommu: use __GFP_ZERO instead of memset for GART x86/iommu: convert GART need_flush to bool x86/iommu: make GART driver checkpatch clean x86 gart: remove unnecessary initialization x86: restore old GART alloc_coherent behavior revert "x86: make GART to respect device's dma_mask about virtual mappings" x86: export pci-nommu's alloc_coherent iommu: remove fullflush and nofullflush in IOMMU generic option x86: remove set_bit_string() iommu: export iommu_area_reserve helper function AMD IOMMU: use coherent_dma_mask in alloc_coherent add AMD IOMMU tree to MAINTAINERS file AMD IOMMU: use cmd_buf_size when freeing the command buffer AMD IOMMU: calculate IVHD size with a function AMD IOMMU: remove unnecessary cast to u64 in the init code AMD IOMMU: free domain bitmap with its allocation order AMD IOMMU: simplify dma_mask_to_pages ...	2008-10-11 11:03:12 -07:00
Ingo Molnar	5b16a2212f	Merge branch 'sched/clock' into sched/urgent	2008-10-11 18:50:45 +02:00
Linus Torvalds	098ef215b1	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq: [CPUFREQ] Fix BUG: using smp_processor_id() in preemptible code [CPUFREQ] Don't export governors for default governor [CPUFREQ][6/6] cpufreq: Add idle microaccounting in ondemand governor [CPUFREQ][5/6] cpufreq: Changes to get_cpu_idle_time_us(), used by ondemand governor [CPUFREQ][4/6] cpufreq_ondemand: Parameterize down differential [CPUFREQ][3/6] cpufreq: get_cpu_idle_time() changes in ondemand for idle-microaccounting [CPUFREQ][2/6] cpufreq: Change load calculation in ondemand for software coordination [CPUFREQ][1/6] cpufreq: Add cpu number parameter to __cpufreq_driver_getavg() [CPUFREQ] use deferrable delayed work init in conservative governor [CPUFREQ] drivers/cpufreq/cpufreq.c: Adjust error handling code involving cpufreq_cpu_put [CPUFREQ] add error handling for cpufreq_register_governor() error [CPUFREQ] acpi-cpufreq: add error handling for cpufreq_register_driver() error [CPUFREQ] Coding style fixes to arch/x86/kernel/cpu/cpufreq/powernow-k6.c [CPUFREQ] Coding style fixes to arch/x86/kernel/cpu/cpufreq/elanfreq.c	2008-10-11 08:49:34 -07:00
Greg Kroah-Hartman	061b1bd394	Staging: add TAINT_CRAP for all drivers/staging code We need to add a flag for all code that is in the drivers/staging/ directory to prevent all other kernel developers from worrying about issues here, and to notify users that the drivers might not be as good as they are normally used to. Based on code from Andreas Gruenbacher and Jeff Mahoney to provide a TAINT flag for the support level of a kernel module in the Novell enterprise kernel release. This is the kernel portion of this feature, the ability for the flag to be set needs to be done in the build process and will happen in a follow-up patch. Cc: Andreas Gruenbacher <agruen@suse.de> Cc: Jeff Mahoney <jeffm@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-10-10 15:31:05 -07:00
Linus Torvalds	b922df7383	Merge branch 'rcu-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'rcu-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (21 commits) rcu: RCU-based detection of stalled CPUs for Classic RCU, fix rcu: RCU-based detection of stalled CPUs for Classic RCU rcu: add rcu_read_lock_sched() / rcu_read_unlock_sched() rcu: fix sparse shadowed variable warning doc/RCU: fix pseudocode in rcuref.txt rcuclassic: fix compiler warning rcu: use irq-safe locks rcuclassic: fix compilation NG rcu: fix locking cleanup fallout rcu: remove redundant ACCESS_ONCE definition from rcupreempt.c rcu: fix classic RCU locking cleanup lockdep problem rcu: trace fix possible mem-leak rcu: just rename call_rcu_bh instead of making it a macro rcu: remove list_for_each_rcu() rcu: fixes to include/linux/rcupreempt.h rcu: classic RCU locking and memory-barrier cleanups rcu: prevent console flood when one CPU sees another AWOL via RCU rcu, debug: detect stalled grace periods, cleanups rcu, debug: detect stalled grace periods rcu classic: new algorithm for callbacks-processing(v2) ...	2008-10-10 13:10:51 -07:00
Ingo Molnar	725c25819e	Merge branches 'core/iommu', 'x86/amd-iommu' and 'x86/iommu' into x86-v28-for-linus-phase3-B Conflicts: arch/x86/kernel/pci-gart_64.c include/asm-x86/dma-mapping.h	2008-10-10 19:47:12 +02:00
Dave Kleikamp	5b7dba4ff8	sched_clock: prevent scd->clock from moving backwards When sched_clock_cpu() couples the clocks between two cpus, it may increment scd->clock beyond the GTOD tick window that __update_sched_clock() uses to clamp the clock. A later call to __update_sched_clock() may move the clock back to scd->tick_gtod + TICK_NSEC, violating the clock's monotonic property. This patch ensures that scd->clock will not be set backward. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-10 11:17:04 +02:00
Alexey Dobriyan	3bbfe05967	proc: remove kernel.maps_protect After commit `831830b5a2` aka "restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace" sysctl stopped being relevant because commit moved security checks from ->show time to ->start time (mm_for_maps()). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Kees Cook <kees.cook@canonical.com>	2008-10-10 04:24:51 +04:00
Lai Jiangshan	a6bebbc87a	[PATCH] signal, procfs: some lock_task_sighand() users do not need rcu_read_lock() lock_task_sighand() make sure task->sighand is being protected, so we do not need rcu_read_lock(). [ exec() will get task->sighand->siglock before change task->sighand! ] But code using rcu_read_lock() _just_ to protect lock_task_sighand() only appear in procfs. (and some code in procfs use lock_task_sighand() without such redundant protection.) Other subsystem may put lock_task_sighand() into rcu_read_lock() critical region, but these rcu_read_lock() are used for protecting "for_each_process()", "find_task_by_vpid()" etc. , not for protecting lock_task_sighand(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> [ok from Oleg] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>	2008-10-10 04:18:57 +04:00
venkatesh.pallipadi@intel.com	8083e4ad97	[CPUFREQ][5/6] cpufreq: Changes to get_cpu_idle_time_us(), used by ondemand governor export get_cpu_idle_time_us() for it to be used in ondemand governor. Last update time can be current time when the CPU is currently non-idle, accounting for the busy time since last idle. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Dave Jones <davej@redhat.com>	2008-10-09 13:52:44 -04:00
Ingo Molnar	a5d8c3483a	sched debug: add name to sched_domain sysctl entries add /proc/sys/kernel/sched_domain/cpu0/domain0/name, to make it easier to see which specific scheduler domain remained at that entry. Since we process the scheduler domain tree and simplify it, it's not always immediately clear during debugging which domain came from where. depends on CONFIG_SCHED_DEBUG=y. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-09 17:13:06 +02:00
Ingo Molnar	cdbb92b31d	Merge branch 'linus' into core/rcu	2008-10-09 00:17:25 +02:00
Peter Zijlstra	2fb7635c4c	sched: sync wakeups vs avg_overlap While looking at the code I wondered why we always do: sync && avg_overlap < migration_cost Which is a bit odd, since the overlap test was meant to detect sync wakeups so using it to specialize sync wakeups doesn't make much sense. Hence change the code to do: sync \|\| avg_overlap < migration_cost Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-08 12:20:26 +02:00
Ingo Molnar	990d0f2ced	Merge branches 'sched/devel', 'sched/cpu-hotplug', 'sched/cpusets' and 'sched/urgent' into sched/core	2008-10-08 11:31:02 +02:00
Jason Wessel	cc1e0f4f7a	kgdb: call touch_softlockup_watchdog on resume The softlockup watchdog needs to be touched when resuming the from the kgdb stopped state to avoid the printk that a CPU is stuck if the debugger was active for longer than the softlockup threshold. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2008-10-06 13:50:59 -05:00
Li Zefan	34b3ede235	sched: remove redundant code in cpu_cgroup_create() css will be initialized by cgroup core. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-06 08:13:34 +02:00
Ingo Molnar	2c10c22af0	Merge branch 'linus' into sched/devel	2008-10-06 08:13:18 +02:00
Dario Faggioli	f6121f4f87	sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq While working on the new version of the code for SCHED_SPORADIC I noticed something strange in the present throttling mechanism. More specifically in the throttling timer handler in sched_rt.c (do_sched_rt_period_timer()) and in rt_rq_enqueue(). The problem is that, when unthrottling a runqueue, rt_rq_enqueue() only asks for rescheduling if the runqueue has a sched_entity associated to it (i.e., rt_rq->rt_se != NULL). Now, if the runqueue is the root rq (which has a rt_se = NULL) rescheduling does not take place, and it is delayed to some undefined instant in the future. This imply some random bandwidth usage by the RT tasks under throttling. For instance, setting rt_runtime_us/rt_period_us = 950ms/1000ms an RT task will get less than 95%. In our tests we got something varying between 70% to 95%. Using smaller time values, e.g., 95ms/100ms, things are even worse, and I can see values also going down to 20-25%!! The tests we performed are simply running 'yes' as a SCHED_FIFO task, and checking the CPU usage with top, but we can investigate thoroughly if you think it is needed. Things go much better, for us, with the attached patch... Don't know if it is the best approach, but it solved the issue for us. Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-04 14:31:54 +02:00
Thomas Gleixner	07454bfff1	clockevents: check broadcast tick device not the clock events device Impact: jiffies increment too fast. Hugh Dickins noted that with NOHZ=n and HIGHRES=n jiffies get incremented too fast. The reason is a wrong check in the broadcast enter/exit code, which keeps the local apic timer in periodic mode when the switch happens. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-10-04 10:51:07 +02:00
Frederic Weisbecker	d294eb83d8	cpusets: scan_for_empty_cpusets(), cpuset doesn't seem to be so const This fixes a warning on latest -tip: kernel/cpuset.c: Dans la fonction «scan_for_empty_cpusets» : kernel/cpuset.c:1932: attention : passing argument 1 of «list_add_tail» discards qualifiers from pointer target type Actually the struct cpuset *root passed in parameter to scan_for_empty_cpusets is not supposed to be const since an entry is added on the tail of its list. Just correct the qualifier. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-03 13:39:50 +02:00
Frederic Weisbecker	77af7e3403	softirq, warning fix: correct a format to avoid a warning Last -tip gives this warning: kernel/softirq.c: Dans la fonction «__do_softirq» : kernel/softirq.c:216: attention : format «%ld» expects type «long int», but argument 2 has type «int» This patch corrects the format type, and a small mistake in the "softirq" word. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-03 11:41:41 +02:00
Ingo Molnar	2ec2b482b1	rcu: RCU-based detection of stalled CPUs for Classic RCU, fix fix the !CONFIG_RCU_CPU_STALL_DETECTOR path: kernel/rcuclassic.c: In function '__rcu_pending': kernel/rcuclassic.c:609: error: too few arguments to function 'check_cpu_stall' Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-03 10:41:00 +02:00
Paul E. McKenney	2133b5d7ff	rcu: RCU-based detection of stalled CPUs for Classic RCU This patch adds stalled-CPU detection to Classic RCU. This capability is enabled by a new config variable CONFIG_RCU_CPU_STALL_DETECTOR, which defaults disabled. This is a debugging feature to detect infinite loops in kernel code, not something that non-kernel-hackers would be expected to care about. This feature can detect looping CPUs in !PREEMPT builds and looping CPUs with preemption disabled in PREEMPT builds. This is essentially a port of this functionality from the treercu patch, replacing the stall debug patch that is already in tip/core/rcu (commit `67182ae1c4`). The changes from the patch in tip/core/rcu include making the config variable name match that in treercu, changing from seconds to jiffies to avoid spurious warnings, and printing a boot message when this feature is enabled. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-03 10:36:08 +02:00
Ingo Molnar	b5259d9442	Merge commit 'v2.6.27-rc8' into core/rcu	2008-10-03 10:34:36 +02:00
Dan Carpenter	aa94fbd5cc	fix error-path NULL deref in alloc_posix_timer() Found by static checker (http://repo.or.cz/w/smatch.git). Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-10-02 15:53:13 -07:00
Thomas Gleixner	8e85b4b553	softirqs, debug: preemption check if a preempt count leaks out of a softirq handler it can be very hard to figure it out. Add a debug check for this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-02 10:58:04 +02:00
David Brownell	0c5d1eb77a	genirq: record trigger type Genirq hasn't previously recorded the trigger type used by any given IRQ, although some irq_chip support has done so. That data can be useful when troubleshooting. This patch records it in the relevant irq_desc.status bits, and improves consistency between the two driver-visible calls affected: - Make set_irq_type() usage match request_irq() usage: * IRQ_TYPE_NONE should be a NOP; succeed, so irq_chip methods won't have to handle that case any more (many do it wrong). * IRQ_TYPE_PROBE is ignored; any buggy out-of-tree callers might need to switch over to the real IRQ probing code. * emit the same diagnostics (from shared utility code) - Their kerneldoc now reflects usage: * request_irq() flags include IRQF_TRIGGER_* to specify active edge(s)/level ... docs previously omitted that * set_irq_type() is declared in <linux/irq.h> so callers should use the (bit-equivalent) IRQ_TYPE_* symbols there Also: adds a warning about shared IRQs that don't end up using the requested trigger mode; and fix an unrelated "sparse" warning. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-10-02 10:24:09 +02:00
Ingo Molnar	d6d5aeb661	Merge commit 'v2.6.27-rc8' into genirq	2008-10-02 10:21:26 +02:00
Linus Torvalds	cf4b0b2c95	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: prevent migration of per CPU hrtimers hrtimer: mark migration state hrtimer: fix migration of CB_IRQSAFE_NO_SOFTIRQ hrtimers hrtimer: migrate pending list on cpu offline Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2008-09-30 08:39:28 -07:00
Amit K. Arora	64b9e0294d	sched: minor optimizations in wake_affine and select_task_rq_fair This patch does following: o Removes unused variable and argument "rq". o Optimizes one of the "if" conditions in wake_affine() - i.e. if "balanced" is true, we need not do rest of the calculations in the condition. o If this cpu is same as the previous cpu (on which woken up task was running when it went to sleep), no need to call wake_affine at all. Signed-off-by: Amit K Arora <aarora@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-30 15:25:44 +02:00
Nick Piggin	7317d7b87e	sched: improve preempt debugging This patch helped me out with a problem I recently had.... Basically, when the kernel lock is held, then preempt_count underflow does not get detected until it is released which may be a long time (and arbitrarily, eg at different points it may be rescheduled). If the bkl is released at schedule, the resulting output is actually fairly cryptic... With any other lock that elevates preempt_count, it is illegal to schedule under it (which would get found pretty quickly). bkl allows scheduling with preempt_count elevated, which makes underflows hard to debug. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-30 12:56:25 +02:00
Peter Zijlstra	42569c3991	futex: fixup get_futex_key() for private futexes With the get_user_pages_fast() patches we made get_futex_key() obtain a reference on the returned key, but failed to do so for private futexes. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-30 12:36:02 +02:00
Peter Zijlstra	c2f9f20154	futex: cleanup fshared fshared doesn't need to be a rw_sem pointer anymore, so clean that up. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-30 12:35:54 +02:00
Peter Zijlstra	734b05b10e	futex: use fast_gup() Change the get_user_pages() call with fast_gup() which doesn't require holding the mmap_sem thereby removing the mmap_sem from all fast paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-30 12:35:46 +02:00
Peter Zijlstra	61270708ec	futex: reduce mmap_sem usage now that we rely on get_user_pages() for the shared key handling move all the mmap_sem stuff closely around the slow paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-30 12:35:36 +02:00
Peter Zijlstra	38d47c1b70	futex: rely on get_user_pages() for shared futexes On the way of getting rid of the mmap_sem requirement for shared futexes, start by relying on get_user_pages(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-30 12:35:20 +02:00
Ingo Molnar	1508487e7f	timers: fix itimer/many thread hang, fix fix bogus rq dereference: v3 removed the locking but also removed the rq initialization. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-30 08:28:17 +02:00
Thomas Petazzoni	bfcd17a6c5	Configure out file locking features This patch adds the CONFIG_FILE_LOCKING option which allows to remove support for advisory locks. With this patch enabled, the flock() system call, the F_GETLK, F_SETLK and F_SETLKW operations of fcntl() and NFS support are disabled. These features are not necessarly needed on embedded systems. It allows to save ~11 Kb of kernel code and data: text data bss dec hex filename `1125436` 118764 212992 1457192 163c28 vmlinux.old 1114299 118564 212992 1445855 160fdf vmlinux -11137 -200 0 -11337 -2C49 +/- This patch has originally been written by Matt Mackall <mpm@selenic.com>, and is part of the Linux Tiny project. Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: matthew@wil.cx Cc: linux-fsdevel@vger.kernel.org Cc: mpm@selenic.com Cc: akpm@linux-foundation.org Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-29 17:56:57 -04:00
Balbir Singh	31a78f23ba	mm owner: fix race between swapoff and exit There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration. CPU0 CPU1 try_to_unuse looks at mm = task0->mm increments mm->mm_users task 0 exits mm->owner needs to be updated, but no new owner is found (mm_users > 1, but no other task has task->mm = task0->mm) mm_update_next_owner() leaves mmput(mm) decrements mm->mm_users task0 freed dereferencing mm->owner fails The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL. Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops. Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops. Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-29 08:41:47 -07:00
Thomas Gleixner	ccc7dadf73	hrtimer: prevent migration of per CPU hrtimers Impact: per CPU hrtimers can be migrated from a dead CPU The hrtimer code has no knowledge about per CPU timers, but we need to prevent the migration of such timers and warn when such a timer is active at migration time. Explicitely mark the timers as per CPU and use a more understandable mode descriptor for the interrupts safe unlocked callback mode, which is used by hrtimer_sleeper and the scheduler code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-29 17:09:14 +02:00
Thomas Gleixner	b00c1a99e7	hrtimer: mark migration state Impact: during migration active hrtimers can be seen as inactive The migration code removes the hrtimers from the queues of the dead CPU and sets the state temporary to INACTIVE. The enqueue code sets it to ACTIVE/PENDING again. Prevent that the wrong state can be seen by using a separate migration state bit. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-29 17:09:14 +02:00
Thomas Gleixner	41e1022eae	hrtimer: fix migration of CB_IRQSAFE_NO_SOFTIRQ hrtimers Impact: Stale timers after a CPU went offline. commit `37bb6cb409` hrtimer: unlock hrtimer_wakeup changed the hrtimer sleeper callback mode to CB_IRQSAFE_NO_SOFTIRQ due to locking problems. A result of this change is that when enqueue is called for an already expired hrtimer the callback function is not longer called directly from the enqueue code. The normal callers have been fixed in the code, but the migration code which moves hrtimers from a dead CPU to a live CPU was not made aware of this. This can be fixed by checking the timer state after the call to enqueue in the migration code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-29 17:09:14 +02:00
Thomas Gleixner	7659e34967	hrtimer: migrate pending list on cpu offline Impact: hrtimers which are on the pending list are not migrated at cpu offline and can be stale forever Add the pending list migration when CONFIG_HIGH_RES_TIMERS is enabled Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-29 17:09:13 +02:00
Frank Mayhar	7086efe1c1	timers: fix itimer/many thread hang, v3 - fix UP lockup - another set of UP/SMP cleanups and simplifications Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-27 20:04:45 +02:00
Jason Wessel	d7161a6534	kgdb, x86, arm, mips, powerpc: ignore user space single stepping On the x86 arch, user space single step exceptions should be ignored if they occur in the kernel space, such as ptrace stepping through a system call. First check if it is kgdb that is executing a single step, then ensure it is not an accidental traversal into the user space, while in kgdb, any other time the TIF_SINGLESTEP is set, kgdb should ignore the exception. On x86, arm, mips and powerpc, the kgdb_contthread usage was inconsistent with the way single stepping is implemented in the kgdb core. The arch specific stub should always set the kgdb_cpu_doing_single_step correctly if it is single stepping. This allows kgdb to correctly process an instruction steps if ptrace happens to be requesting an instruction step over a system call. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2008-09-26 10:36:41 -05:00
Atsuo Igarashi	18d6522b86	kgdb: could not write to the last of valid memory with kgdb On the ARM architecture, kgdb will crash the kernel if the last byte of valid memory is written due to a flush_icache_range flushing beyond the memory boundary. Signed-off-by: Atsuo Igarashi <atsuo_igarashi@tripeaks.co.jp> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2008-09-26 10:36:41 -05:00
Ingo Molnar	13eb83754b	IO resources, x86: ioremap sanity check to catch mapping requests exceeding, fix fix this build error: kernel/resource.c: In function 'iomem_map_sanity_check': kernel/resource.c:842: error: implicit declaration of function 'r_next' kernel/resource.c:842: warning: assignment makes pointer from integer without a cast r_next() was only available if CONFIG_PROCFS was enabled. and fix this build warning: kernel/resource.c:855: warning: format '%llx' expects type 'long long unsigned int', but argument 2 has type 'resource_size_t' kernel/resource.c:855: warning: format '%llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int' kernel/resource.c:855: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'resource_size_t' kernel/resource.c:855: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'resource_size_t' resource_t can be 32 bits. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-26 10:10:12 +02:00
Suresh Siddha	379daf6290	IO resources, x86: ioremap sanity check to catch mapping requests exceeding the BAR sizes Go through the iomem resource tree to check if any of the ioremap() requests span more than any slot in the iomem resource tree and do a WARN_ON() if we hit this check. This will raise a red-flag, if some driver is mapping more than what is needed. And hopefully identify possible corruptions much earlier. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-26 09:42:20 +02:00
Bharata B Rao	b87f17242d	sched: maintain only task entities in cfs_rq->tasks list cfs_rq->tasks list is used by the load balancer to iterate over all the tasks. Currently it holds all the entities (both task and group entities) because of which there is a need to check for group entities explicitly during load balancing. This patch changes the cfs_rq->tasks list to hold only task entities. Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-25 11:24:11 +02:00
Roman Zippel	d40e944c25	ntp: improve adjtimex frequency rounding Change PPM_SCALE_INV_SHIFT so that it doesn't throw away any input bits (19 is the amount of the factor 2 in PPM_SCALE), the output frequency can then be calculated back to its input value, as the inverse divide produce a slightly larger value, which is then correctly rounded by the final shift. Reported-by: Martin Ziegler <ziegler@uni-freiburg.de> Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 17:33:13 +02:00
Roman Zippel	5cd1c9c5cf	timekeeping: fix rounding problem during clock update Due to a rounding problem during a clock update it's possible for readers to observe the clock jumping back by 1nsec. The following simplified example demonstrates the problem: cycle xtime 0 0 1000 999999.6 2000 1999999.2 3000 2999998.8 ... 1500 = 1499999.4 = 0.0 + 1499999.4 = 999999.6 + 499999.8 When reading the clock only the full nanosecond part is used, while timekeeping internally keeps nanosecond fractions. If the clock is now updated at cycle 1500 here, a nanosecond is missing due to the truncation. The simple fix is to round up the xtime value during the update, this also changes the distance to the reference time, but the adjustment will automatically take care that it stays under control. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 17:33:13 +02:00
Maciej W. Rozycki	eb3f938fd6	ntp: let update_persistent_clock() sleep This is a change that makes the 11-minute RTC update be run in the process context. This is so that update_persistent_clock() can sleep, which may be required for certain types of RTC hardware -- most notably I2C devices. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Rik van Riel <riel@redhat.com> Cc: David Brownell <david-b@pacbell.net> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 17:33:12 +02:00
Oleg Nesterov	31d9284569	posix-timers: lock_timer: make it readable Cleanup. Imho makes the code much more understandable. At least this patch lessens both the source and compiled code. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:48 +02:00
Oleg Nesterov	5a51b713cc	posix-timers: lock_timer: kill the bogus ->it_id check lock_timer() checks that the timer found by idr_find(timer_id) has ->it_id == timer_id. This buys nothing. This check can fail only if sys_timer_create() unlocked idr_lock after idr_get_new(), but didn't set ->it_id = new_timer_id yet. But in that case ->it_process == NULL so lock_timer() can't succeed anyway. Also remove a couple of unneeded typecasts. Note that with or without this patch we have a small problem. sys_timer_create() doesn't ensure that the result of setting (say) ->it_sigev_notify must be visible if lock_timer() succeeds. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:48 +02:00
Oleg Nesterov	5a9fa73072	posix-timers: kill ->it_sigev_signo and ->it_sigev_value With the recent changes ->it_sigev_signo and ->it_sigev_value are only used in sys_timer_create(), kill them. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:48 +02:00
Oleg Nesterov	ef864c9588	posix-timers: sys_timer_create: cleanup the error handling Cleanup. - sys_timer_create() is big and complicated. The code above the "out:" label relies on the fact that "error" must be == 0. This is not very robust, make the code more explicit. Remove the unneeded initialization of error. - If idr_get_new() succeeds (as it normally should), we check the returned value twice. Move the "-EAGAIN" check under "if (error)". Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:48 +02:00
Oleg Nesterov	717835d94d	posix-timers: move the initialization of timer->sigq from send to create path posix_timer_event() always populates timer->sigq with the same numbers, move this code into sys_timer_create(). Note that with this patch we can kill it_sigev_signo and it_sigev_value. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:48 +02:00
Oleg Nesterov	36b2f04600	posix-timers: sys_timer_create: simplify and s/tasklist/rcu/ - Change the code to do rcu_read_lock() instead of taking tasklist_lock, it is safe to get_task_struct(p) if p was found under RCU. However, now we must not use process's sighand/signal, they may be NULL. We can use current->sighand/signal instead, this "process" must belong to the current's thread-group. - Factor out the common code for 2 "if (timer_event_spec)" branches, the !timer_event_spec case can use current too. - use spin_lock_irq() instead of _irqsave(), kill "flags". Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:48 +02:00
Oleg Nesterov	2cd499e38e	posix-timers: sys_timer_create: remove the buggy PF_EXITING check sys_timer_create() return -EINVAL if the target thread has PF_EXITING. This doesn't really make sense, the sub-thread can die right after unlock. And in fact, this is just wrong. Without SIGEV_THREAD_ID good_sigevent() returns ->group_leader, and it is very possible that the leader is already dead. This is OK, we shouldn't return the error in this case. Remove this check and the comment. Note that the "process" was found under tasklist_lock, it must have ->sighand != NULL. Also, remove a couple of unneeded initializations. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:47 +02:00
Oleg Nesterov	918fc03728	posix-timers: always do get_task_struct(timer->it_process) Change the code to get/put timer->it_process regardless of SIGEV_THREAD_ID. This streamlines the create/destroy paths and allows us to simplify the usage of exit_itimers() in de_thread(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:47 +02:00
Oleg Nesterov	4aa7361179	posix-timers: don't switch to ->group_leader if ->it_process dies posix_timer_event() drops SIGEV_THREAD_ID and switches to ->group_leader if send_sigqueue() fails. This is not very useful and doesn't work reliably. send_sigqueue() can only fail if ->it_process is dead. But it can die before it dequeues the SI_TIMER signal, in that case the timer stops anyway. Remove this code. I guess it was needed a long ago to ensure that the timer is not destroyed when when its creator thread dies. Q: perhaps it makes sense to change sys_timer_settime() to return an error if ->it_process is dead? Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: mingo@elte.hu Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-24 15:45:47 +02:00
Linus Torvalds	8553f321e0	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: fix build error in !oneshot case x86: c1e_idle: don't mark TSC unstable if CPU has invariant TSC x86: prevent C-states hang on AMD C1E enabled machines clockevents: prevent mode mismatch on cpu online clockevents: check broadcast device not tick device clockevents: prevent stale tick_next_period for onlining CPUs x86: prevent stale state of c1e_mask across CPU offline/online clockevents: prevent cpu online to interfere with nohz	2008-09-23 14:57:36 -07:00
Linus Torvalds	be3be89058	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: fix init_hrtick() section mismatch warning	2008-09-23 14:57:22 -07:00
Jonathan Steel	f9092f358b	kexec: fix segmentation fault in kimage_add_entry A segmentation fault can occur in kimage_add_entry in kexec.c when loading a kernel image into memory. The fault occurs because a page is requested by calling kimage_alloc_page with gfp_mask GFP_KERNEL and the function may actually return a page with gfp_mask GFP_HIGHUSER. The high mem page is returned because it was swapped with the kernel page due to the kernel page being a page that will shortly be copied to. This patch ensures that kimage_alloc_page returns a page that was created with the correct gfp flags. I have verified the change and fixed the whitespace damage of the original patch. Jonathan did a great job of tracking this down after he hit the problem. -- Eric Signed-off-by: Jonathan Steel <jon.steel@esentire.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-23 08:09:14 -07:00
Peter Zijlstra	57fdc26d4a	sched: fixup buddy selection We should set the buddy even though we might already have the TIF_RESCHED flag set. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 16:23:17 +02:00
Peter Zijlstra	4653f803e6	sched: more sanity checks on the bandwidth settings While playing around with it, I noticed we missed some sanity checks. Also add some comments while we're there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 16:23:16 +02:00
Peter Zijlstra	78333cdd0e	sched: add some comments to the bandwidth code Hopefully clarify some of this code a little. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 16:23:16 +02:00
Peter Zijlstra	940959e939	sched: fixlet for group load balance We should not only correct the increment for the initial group, but should be consistent and do so for all the groups we encounter. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 16:23:15 +02:00
Ingo Molnar	63e5c39859	Merge branches 'sched/urgent' and 'sched/rt' into sched/devel	2008-09-23 16:23:05 +02:00
Peter Zijlstra	6918bc5c83	lockstat: fixup signed division Some recent modification to this code made me notice the little todo mark. Now that we have more elaborate 64-bit division functions this isn't hard. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 16:19:37 +02:00
Peter Zijlstra	6956985009	sched: rework wakeup preemption Rework the wakeup preemption to work on real runtime instead of the virtual runtime. This greatly simplifies the code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 14:54:23 +02:00
Frank Mayhar	bb34d92f64	timers: fix itimer/many thread hang, v2 This is the second resubmission of the posix timer rework patch, posted a few days ago. This includes the changes from the previous resubmittion, which addressed Oleg Nesterov's comments, removing the RCU stuff from the patch and un-inlining the thread_group_cputime() function for SMP. In addition, per Ingo Molnar it simplifies the UP code, consolidating much of it with the SMP version and depending on lower-level SMP/UP handling to take care of the differences. It also cleans up some UP compile errors, moves the scheduler stats-related macros into kernel/sched_stats.h, cleans up a merge error in kernel/fork.c and has a few other minor fixes and cleanups as suggested by Oleg and Ingo. Thanks for the review, guys. Signed-off-by: Frank Mayhar <fmayhar@google.com> Cc: Roland McGrath <roland@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 13:38:44 +02:00
Ingo Molnar	f8e256c687	timers: fix build error in !oneshot case kernel/time/tick-common.c: In function ‘tick_setup_periodic’: kernel/time/tick-common.c:113: error: implicit declaration of function ‘tick_broadcast_oneshot_active’ Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 12:57:00 +02:00
Thomas Gleixner	27ce4cb4a0	clockevents: prevent mode mismatch on cpu online Impact: timer hang on CPU online observed on AMD C1E systems When a CPU is brought online then the broadcast machinery can be in the one shot state already. Check this and setup the timer device of the new CPU in one shot mode so the broadcast code can pick up the next_event value correctly. Another AMD C1E oddity, as we switch to broadcast immediately and not after the full bring up via the ACPI cpu idle code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-23 11:38:53 +02:00
Thomas Gleixner	302745699c	clockevents: check broadcast device not tick device Impact: Possible hang on CPU online observed on AMD C1E machines. The broadcast setup code looks at the mode of the tick device to determine whether it needs to be shut down or setup. This is wrong when the broadcast mode is set to one shot already. This can happen when a CPU is brought online as it goes through the periodic setup first. The problem went unnoticed as sane systems do not call into that code before the switch to one shot for the clock event device happens. The AMD C1E idle routine switches over immediately and thereby shuts down the just setup device before the first interrupt happens. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-23 11:38:53 +02:00
Thomas Gleixner	49d670fb8d	clockevents: prevent stale tick_next_period for onlining CPUs Impact: possible hang on CPU onlining in timer one shot mode. The tick_next_period variable is only used during boot on nohz/highres enabled systems, but for CPU onlining it needs to be maintained when the per cpu clock events device operates in one shot mode. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-23 11:38:53 +02:00
Thomas Gleixner	6441402b1f	clockevents: prevent cpu online to interfere with nohz Impact: rare hang which can be triggered on CPU online. tick_do_timer_cpu keeps track of the CPU which updates jiffies via do_timer. The value -1 is used to signal, that currently no CPU is doing this. There are two cases, where the variable can have this state: boot: necessary for systems where the boot cpu id can be != 0 nohz long idle sleep: When the CPU which did the jiffies update last goes into a long idle sleep it drops the update jiffies duty so another CPU which is not idle can pick it up and keep jiffies going. Using the same value for both situations is wrong, as the CPU online code can see the -1 state when the timer of the newly onlined CPU is setup. The setup for a newly onlined CPU goes through periodic mode and can pick up the do_timer duty without being aware of the nohz / highres mode of the already running system. Use two separate states and make them constants to avoid magic numbers confusion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-23 11:38:52 +02:00
Harvey Harrison	3a72dc8eb5	rcu: fix sparse shadowed variable warning kernel/rcuclassic.c:564:18: warning: symbol 'flags' shadows an earlier one kernel/rcuclassic.c:527:16: originally declared here Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 11:04:30 +02:00
Andrew Morton	006c75f146	sched: clarify ifdef tangle - Add some comments to try to make the ifdef puzzle a bit clearer - Explicitly inline one of the three init_hrtick() implementations. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 11:02:31 +02:00
Rakib Mullick	fa74820317	sched: fix init_hrtick() section mismatch warning LD kernel/built-in.o WARNING: kernel/built-in.o(.text+0x326): Section mismatch in reference from the function init_hrtick() to the variable .cpuinit.data:hotplug_hrtick_nb.8 The function init_hrtick() references the variable __cpuinitdata hotplug_hrtick_nb.8. This is often because init_hrtick lacks a __cpuinitdata annotation or the annotation of hotplug_hrtick_nb.8 is wrong. Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-23 11:02:13 +02:00
Chris Friesen	caea8a0370	sched: fix list traversal to use _rcu variant load_balance_fair() calls rcu_read_lock() but then traverses the list using the regular list traversal routine. This patch converts the list traversal to use the _rcu version. Signed-off-by: Chris Friesen <cfriesen@nortel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-22 19:43:10 +02:00
Ingo Molnar	f681bbd656	sched: turn off WAKEUP_OVERLAP WAKEUP_OVERLAP is not a winner on a 16way box, running psql+sysbench: .27-rc7-NO_WAKEUP_OVERLAP .27-rc7-WAKEUP_OVERLAP ------------------------------------------------- 1: 694 811 +14.39% 2: 1454 1427 -1.86% 4: 3017 3070 +1.70% 8: 5694 5808 +1.96% 16: 10592 10612 +0.19% 32: 9693 9647 -0.48% 64: 8507 8262 -2.97% 128: 8402 7087 -18.55% 256: 8419 5124 -64.30% 512: 7990 3671 -117.62% ------------------------------------------------- SUM: 64466 55524 -16.11% ... so turn it off by default. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-22 16:29:00 +02:00
Peter Zijlstra	15afe09bf4	sched: wakeup preempt when small overlap Lin Ming reported a 10% OLTP regression against 2.6.27-rc4. The difference seems to come from different preemption agressiveness, which affects the cache footprint of the workload and its effective cache trashing. Aggresively preempt a task if its avg overlap is very small, this should avoid the task going to sleep and find it still running when we schedule back to it - saving a wakeup. Reported-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-22 16:28:32 +02:00
Mark McLoughlin	d7cfb60c5c	hrtimer: remove hrtimer_clock_base::get_softirq_time() Peter Zijlstra noticed this 8 months ago and I just noticed it again. hrtimer_clock_base::get_softirq_time() is currently unused in the entire tree. In fact, looking at the logs, it appears as if it was never used. Remove it. Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-22 13:52:42 +02:00
Linus Torvalds	b4df9d88a6	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: fix deadlock in setting scheduler parameter to zero sched: fix 2.6.27-rc5 couldn't boot on tulsa machine randomly	2008-09-19 16:17:12 -07:00
Linus Torvalds	902f2ac9da	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clockevents: make device shutdown robust clocksource, acpi_pm.c: fix check for monotonicity clockevents: remove WARN_ON which was used to gather information	2008-09-19 16:16:50 -07:00
David S. Miller	2e57572a50	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6 Conflicts: arch/sparc64/kernel/pci_psycho.c	2008-09-16 14:11:43 -07:00
Thomas Gleixner	2344abbcbd	clockevents: make device shutdown robust The device shut down does not cleanup the next_event variable of the clock event device. So when the device is reactivated the possible stale next_event value can prevent the device to be reprogrammed as it claims to wait on a event already. This is the root cause of the resurfacing suspend/resume problem, where systems need key press to come back to life. Fix this by setting next_event to KTIME_MAX when the device is shut down. Use a separate function for shutdown which takes care of that and only keep the direct set mode call in the broadcast code, where we can not touch the next_event value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-16 13:47:02 -07:00
Ingo Molnar	9dfed08eb4	Merge commit 'v2.6.27-rc6' into core/resources	2008-09-14 17:23:29 +02:00
Ingo Molnar	5ce73a4a5a	timers: fix itimer/many thread hang, cleanups Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-14 17:11:46 +02:00
Ingo Molnar	430b5294bd	timers: fix itimer/many thread hang, fix fix: kernel/fork.c:843: error: ‘struct signal_struct’ has no member named ‘sum_sched_runtime’ kernel/irq/handle.c:117: warning: ‘sparse_irq_lock’ defined but not used Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-14 16:40:11 +02:00
Frank Mayhar	f06febc96b	timers: fix itimer/many thread hang Overview This patch reworks the handling of POSIX CPU timers, including the ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together with the help of Roland McGrath, the owner and original writer of this code. The problem we ran into, and the reason for this rework, has to do with using a profiling timer in a process with a large number of threads. It appears that the performance of the old implementation of run_posix_cpu_timers() was at least O(n3) (where "n" is the number of threads in a process) or worse. Everything is fine with an increasing number of threads until the time taken for that routine to run becomes the same as or greater than the tick time, at which point things degrade rather quickly. This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF." Code Changes This rework corrects the implementation of run_posix_cpu_timers() to make it run in constant time for a particular machine. (Performance may vary between one machine and another depending upon whether the kernel is built as single- or multiprocessor and, in the latter case, depending upon the number of running processors.) To do this, at each tick we now update fields in signal_struct as well as task_struct. The run_posix_cpu_timers() function uses those fields to make its decisions. We define a new structure, "task_cputime," to contain user, system and scheduler times and use these in appropriate places: struct task_cputime { cputime_t utime; cputime_t stime; unsigned long long sum_exec_runtime; }; This is included in the structure "thread_group_cputime," which is a new substructure of signal_struct and which varies for uniprocessor versus multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as a simple substructure, while for multiprocessor kernels it is a pointer: struct thread_group_cputime { struct task_cputime totals; }; struct thread_group_cputime { struct task_cputime totals; }; We also add a new task_cputime substructure directly to signal_struct, to cache the earliest expiration of process-wide timers, and task_cputime also replaces the it_*_expires fields of task_struct (used for earliest expiration of thread timers). The "thread_group_cputime" structure contains process-wide timers that are updated via account_user_time() and friends. In the non-SMP case the structure is a simple aggregator; unfortunately in the SMP case that simplicity was not achievable due to cache-line contention between CPUs (in one measured case performance was actually _worse_ on a 16-cpu system than the same test on a 4-cpu system, due to this contention). For SMP, the thread_group_cputime counters are maintained as a per-cpu structure allocated using alloc_percpu(). The timer functions update only the timer field in the structure corresponding to the running CPU, obtained using per_cpu_ptr(). We define a set of inline functions in sched.h that we use to maintain the thread_group_cputime structure and hide the differences between UP and SMP implementations from the rest of the kernel. The thread_group_cputime_init() function initializes the thread_group_cputime structure for the given task. The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the out-of-line function thread_group_cputime_alloc_smp() to allocate and fill in the per-cpu structures and fields. The thread_group_cputime_free() function, also a no-op for UP, in SMP frees the per-cpu structures. The thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls thread_group_cputime_alloc() if the per-cpu structures haven't yet been allocated. The thread_group_cputime() function fills the task_cputime structure it is passed with the contents of the thread_group_cputime fields; in UP it's that simple but in SMP it must also safely check that tsk->signal is non-NULL (if it is it just uses the appropriate fields of task_struct) and, if so, sums the per-cpu values for each online CPU. Finally, the three functions account_group_user_time(), account_group_system_time() and account_group_exec_runtime() are used by timer functions to update the respective fields of the thread_group_cputime structure. Non-SMP operation is trivial and will not be mentioned further. The per-cpu structure is always allocated when a task creates its first new thread, via a call to thread_group_cputime_clone_thread() from copy_signal(). It is freed at process exit via a call to thread_group_cputime_free() from cleanup_signal(). All functions that formerly summed utime/stime/sum_sched_runtime values from from all threads in the thread group now use thread_group_cputime() to snapshot the values in the thread_group_cputime structure or the values in the task structure itself if the per-cpu structure hasn't been allocated. Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit. The run_posix_cpu_timers() function has been split into a fast path and a slow path; the former safely checks whether there are any expired thread timers and, if not, just returns, while the slow path does the heavy lifting. With the dedicated thread group fields, timers are no longer "rebalanced" and the process_timer_rebalance() function and related code has gone away. All summing loops are gone and all code that used them now uses the thread_group_cputime() inline. When process-wide timers are set, the new task_cputime structure in signal_struct is used to cache the earliest expiration; this is checked in the fast path. Performance The fix appears not to add significant overhead to existing operations. It generally performs the same as the current code except in two cases, one in which it performs slightly worse (Case 5 below) and one in which it performs very significantly better (Case 2 below). Overall it's a wash except in those two cases. I've since done somewhat more involved testing on a dual-core Opteron system. Case 1: With no itimer running, for a test with 100,000 threads, the fixed kernel took 1428.5 seconds, 513 seconds more than the unfixed system, all of which was spent in the system. There were twice as many voluntary context switches with the fix as without it. Case 2: With an itimer running at .01 second ticks and 4000 threads (the most an unmodified kernel can handle), the fixed kernel ran the test in eight percent of the time (5.8 seconds as opposed to 70 seconds) and had better tick accuracy (.012 seconds per tick as opposed to .023 seconds per tick). Case 3: A 4000-thread test with an initial timer tick of .01 second and an interval of 10,000 seconds (i.e. a timer that ticks only once) had very nearly the same performance in both cases: 6.3 seconds elapsed for the fixed kernel versus 5.5 seconds for the unfixed kernel. With fewer threads (eight in these tests), the Case 1 test ran in essentially the same time on both the modified and unmodified kernels (5.2 seconds versus 5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds versus 5.4 seconds but again with much better tick accuracy, .013 seconds per tick versus .025 seconds per tick for the unmodified kernel. Since the fix affected the rlimit code, I also tested soft and hard CPU limits. Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer running), the modified kernel was very slightly favored in that while it killed the process in 19.997 seconds of CPU time (5.002 seconds of wall time), only .003 seconds of that was system time, the rest was user time. The unmodified kernel killed the process in 20.001 seconds of CPU (5.014 seconds of wall time) of which .016 seconds was system time. Really, though, the results were too close to call. The results were essentially the same with no itimer running. Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds (where the hard limit would never be reached) and an itimer running, the modified kernel exhibited worse tick accuracy than the unmodified kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise, performance was almost indistinguishable. With no itimer running this test exhibited virtually identical behavior and times in both cases. In times past I did some limited performance testing. those results are below. On a four-cpu Opteron system without this fix, a sixteen-thread test executed in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On the same system with the fix, user and elapsed time were about the same, but system time dropped to 0.007 seconds. Performance with eight, four and one thread were comparable. Interestingly, the timer ticks with the fix seemed more accurate: The sixteen-thread test with the fix received 149543 ticks for 0.024 seconds per tick, while the same test without the fix received 58720 for 0.061 seconds per tick. Both cases were configured for an interval of 0.01 seconds. Again, the other tests were comparable. Each thread in this test computed the primes up to 25,000,000. I also did a test with a large number of threads, 100,000 threads, which is impossible without the fix. In this case each thread computed the primes only up to 10,000 (to make the runtime manageable). System time dominated, at 1546.968 seconds out of a total 2176.906 seconds (giving a user time of 629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite accurate. There is obviously no comparable test without the fix. Signed-off-by: Frank Mayhar <fmayhar@google.com> Cc: Roland McGrath <roland@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-14 16:25:35 +02:00
Ingo Molnar	6e03f99803	Merge branch 'linus' into x86/iommu Conflicts: lib/swiotlb.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-14 14:07:00 +02:00
Li Zefan	4e74339af6	cpuset: avoid changing cpuset's cpus when -errno returned After the patch: commit `0b2f630a28` Author: Miao Xie <miaox@cn.fujitsu.com> Date: Fri Jul 25 01:47:21 2008 -0700 cpusets: restructure the function update_cpumask() and update_nodemask() It might happen that 'echo 0 > /cpuset/sub/cpus' returned failure but 'cpus' has been changed, because cpus was changed before calling heap_init() which may return -ENOMEM. This patch restores the orginal behavior. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-13 14:41:50 -07:00
David S. Miller	17f04fbb0f	sysctl: Use header file for sysctl knob declarations on sparc. This also takes care of a sparse warning as scons_pwroff's definition point. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-11 23:33:53 -07:00
David S. Miller	72c57ed506	sysctl: Use CONFIG_SPARC instead of __sparc__ for ifdef tests. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-11 23:29:54 -07:00
Arjan van de Ven	2e94d1f71f	hrtimer: peek at the timer queue just before going idle As part of going idle, we already look at the time of the next timer event to determine which C-state to select etc. This patch adds functionality that causes the timers that are past their soft expire time, to fire at this time, before we calculate the next wakeup time. This functionality will thus avoid wakeups by running timers before going idle rather than specially waking up for it. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-11 07:17:49 -07:00
Arjan van de Ven	ae4b748e81	hrtimer: make the futex() system call use the per process slack value This patch makes the futex() system call use the per process slack value; with this users are able to externally control existing applications to reduce the wakeup rate. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-11 07:17:00 -07:00
Arjan van de Ven	3bd012060f	hrtimer: make the nanosleep() syscall use the per process slack This patch makes the nanosleep() system call use the per process slack value; with this users are able to externally control existing applications to reduce the wakeup rate. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-11 07:16:55 -07:00
Ingo Molnar	09b22a2f67	Merge commit 'v2.6.27-rc6' into sched/devel	2008-09-11 13:37:28 +02:00
Hiroshi Shimamoto	ec5d498991	sched: fix deadlock in setting scheduler parameter to zero Andrei Gusev wrote: > I played witch scheduler settings. After doing something like: > echo -n 1000000 >sched_rt_period_us > > command is locked. I found in kernel.log: > > Sep 11 00:39:34 zaratustra > Sep 11 00:39:34 zaratustra Pid: 4495, comm: bash Tainted: G W > (2.6.26.3 #12) > Sep 11 00:39:34 zaratustra EIP: 0060:[<c0213fc7>] EFLAGS: 00210246 CPU: 0 > Sep 11 00:39:34 zaratustra EIP is at div64_u64+0x57/0x80 > Sep 11 00:39:34 zaratustra EAX: 0000389f EBX: 00000000 ECX: 00000000 > EDX: 00000000 > Sep 11 00:39:34 zaratustra ESI: d9800000 EDI: d9800000 EBP: 0000389f > ESP: ea7a6edc > Sep 11 00:39:34 zaratustra DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 > Sep 11 00:39:34 zaratustra Process bash (pid: 4495, ti=ea7a6000 > task=ea744000 task.ti=ea7a6000) > Sep 11 00:39:34 zaratustra Stack: 00000000 000003e8 d9800000 0000389f > c0119042 00000000 00000000 00000001 > Sep 11 00:39:34 zaratustra 00000000 00000000 ea7a6f54 00010000 00000000 > c04d2e80 00000001 000e7ef0 > Sep 11 00:39:34 zaratustra c01191a3 00000000 00000000 ea7a6fa0 00000001 > ffffffff c04d2e80 ea5b2480 > Sep 11 00:39:34 zaratustra Call Trace: > Sep 11 00:39:34 zaratustra [<c0119042>] __rt_schedulable+0x52/0x130 > Sep 11 00:39:34 zaratustra [<c01191a3>] sched_rt_handler+0x83/0x120 > Sep 11 00:39:34 zaratustra [<c01a76a6>] proc_sys_call_handler+0xb6/0xd0 > Sep 11 00:39:34 zaratustra [<c01a76c0>] proc_sys_write+0x0/0x20 > Sep 11 00:39:34 zaratustra [<c01a76d9>] proc_sys_write+0x19/0x20 > Sep 11 00:39:34 zaratustra [<c016cc68>] vfs_write+0xa8/0x140 > Sep 11 00:39:34 zaratustra [<c016cdd1>] sys_write+0x41/0x80 > Sep 11 00:39:34 zaratustra [<c0103051>] sysenter_past_esp+0x6a/0x91 > Sep 11 00:39:34 zaratustra ======================= > Sep 11 00:39:34 zaratustra Code: c8 41 0f ad f3 d3 ee f6 c1 20 0f 45 de > 31 f6 0f ad ef d3 ed f6 c1 20 0f 45 fd 0f 45 ee 31 c9 39 eb 89 fe 89 ea > 77 08 89 e8 31 d2 <f7> f3 89 c1 89 f0 8b 7c 24 08 f7 f3 8b 74 24 04 89 > ca 8b 1c 24 > Sep 11 00:39:34 zaratustra EIP: [<c0213fc7>] div64_u64+0x57/0x80 SS:ESP > 0068:ea7a6edc > Sep 11 00:39:34 zaratustra ---[ end trace 4eaa2a86a8e2da22 ]--- fix the boundary condition. sysctl_sched_rt_period=0 makes exception at to_ratio(). Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-11 09:39:18 +02:00
Zhang, Yanmin	baf25731e5	sched: fix 2.6.27-rc5 couldn't boot on tulsa machine randomly On my tulsa x86-64 machine, kernel 2.6.25-rc5 couldn't boot randomly. Basically, function __enable_runtime forgets to reset rt_rq->rt_throttled to 0. When every cpu is up, per-cpu migration_thread is created and it runs very fast, sometimes to mark the corresponding rt_rq->rt_throttled to 1 very quickly. After all cpus are up, with below calling chain: sched_init_smp => arch_init_sched_domains => build_sched_domains => ... => cpu_attach_domain => rq_attach_root => set_rq_online => ... => _enable_runtime _enable_runtime is called against every rt_rq again, so rt_rq->rt_time is reset to 0, but rt_rq->rt_throttled might be still 1. Later on function do_sched_rt_period_timer couldn't reset it, and all RT tasks couldn't be scheduled to run on that cpu. here is RT task migration_thread which is woken up when a task is migrated to another cpu. Below patch fixes it against 2.6.27-rc5. Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-11 09:34:28 +02:00
Ingo Molnar	59c37bf892	Merge commit 'v2.6.27-rc6' into x86/unify-cpu-detect Conflicts: arch/x86/kernel/cpu/amd.c arch/x86/kernel/cpu/common.c arch/x86/kernel/cpu/common_64.c arch/x86/kernel/cpu/feature_names.c include/asm-x86/cpufeature.h Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-10 14:00:45 +02:00
Ingo Molnar	e92b4fdacc	Merge commit 'v2.6.27-rc6' into x86/iommu	2008-09-10 11:32:52 +02:00
Ingo Molnar	6003ab0bad	Merge branch 'linus' into core/debug Conflicts: lib/vsprintf.c Manual merge: include/linux/kernel.h Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-10 09:09:51 +02:00
Arjan van de Ven	ab7476cf76	debug: add notifier chain debugging, v2 - unbreak ia64 (and powerpc) where function pointers dont point at code but at data (reported by Tony Luck) [ mingo@elte.hu: various cleanups ] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-10 09:08:53 +02:00
Ingo Molnar	fb822db465	softlockup: increase hung tasks check from 2 minutes to 8 minutes Andrew says: > Seems that about 100% of the reports we get of this warning triggering > are sys_sync, transaction commit, etc. increase the timeout. If it still triggers for people, we can kill it. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-10 09:08:51 +02:00
Arjan van de Ven	1b2439dbb7	debug: add notifier chain debugging during some development we suspected a case where we left something in a notifier chain that was from a module that was unloaded already... and that sort of thing is rather hard to track down. This patch adds a very simple sanity check (which isn't all that expensive) to make sure the notifier we're about to call is actually from either the kernel itself of from a still-loaded module, avoiding a hard-to-chase-down crash. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-10 09:08:50 +02:00
Ingo Molnar	429b022af4	Merge commit 'v2.6.27-rc6' into core/rcu	2008-09-10 08:35:40 +02:00
Thomas Gleixner	61c22c34c6	clockevents: remove WARN_ON which was used to gather information The issue of the endless reprogramming loop due to a too small min_delta_ns was fixed with the previous updates of the clock events code, but we had no information about the spread of this problem. I added a WARN_ON to get automated information via kerneloops.org and to get some direct reports, which allowed me to analyse the affected machines. The WARN_ON has served its purpose and would be annoying for a release kernel. Remove it and just keep the information about the increase of the min_delta_ns value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-09 22:20:01 +02:00
Linus Torvalds	e1d7bf1499	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: arch_reinit_sched_domains() must destroy domains to force rebuild sched, cpuset: rework sched domains and CPU hotplug handling (v4)	2008-09-08 15:47:21 -07:00
Manfred Spraul	e545a6140b	kernel/cpu.c: create a CPU_STARTING cpu_chain notifier Right now, there is no notifier that is called on a new cpu, before the new cpu begins processing interrupts/softirqs. Various kernel function would need that notification, e.g. kvm works around by calling smp_call_function_single(), rcu polls cpu_online_map. The patch adds a CPU_STARTING notification. It also adds a helper function that sends the message to all cpu_chain handlers. Tested on x86-64. All other archs are untested. Especially on sparc, I'm not sure if I got it right. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-08 19:25:24 +02:00
Arjan van de Ven	704af52bd1	hrtimer: show the timer ranges in /proc/timer_list to help debugging and visibility of timer ranges, show them in the existing timer list in /proc/timer_list Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-07 16:10:20 -07:00
Arjan van de Ven	da8f2e170e	hrtimer: add a hrtimer_start_range() function this patch adds a _range version of hrtimer_start() so that range timers can be created; the hrtimer_start() function is just a wrapper around this. In addition, hrtimer_start_expires() will now preserve existing ranges. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-07 10:58:01 -07:00
Linus Torvalds	f532522565	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource, acpi_pm.c: check for monotonicity clocksource, acpi_pm.c: use proper read function also in errata mode ntp: fix calculation of the next jiffie to trigger RTC sync x86: HPET: read back compare register before reading counter x86: HPET fix moronic 32/64bit thinko clockevents: broadcast fixup possible waiters HPET: make minimum reprogramming delta useful clockevents: prevent endless loop lockup clockevents: prevent multiple init/shutdown clockevents: enforce reprogram in oneshot setup clockevents: prevent endless loop in periodic broadcast handler clockevents: prevent clockevent event_handler ending up handler_noop	2008-09-06 19:33:26 -07:00
Ingo Molnar	291c54ff76	Merge branch 'sched/cpuset' into sched/urgent	2008-09-06 21:03:16 +02:00
Pawel MOLL	7e6e178ab1	genirq: irq_chip->startup() usage in setup_irq and set_irq_chained handler This patch clarifies usage of irq_chip->startup() callback: 1. The "if (startup) startup(); else enabled();" code in setup_irq() is unnecessary, as startup() falls back to enabled() via default callbacks, set by irq_chip_set_defaults(). 2. When using set_irq_chained_handler() the startup() was never called, which is not good at all... Fixed. And again - when startup() is not defined the call will fall back to enable() than to unmask() via default callbacks. Signed-off-by: Pawel Moll <pawel.moll@st.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-06 20:36:38 +02:00
Alexey Dobriyan	978b0116cd	softirq: allocate less vectors We don't need whole 32 of them, only NR_SOFTIRQS. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-06 20:04:36 +02:00
Max Krasnyansky	dfb512ec48	sched: arch_reinit_sched_domains() must destroy domains to force rebuild What I realized recently is that calling rebuild_sched_domains() in arch_reinit_sched_domains() by itself is not enough when cpusets are enabled. partition_sched_domains() code is trying to avoid unnecessary domain rebuilds and will not actually rebuild anything if new domain masks match the old ones. What this means is that doing echo 1 > /sys/devices/system/cpu/sched_mc_power_savings on a system with cpusets enabled will not take affect untill something changes in the cpuset setup (ie new sets created or deleted). This patch fixes restore correct behaviour where domains must be rebuilt in order to enable MC powersaving flags. Test on quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS. Also tested on dual-core Core2 laptop. Lockdep is happy and things are working as expected. Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-06 19:22:15 +02:00
Manfred Spraul	3ba35573ad	kernel/cpu.c: Move the CPU_DYING notifiers When a cpu is taken offline, the CPU_DYING notifiers are called on the dying cpu. According to <linux/notifiers.h>, the cpu should be "not running any task, not handling interrupts, soon dead". For the current implementation, this is not true: - __cpu_disable can fail. If it fails, then the cpu will remain alive and happy. - At least on x86, __cpu_disable() briefly enables the local interrupts to handle any outstanding interrupts. What about moving CPU_DYING down a few lines, behind the __cpu_disable() line? There are only two CPU_DYING handlers in the kernel right now: one in kvm, one in the scheduler. Both should work with the patch applied [and: I'm not sure if either one handles a failing __cpu_disable()] The patch survives simple offlining a cpu. kvm untested due to lack of a test setup. Signed-off-By: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-06 19:13:59 +02:00
Gautham R Shenoy	38736f4750	sched: fix __load_balance_iterator() for cfq with only one task The __load_balance_iterator() returns a NULL when there's only one sched_entity which is a task. It is caused by the following code-path. /* Skip over entities that are not tasks */ do { se = list_entry(next, struct sched_entity, group_node); next = next->next; } while (next != &cfs_rq->tasks && !entity_is_task(se)); if (next == &cfs_rq->tasks) return NULL; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This will return NULL even when se is a task. As a side-effect, there was a regression in sched_mc behavior since 2.6.25, since iter_move_one_task() when it calls load_balance_start_fair(), would not get any tasks to move! Fix this by checking if the last entity was a task or not. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-06 16:53:34 +02:00
Ingo Molnar	7f79d852ed	Merge branch 'linus' into sched/devel	2008-09-06 16:51:57 +02:00
Maciej W. Rozycki	4ff4b9e19a	ntp: fix calculation of the next jiffie to trigger RTC sync We have a bug in the calculation of the next jiffie to trigger the RTC synchronisation. The aim here is to run sync_cmos_clock() as close as possible to the middle of a second. Which means we want this function to be called less than or equal to half a jiffie away from when now.tv_nsec equals 5e8 (500000000). If this is not the case for a given call to the function, for this purpose instead of updating the RTC we calculate the offset in nanoseconds to the next point in time where now.tv_nsec will be equal 5e8. The calculated offset is then converted to jiffies as these are the unit used by the timer. Hovewer timespec_to_jiffies() used here uses a ceil()-type rounding mode, where the resulting value is rounded up. As a result the range of now.tv_nsec when the timer will trigger is from 5e8 to 5e8 + TICK_NSEC rather than the desired 5e8 - TICK_NSEC / 2 to 5e8 + TICK_NSEC / 2. As a result if for example sync_cmos_clock() happens to be called at the time when now.tv_nsec is between 5e8 + TICK_NSEC / 2 and 5e8 to 5e8 + TICK_NSEC, it will simply be rescheduled HZ jiffies later, falling in the same range of now.tv_nsec again. Similarly for cases offsetted by an integer multiple of TICK_NSEC. This change addresses the problem by subtracting TICK_NSEC / 2 from the nanosecond offset to the next point in time where now.tv_nsec will be equal 5e8, effectively shifting the following rounding in timespec_to_jiffies() so that it produces a rounded-to-nearest result. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-06 15:31:48 +02:00
Ingo Molnar	77dd3b3bd2	Merge branch 'linus' into timers/ntp	2008-09-06 15:31:03 +02:00
Krzysztof Helt	c8bfff6dd4	sched: compilation fix with gcc 3.4.6 I found that 2.6.27-rc5-mm1 does not compile with gcc 3.4.6. The error is: CC kernel/sched.o kernel/sched.c: In function `start_rt_bandwidth': kernel/sched.c:208: sorry, unimplemented: inlining failed in call to 'rt_bandwidth_enabled': function body not available kernel/sched.c:214: sorry, unimplemented: called from here make[1]: * [kernel/sched.o] Error 1 make: * [kernel] Error 2 It seems that the gcc 3.4.6 requires full inline definition before first usage. The patch below fixes the compilation problem. Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> (if needed> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-06 15:17:09 +02:00
Thomas Gleixner	7300711e8c	clockevents: broadcast fixup possible waiters Until the C1E patches arrived there where no users of periodic broadcast before switching to oneshot mode. Now we need to trigger a possible waiter for a periodic broadcast when switching to oneshot mode. Otherwise we can starve them for ever. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-06 07:21:17 +02:00
Arjan van de Ven	6976675d94	hrtimer: create a "timer_slack" field in the task struct We want to be able to control the default "rounding" that is used by select() and poll() and friends. This is a per process property (so that we can have a "nice" like program to start certain programs with a looser or stricter rounding) that can be set/get via a prctl(). For this purpose, a field called "timer_slack_ns" is added to the task struct. In addition, a field called "default_timer_slack"ns" is added so that tasks easily can temporarily to a more/less accurate slack and then back to the default. The default value of the slack is set to 50 usec; this is significantly less than 2.6.27's average select() and poll() timing error but still allows the kernel to group timers somewhat to preserve power behavior. Applications and admins can override this via the prctl() Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-05 21:35:30 -07:00
Arjan van de Ven	654c8e0b1c	hrtimer: turn hrtimers into range timers this patch turns hrtimers into range timers; they have 2 expire points 1) the soft expire point 2) the hard expire point the kernel will do it's regular best effort attempt to get the timer run at the hard expire point. However, if some other time fires after the soft expire point, the kernel now has the freedom to fire this timer at this point, and thus grouping the events and preventing a power-expensive wakeup in the future. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-05 21:35:27 -07:00
Arjan van de Ven	cc584b213f	hrtimer: convert kernel/* to the new hrtimer apis In order to be able to do range hrtimers we need to use accessor functions to the "expire" member of the hrtimer struct. This patch converts kernel/* to these accessors. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-05 21:35:13 -07:00
Thomas Gleixner	df0cc0539b	select: add a timespec_add_safe() function For the select() rework, it's important to be able to add timespec structures in an overflow-safe manner. This patch adds a timespec_add_safe() function for this which is similar in operation to ktime_add_safe(), but works on a struct timespec. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-09-05 21:34:57 -07:00
Arjan van de Ven	7bb67439bf	select: Introduce a hrtimeout function This patch adds a schedule_hrtimeout() function, to be used by select() and poll() in a later patch. This function works similar to schedule_timeout() in most ways, but takes a timespec rather than jiffies. With a lot of contributions/fixes from Thomas Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-09-05 21:34:53 -07:00
Balbir Singh	49048622ea	sched: fix process time monotonicity Spencer reported a problem where utime and stime were going negative despite the fixes in commit `b27f03d4bd`. The suspected reason for the problem is that signal_struct maintains it's own utime and stime (of exited tasks), these are not updated using the new task_utime() routine, hence sig->utime can go backwards and cause the same problem to occur (sig->utime, adds tsk->utime and not task_utime()). This patch fixes the problem TODO: using max(task->prev_utime, derived utime) works for now, but a more generic solution is to implement cputime_max() and use the cputime_gt() function for comparison. Reported-by: spencer@bluehost.com Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-05 18:14:35 +02:00
Peter Zijlstra	56c7426b39	sched_clock: fix NOHZ interaction If HLT stops the TSC, we'll fail to account idle time, thereby inflating the actual process times. Fix this by re-calibrating the clock against GTOD when leaving nohz mode. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-05 18:14:08 +02:00
Thomas Gleixner	1fb9b7d29d	clockevents: prevent endless loop lockup The C1E/HPET bug reports on AMDX2/RS690 systems where tracked down to a too small value of the HPET minumum delta for programming an event. The clockevents code needs to enforce an interrupt event on the clock event device in some cases. The enforcement code was stupid and naive, as it just added the minimum delta to the current time and tried to reprogram the device. When the minimum delta is too small, then this loops forever. Add a sanity check. Allow reprogramming to fail 3 times, then print a warning and double the minimum delta value to make sure, that this does not happen again. Use the same function for both tick-oneshot and tick-broadcast code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-05 11:11:53 +02:00
Thomas Gleixner	9c17bcda99	clockevents: prevent multiple init/shutdown While chasing the C1E/HPET bugreports I went through the clock events code inch by inch and found that the broadcast device can be initialized and shutdown multiple times. Multiple shutdowns are not critical, but useless waste of time. Multiple initializations are simply broken. Another CPU might have the device in use already after the first initialization and the second init could just render it unusable again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-05 11:11:52 +02:00
Thomas Gleixner	7205656ab4	clockevents: enforce reprogram in oneshot setup In tick_oneshot_setup we program the device to the given next_event, but we do not check the return value. We need to make sure that the device is programmed enforced so the interrupt handler engine starts working. Split out the reprogramming function from tick_program_event() and call it with the device, which was handed in to tick_setup_oneshot(). Set the force argument, so the devices is firing an interrupt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-05 11:11:52 +02:00
Thomas Gleixner	d4496b3955	clockevents: prevent endless loop in periodic broadcast handler The reprogramming of the periodic broadcast handler was broken, when the first programming returned -ETIME. The clockevents code stores the new expiry value in the clock events device next_event field only when the programming time has not been elapsed yet. The loop in question calculates the new expiry value from the next_event value and therefor never increases. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-05 11:11:51 +02:00
Venkatesh Pallipadi	7c1e768974	clockevents: prevent clockevent event_handler ending up handler_noop There is a ordering related problem with clockevents code, due to which clockevents_register_device() called after tickless/highres switch will not work. The new clockevent ends up with clockevents_handle_noop as event handler, resulting in no timer activity. The problematic path seems to be * old device already has hrtimer_interrupt as the event_handler * new clockevent device registers with a higher rating * tick_check_new_device() is called * clockevents_exchange_device() gets called * old->event_handler is set to clockevents_handle_noop * tick_setup_device() is called for the new device * which sets new->event_handler using the old->event_handler which is noop. Change the ordering so that new device inherits the proper handler. This does not have any issue in normal case as most likely all the clockevent devices are setup before the highres switch. But, can potentially be affecting some corner case where HPET force detect happens after the highres switch. This was a problem with HPET in MSI mode code that we have been experimenting with. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-05 11:11:51 +02:00
Ingo Molnar	0c8c708a7e	Merge branch 'x86/core' into x86/unify-cpu-detect	2008-09-05 09:27:23 +02:00
Ingo Molnar	1cf44baad7	IO resources: fix/remove printk Andrew Morton noticed that the printk in kernel/resource.c was buggy: \| start and end have type resource_size_t. Such types CANNOT be printed \| unless cast to a known type. \| \| Because there is a %s following an incorrect %lld, the above code will \| crash the machine. ... and it's probably quite unneeded as well, so remove it. Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-04 22:08:24 +02:00
Ingo Molnar	8040d77688	Merge branch 'core/resources' into x86/core	2008-09-04 21:04:04 +02:00
Yinghai Lu	268364a0f4	IO resources: add reserve_region_with_split() add reserve_region_with_split() to not lose e820 reserved entries if they overlap with existing IO regions: with test case by extend 0xe0000000 - 0xeffffff to 0xdd800000 - we get: e0000000-efffffff : PCI MMCONFIG 0 e0000000-efffffff : reserved and in /proc/iomem we get: found conflict for reserved [dd800000, efffffff], try to reserve with split __reserve_region_with_split: (PCI Bus #80) [dd000000, ddffffff], res: (reserved) [dd800000, efffffff] __reserve_region_with_split: (PCI Bus #00) [de000000, dfffffff], res: (reserved) [de000000, efffffff] initcall pci_subsys_init+0x0/0x121 returned 0 after 381 msecs in dmesg various fixes and improvements suggested by Linus. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-09-04 21:02:44 +02:00
Al Viro	b380b0d4f7	forgotten refcount on sysctl root table We should've set refcount on the root sysctl table; otherwise we'll blow up the first time we get down to zero dynamically registered sysctl tables. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-04 11:06:21 -07:00
H. Peter Anvin	0ccd8c39bc	Merge branch 'linus' into x86/core	2008-09-04 08:09:09 -07:00
H. Peter Anvin	7203781c98	Merge branch 'x86/cpu' into x86/core Conflicts: arch/x86/kernel/cpu/feature_names.c include/asm-x86/cpufeature.h	2008-09-04 08:08:42 -07:00
John Kacur	9d35935747	pm_qos_requirement might sleep Make PM_QOS and CPU_IDLE play nicer when run with the RT-Preempt kernel. The purpose of the patch is to remove the spin_lock around the read in the function pm_qos_requirement - since spinlocks can sleep in -rt and this function is called from idle. CPU_IDLE polls the target_value's of some of the pm_qos parameters from the idle loop causing sleeping locking warnings. Changing the target_value to an atomic avoids this issue. Remove the spinlock in pm_qos_requirement by making target_value an atomic type. Signed-off-by: mark gross <mgross@linux.intel.com> Signed-off-by: John Kacur <jkacur@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 19:21:40 -07:00
Oleg Nesterov	950bbabb5a	pid_ns: (BUG 11391) change ->child_reaper when init->group_leader exits We don't change pid_ns->child_reaper when the main thread of the subnamespace init exits. As Robert Rex <robert.rex@exasol.com> pointed out this is wrong. Yes, the re-parenting itself works correctly, but if the reparented task exits it needs ->parent->nsproxy->pid_ns in do_notify_parent(), and if the main thread is zombie its ->nsproxy was already cleared by exit_task_namespaces(). Introduce the new function, find_new_reaper(), which finds the new ->parent for the re-parenting and changes ->child_reaper if needed. Kill the now unneeded exit_child_reaper(). Also move the changing of ->child_reaper from zap_pid_ns_processes() to find_new_reaper(), this consolidates the games with ->child_reaper and makes it stable under tasklist_lock. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=11391 Reported-by: Robert Rex <robert.rex@exasol.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 19:21:38 -07:00
Oleg Nesterov	add0d4dfd6	pid_ns: zap_pid_ns_processes: fix the ->child_reaper changing zap_pid_ns_processes() sets pid_ns->child_reaper = NULL, this is wrong. Yes, we have already killed all tasks in this namespace, and sys_wait4() doesn't see any child. But this doesn't mean ->children list is empty, we may have EXIT_DEAD tasks which are not visible to do_wait(). In that case the subsequent forget_original_parent() will crash the kernel because it will try to re-parent these tasks to the NULL reaper. Even if there are no childs, it is not good that forget_original_parent() uses reaper == NULL. Change the code to set ->child_reaper = init_pid_ns.child_reaper instead. We could use pid_ns->parent->child_reaper as well, I think this does not really matter. These EXIT_DEAD tasks are not visible to the new ->parent after re-parenting, they will silently do release_task() eventually. Note that we must change ->child_reaper, otherwise forget_original_parent() will use reaper == father, and in that case we will hit the (correct) BUG_ON(!list_empty(&father->children)). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 19:21:38 -07:00
Linus Torvalds	99039e1352	Merge branch 'audit.b57' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b57' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] audit: Moved variable declaration to beginning of function	2008-09-02 11:04:47 -07:00
Oleg Nesterov	cbaed698f3	softlockup: minor cleanup, don't check task->state twice The recent commit 16d9679f33caf7e683471647d1472bfe133d858 changed check_hung_task() to filter out the TASK_KILLABLE tasks. We can move this check to the caller which has to test t->state anyway. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 10:49:51 -07:00
Randy Dunlap	6781f4ae30	kernel/resource.c: fix new kernel-doc warning Fix kernel-doc warning for new function: Warning(linux-2.6.27-rc5-git2//kernel/resource.c:448): No description found for parameter 'root' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-09-02 10:47:30 -07:00
Cordelia	c4bacefb7a	[PATCH] audit: Moved variable declaration to beginning of function got rid of compilation warning: ISO C90 forbids mixed declarations and code Signed-off-by: Cordelia Sam <cordesam@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-09-01 23:06:45 -04:00
Linus Torvalds	bef69ea0dc	Resource handling: add 'insert_resource_expand_to_fit()' function Not used anywhere yet, but this complements the existing plain 'insert_resource()' functionality with a version that can expand the resource we are adding in order to fix up any conflicts it has with existing resources. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-29 20:25:20 -07:00
Andi Kleen	316d9679f3	Don't trigger softlockup detector on network fs blocked tasks Pulling the ethernet cable on a 2.6.27-rc system with NFS mounts currently leads to an ongoing flood of soft lockup detector backtraces for all tasks blocked on the NFS mounts when the hickup takes longer than 120s. I don't think NFS problems should be all that noisy. Luckily there's a reasonably easy way to distingush this case. Don't report task softlockup warnings for tasks in TASK_KILLABLE state, which is used by the network file systems. I believe this patch is a 2.6.27 candidate. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-29 14:46:29 -07:00
Linus Torvalds	66833d5f39	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: exit signals: use of uninitialized field notify_count lockdep: fix invalid list_del_rcu in zap_class lockstat: repair erronous contention statistics lockstat: fix numerical output rounding error	2008-08-28 12:31:49 -07:00
Linus Torvalds	0234bf1d98	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: rt-bandwidth accounting fix sched: fix sched_rt_rq_enqueue() resched idle	2008-08-28 12:31:12 -07:00
Linus Torvalds	e52c8857e0	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: update defconfigs x86: msr: fix bogus return values from rdmsr_safe/wrmsr_safe x86: cpuid: correct return value on partial operations x86: msr: correct return value on partial operations x86: cpuid: propagate error from smp_call_function_single() x86: msr: propagate errors from smp_call_function_single() smp: have smp_call_function_single() detect invalid CPUs	2008-08-28 12:30:59 -07:00
Rafael J. Wysocki	41108eb101	ftrace: disable tracing for hibernation In accordance with commit `f42ac38c59` ("ftrace: disable tracing for suspend to ram"), disable tracing around the suspend code in hibernation code paths. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-28 12:27:39 -07:00
Peter Zijlstra	cc2991cf15	sched: rt-bandwidth accounting fix It fixes an accounting bug where we would continue accumulating runtime even though the bandwidth control is disabled. This would lead to very long throttle periods once bandwidth control gets turned on again. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-28 13:42:38 +02:00
Ingo Molnar	7940ca3605	sched: extract walk_tg_tree(), fix fix: kernel/sched.c: In function '__rt_schedulable': kernel/sched.c:8771: error: implicit declaration of function 'walk_tg_tree' kernel/sched.c:8771: error: 'tg_nop' undeclared (first use in this function) kernel/sched.c:8771: error: (Each undeclared identifier is reported only once kernel/sched.c:8771: error: for each function it appears in.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-28 12:08:00 +02:00
Ingo Molnar	aef745fca0	sched: clean up __might_sleep() add KERN_ to the printout and clean up the flow a bit. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-28 11:36:03 +02:00
Joe Korty	29cbef4869	make might_sleep() display the oopsing process Expand might_sleep's printk to indicate the oopsing process. Signed-off-by: Joe Korty <joe.korty@ccur.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-28 11:36:02 +02:00
Bharata B Rao	aec0a5142c	sched: call resched_task() conditionally from new task wake up path - During wake up of a new task, task_new_fair() can do a resched_task() on the current task. Later in the code path, check_preempt_curr() also ends up doing the same, which can be avoided. Check if TIF_NEED_RESCHED is already set for the current task. - task_new_fair() does a resched_task() on the current task unconditionally. This can be done only in case when child runs before the parent. So this is a small speedup. Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-28 11:35:51 +02:00
John Blackwood	f3ade83780	sched: fix sched_rt_rq_enqueue() resched idle When sysctl_sched_rt_runtime is set to something other than -1 and the CONFIG_RT_GROUP_SCHED kernel parameter is NOT enabled, we get into a state where we see one or more CPUs idling forvever even though there are real-time tasks in their rt runqueue that are able to run (no longer throttled). The sequence is: - A real-time task is running when the timer sets the rt runqueue to throttled, and the rt task is resched_task()ed and switched out, and idle is switched in since there are no non-rt tasks to run on that cpu. - Eventually the do_sched_rt_period_timer() runs and un-throttles the rt runqueue, but we just exit the timer interrupt and go back to executing the idle task in the idle loop forever. If we change the sched_rt_rq_enqueue() routine to use some of the code from the CONFIG_RT_GROUP_SCHED enabled version of this same routine and resched_task() the currently executing task (idle in our case) if it is a lower priority task than the higher rt task in the now un-throttled runqueue, the problem is no longer observed. Signed-off-by: John Blackwood <john.blackwood@ccur.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-28 11:13:24 +02:00
Steven Rostedt	f42ac38c59	ftrace: disable tracing for suspend to ram I've been painstakingly debugging the issue with suspend to ram and ftraced. The 2.6.28 code does not have this issue, but since the mcount recording is not going to be in 27, this must be solved for the ftrace daemon version. The resume from suspend to ram would reboot because it was triple faulting. Debugging further, I found that calling the mcount function itself was not an issue, but it would fault when it incremented preempt_count. preempt_count is on the tasks info structure that is on the low memory address of the task's stack. For some reason, it could not write to it. Resuming out of suspend to ram does quite a lot of funny tricks to get to work, so it is not surprising at all that simply doing a preempt_disable() would cause a fault. Thanks to Rafael for suggesting to add a "while (1);" to find the place in resuming that is causing the fault. I would place the loop somewhere in the code, compile and reboot and see if it would either reboot (hit the fault) or simply hang (hit the loop). Doing this over and over again, I narrowed it down that it was happening in enable_nonboot_cpus. At this point, I found that it is easier to simply disable tracing around the suspend code, instead of searching for the particular function that can not handle doing a preempt_disable. This patch disables the tracer as it suspends and reenables it on resume. I tested this patch on my Laptop, and it can resume fine with the patch. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-27 13:54:20 -07:00
Hiroshi Shimamoto	0cd418ddb1	rcuclassic: fix compiler warning CC kernel/rcuclassic.o kernel/rcuclassic.c: In function 'rcu_init_percpu_data': kernel/rcuclassic.c:705: warning: comparison of distinct pointer types lacks a cast kernel/rcuclassic.c:713: warning: comparison of distinct pointer types lacks a cast flags should be unsigned long. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-27 09:28:07 +02:00
Steve VanDeBogart	2633f0e57b	exit signals: use of uninitialized field notify_count task->signal->notify_count is only initialized if task->signal->group_exit_task is not NULL. Reorder a conditional so that uninitialised memory is not used. Found by Valgrind. Signed-off-by: Steve VanDeBogart <vandebo-lkml@nerdbox.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-27 09:10:09 +02:00
Zhu Yi	7487017282	lockdep: fix invalid list_del_rcu in zap_class The problem is found during iwlagn driver testing on v2.6.27-rc4-176-gb8e6c91 kernel, but it turns out to be a lockdep bug. In our testing, we frequently load and unload the iwlagn driver (>50 times). Then the MAX_STACK_TRACE_ENTRIES is reached (expected behaviour?). The error message with the call trace is as below. BUG: MAX_STACK_TRACE_ENTRIES too low! turning off the locking correctness validator. Pid: 4895, comm: iwlagn Not tainted 2.6.27-rc4 #13 Call Trace: [<ffffffff81014aa1>] save_stack_trace+0x22/0x3e [<ffffffff8105390a>] save_trace+0x8b/0x91 [<ffffffff81054e60>] mark_lock+0x1b0/0x8fa [<ffffffff81056f71>] __lock_acquire+0x5b9/0x716 [<ffffffffa00d818a>] ieee80211_sta_work+0x0/0x6ea [mac80211] [<ffffffff81057120>] lock_acquire+0x52/0x6b [<ffffffff81045f0e>] run_workqueue+0x97/0x1ed [<ffffffff81045f5e>] run_workqueue+0xe7/0x1ed [<ffffffff81045f0e>] run_workqueue+0x97/0x1ed [<ffffffff81046ae4>] worker_thread+0xd8/0xe3 [<ffffffff81049503>] autoremove_wake_function+0x0/0x2e [<ffffffff81046a0c>] worker_thread+0x0/0xe3 [<ffffffff810493ec>] kthread+0x47/0x73 [<ffffffff8128e3ab>] trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8100cea9>] child_rip+0xa/0x11 [<ffffffff8100c4df>] restore_args+0x0/0x30 [<ffffffff810316e1>] finish_task_switch+0x0/0xcc [<ffffffff810493a5>] kthread+0x0/0x73 [<ffffffff8100ce9f>] child_rip+0x0/0x11 Although the above is harmless, when the ilwagn module is removed later lockdep will trigger a kernel oops as below. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff810531e1>] zap_class+0x24/0x82 PGD 73128067 PUD 7448c067 PMD 0 Oops: 0002 [1] SMP CPU 0 Modules linked in: rfcomm l2cap bluetooth autofs4 sunrpc nf_conntrack_ipv6 xt_state nf_conntrack xt_tcpudp ip6t_ipv6header ip6t_REJECT ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_log dm_multipath dm_mod snd_hda_intel sr_mod snd_seq_dummy snd_seq_oss snd_seq_midi_event battery snd_seq snd_seq_device cdrom button snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc e1000e snd_hwdep sg iTCO_wdt iTCO_vendor_support ac pcspkr i2c_i801 i2c_core snd soundcore video output ata_piix ata_generic libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: mac80211] Pid: 4941, comm: modprobe Not tainted 2.6.27-rc4 #10 RIP: 0010:[<ffffffff810531e1>] [<ffffffff810531e1>] zap_class+0x24/0x82 RSP: 0000:ffff88007bcb3eb0 EFLAGS: 00010046 RAX: 0000000000068ee8 RBX: ffffffff8192a0a0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000001dfb RDI: ffffffff816e70b0 RBP: ffffffffa00cd000 R08: ffffffff816818f8 R09: ffff88007c923558 R10: ffffe20002ad2408 R11: ffffffff811028ec R12: ffffffff8192a0a0 R13: 000000000002bd90 R14: 0000000000000000 R15: 0000000000000296 FS: 00007f9d1cee56f0(0000) GS:ffffffff814a58c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 0000000073047000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process modprobe (pid: 4941, threadinfo ffff88007bcb2000, task ffff8800758d1fc0) Stack: ffffffff81057376 0000000000000000 ffffffffa00f7b00 0000000000000000 0000000000000080 0000000000618278 00007fff24f16720 0000000000000000 ffffffff8105d37a ffffffffa00f7b00 ffffffff8105d591 313132303863616d Call Trace: [<ffffffff81057376>] ? lockdep_free_key_range+0x61/0xf5 [<ffffffff8105d37a>] ? free_module+0xd4/0xe4 [<ffffffff8105d591>] ? sys_delete_module+0x1de/0x1f9 [<ffffffff8106dbfa>] ? audit_syscall_entry+0x12d/0x160 [<ffffffff8100be2b>] ? system_call_fastpath+0x16/0x1b Code: b2 00 01 00 00 00 c3 31 f6 49 c7 c0 10 8a 61 81 eb 32 49 39 38 75 26 48 98 48 6b c0 38 48 8b 90 08 8a 61 81 48 8b 88 00 8a 61 81 <48> 89 51 08 48 89 0a 48 c7 80 08 8a 61 81 00 02 20 00 48 ff c6 RIP [<ffffffff810531e1>] zap_class+0x24/0x82 RSP <ffff88007bcb3eb0> CR2: 0000000000000008 ---[ end trace a1297e0c4abb0f2e ]--- The root cause for this oops is in add_lock_to_list() when save_trace() fails due to MAX_STACK_TRACE_ENTRIES is reached, entry->class is assigned but entry is never added into any lock list. This makes the list_del_rcu() in zap_class() oops later when the module is unloaded. This patch fixes the problem by assigning entry->class after save_trace() returns success. Signed-off-by: Zhu Yi <yi.zhu@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-27 08:40:36 +02:00
Joe Korty	04148b73b8	lockstat: repair erronous contention statistics Fix bad contention counting in /proc/lock_stat. /proc/lockstat tries to gather per-ip contention statistics per-lock. This was failing due to a garbage per-ip index selector being used. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-26 10:37:47 +02:00
Joe Korty	2189459d25	lockstat: fix numerical output rounding error Fix rounding error in /proc/lock_stat numerical output. On occasion the two digit fractional part contains the three digit value '100'. This is due to a bug in the rounding algorithm which pushes values in the range '95..99' to '100' rather than to '00' + an increment to the integer part. For example, - 123456.100 old display + 123457.00 new display Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-26 10:37:46 +02:00
Kevin Diggs	65eb3dc609	sched: add kernel doc for the completion, fix kernel-doc-nano-HOWTO.txt This patch adds kernel doc for the completion feature. An error in the split-man.pl PERL snippet in kernel-doc-nano-HOWTO.txt is also fixed. Signed-off-by: Kevin Diggs <kevdig@hypersurf.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-26 10:26:54 +02:00
Ingo Molnar	3cf430b063	Merge branch 'linus' into sched/devel	2008-08-26 10:25:59 +02:00
H. Peter Anvin	f73be6dedf	smp: have smp_call_function_single() detect invalid CPUs Have smp_call_function_single() return invalid CPU indicies and return -ENXIO. This function is already executed inside a get_cpu()..put_cpu() which locks out CPU removal, so rather than having the higher layers doing another layer of locking to guard against unplugged CPUs do the test here. Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2008-08-25 17:45:48 -07:00
Linus Torvalds	cc556c5c92	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched_clock: fix cpu_clock()	2008-08-25 11:26:02 -07:00
Linus Torvalds	ffb4ba76a2	[module] Don't let gcc inline load_module() 'load_module()' is a complex function that contains all the ELF section logic, and inlining it is utterly insane. But gcc will do it, simply because there is only one call-site. As a result, all the stack space that is allocated for all the work to load the module will still be active when we actually call the module init sequence, and the deep call chain makes stack overflows happen. And stack overflows are really hard to debug, because they not only corrupt random pages below the stack, but also corrupt the thread_info structure that is allocated under the stack. In this case, Alan Brunelle reported some crazy oopses at bootup, after loading the processor module that ends up doing complex ACPI stuff and has quite a deep callchain. This should fix it, and is the sane thing to do regardless. Cc: Alan D. Brunelle <Alan.Brunelle@hp.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-25 11:10:26 -07:00
Peter Zijlstra	354879bb97	sched_clock: fix cpu_clock() This patch fixes 3 issues: a) it removes the dependency on jiffies, because jiffies are incremented by a single CPU, and the tick is not synchronized between CPUs. Therefore relying on it to calculate a window to clip whacky TSC values doesn't work as it can drift around. So instead use [GTOD, GTOD+TICK_NSEC) as the window. b) __update_sched_clock() did (roughly speaking): delta = sched_clock() - scd->tick_raw; clock += delta; Which gives exponential growth, instead of linear. c) allows the sched_clock_cpu() value to warp the u64 without breaking. the results are more reliable sched_clock() deltas: before after sched_clock cpu_clock: 15750 51312 51488 cpu_clock: 59719 51052 50947 cpu_clock: 15879 51249 51061 cpu_clock: 1 50933 51198 cpu_clock: 1 50931 51039 cpu_clock: 1 51093 50981 cpu_clock: 1 51043 51040 cpu_clock: 1 50959 50938 cpu_clock: 1 50981 51011 cpu_clock: 1 51364 51212 cpu_clock: 1 51219 51273 cpu_clock: 1 51389 51048 cpu_clock: 1 51285 51611 cpu_clock: 1 50964 51137 cpu_clock: 1 50973 50968 cpu_clock: 1 50967 50972 cpu_clock: 1 58910 58485 cpu_clock: 1 51082 51025 cpu_clock: 1 50957 50958 cpu_clock: 1 50958 50957 cpu_clock: 1006128 51128 50971 cpu_clock: 1 51107 51155 cpu_clock: 1 51371 51081 cpu_clock: 1 51104 51365 cpu_clock: 1 51363 51309 cpu_clock: 1 51107 51160 cpu_clock: 1 51139 51100 cpu_clock: 1 51216 51136 cpu_clock: 1 51207 51215 cpu_clock: 1 51087 51263 cpu_clock: 1 51249 51177 cpu_clock: 1 51519 51412 cpu_clock: 1 51416 51255 cpu_clock: 1 51591 51594 cpu_clock: 1 50966 51374 cpu_clock: 1 50966 50966 cpu_clock: 1 51291 50948 cpu_clock: 1 50973 50867 cpu_clock: 1 50970 50970 cpu_clock: 998306 50970 50971 cpu_clock: 1 50971 50970 cpu_clock: 1 50970 50970 cpu_clock: 1 50971 50971 cpu_clock: 1 50970 50970 cpu_clock: 1 51351 50970 cpu_clock: 1 50970 51352 cpu_clock: 1 50971 50970 cpu_clock: 1 50970 50970 cpu_clock: 1 51321 50971 cpu_clock: 1 50974 51324 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-25 17:39:57 +02:00
Adrian Bunk	7a8fc9b248	removed unused #include <linux/version.h>'s This patch lets the files using linux/version.h match the files that #include it. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-23 12:14:12 -07:00
Linus Torvalds	43cc071db8	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: enable LB_BIAS by default	2008-08-22 08:36:55 -07:00
Linus Torvalds	05f57f50e0	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: fix synchronize_rcu() so that kernel-doc works	2008-08-22 08:36:42 -07:00
Oleg Nesterov	93dcf55f82	wait_task_inactive: "improve" the returned value for ->nvcsw == 0 wait_task_inactive() returns 1 when p->nvcsw == 0 \|\| p->nvcsw == 1. This means that two subsequent calls can return the same number while the task was scheduled in between. Change the code to return "nvcsw \| LONG_MIN" instead of "nvcsw ?: 1", now the overlap always needs LONG_MAX schedules. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-22 15:17:31 +02:00
Oleg Nesterov	f31e11d87a	wait_task_inactive(): don't consider task->nivcsw If wait_task_inactive() returns success the task was deactivated. In that case schedule() always increments ->nvcsw which alone can be used as a "generation counter". If the next call returns the same number, we can be sure that the task was unscheduled. Otherwise, because we know that .on_rq == 0 again, ->nvcsw should have been changed in between. Q: perhaps it is better to do "ncsw = (p->nvcsw << 1) \| 1" ? This decreases the possibility of "was it unscheduled" false positive when ->nvcsw == 0. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-22 15:17:29 +02:00
Oleg Nesterov	94d3d8247d	sched: do_wait_for_common: use signal_pending_state() Change do_wait_for_common() to use signal_pending_state() instead of open coding. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-22 15:17:28 +02:00
Ingo Molnar	a38409fbb5	dma-coherent: export dma_[alloc\|release]_from_coherent methods fixes modular builds: ERROR: "dma_alloc_from_coherent" [sound/core/snd-page-alloc.ko] undefined! ERROR: "dma_release_from_coherent" [sound/core/snd-page-alloc.ko] undefined! Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-22 08:34:53 +02:00
Anton Vorontsov	377bf1e4ac	genirq: fix irq_desc->depth handling with DEBUG_SHIRQ When DEBUG_SHIRQ is selected, a spurious IRQ is issued before the setup_irq() initializes the desc->depth. An IRQ handler may call disable_irq_nosync(), but then setup_irq() will overwrite desc->depth, and upon enable_irq() we'll catch this WARN: ------------[ cut here ]------------ Badness at kernel/irq/manage.c:180 NIP: c0061ab8 LR: c0061f10 CTR: 00000000 REGS: cf83be50 TRAP: 0700 Not tainted (2.6.27-rc3-23450-g74919b0) MSR: 00021032 <ME,IR,DR> CR: 22042022 XER: 20000000 TASK = cf829100[5] 'events/0' THREAD: cf83a000 GPR00: c0061f10 cf83bf00 cf829100 c038e674 00000016 00000000 cf83bef8 00000038 GPR08: c0298910 00000000 c0310d28 cf83a000 00000c9c 1001a1a8 0fffe000 00800000 GPR16: ffffffff 00000000 007fff00 00000000 007ffeb0 c03320a0 c031095c c0310924 GPR24: cf8292ec cf807190 cf83a000 00009032 c038e6a4 c038e674 cf99b1cc c038e674 NIP [c0061ab8] __enable_irq+0x20/0x80 LR [c0061f10] enable_irq+0x50/0x70 Call Trace: [cf83bf00] [c038e674] irq_desc+0x630/0x9000 (unreliable) [cf83bf10] [c0061f10] enable_irq+0x50/0x70 [cf83bf30] [c01abe94] phy_change+0x68/0x108 [cf83bf50] [c0046394] run_workqueue+0xc4/0x16c [cf83bf90] [c0046834] worker_thread+0x74/0xd4 [cf83bfd0] [c004ab7c] kthread+0x48/0x84 [cf83bff0] [c00135e0] kernel_thread+0x44/0x60 Instruction dump: 4e800020 3d20c031 38a94214 4bffffcc 9421fff0 7c0802a6 93e1000c 7c7f1b78 90010014 8123001c 2f890000 409e001c <0fe00000> 80010014 83e1000c 38210010 That trace corresponds to this line: WARN(1, KERN_WARNING "Unbalanced enable for IRQ %d\n", irq); The patch fixes the problem by moving the SHIRQ code below the setup_irq(). Unfortunately we can't easily move the SHIRQ code inside the setup_irq(), since it grabs a spinlock, so to prvent a 'real' IRQ from interfere us we should disable that IRQ. p.s. The driver in question is drivers/net/phy/phy.c. Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-22 08:07:21 +02:00
Roman Zippel	916c7a8551	ntp: fix ADJ_OFFSET_SS_READ bug and do_adjtimex() cleanup Thanks to the review by Michael Kerrisk a bug in the recent ADJ_OFFSET_SS_READ option was discovered, where the ntp time_offset was inadvertently set by it. This fixes this by making the adjtime code more separate from the ntp_adjtime code (both of which really want to be separate syscalls). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-22 06:40:18 +02:00
Paul E. McKenney	275a89bdd3	rcu: use irq-safe locks Some earlier tip/core/rcu patches caused RCU to incorrectly enable irqs too early in boot. This caused Yinghai's repeated-kexec testing to hit oopses, presumably due to so that device interrupts left over from the prior kernel instance (which would oops the newly booting kernel before it got a chance to reset said devices). This patch therefore converts all the local_irq_disable()s in rcuclassic.c to local_irq_save(). Besides, I never did like local_irq_disable() anyway. ;-) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-21 16:01:02 +02:00
Oleg Nesterov	d82f0b0f6f	migrate_timers: add comment, use spinlock_irq() Add the comment to explain why the double lock in migrate_timers() can't deadlock. Change the code to use spinlock_irq() instead of local_irq_disable() + spin_lock(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-21 13:34:54 +02:00
Miao Xie	3c4fbe5e01	nohz: fix wrong event handler after online an offlined cpu On the tickless system(CONFIG_NO_HZ=y and CONFIG_HIGH_RES_TIMERS=n), after I made an offlined cpu online, I found this cpu's event handler was tick_handle_periodic, not tick_nohz_handler. After debuging, I found this bug was caused by the wrong tick mode. the tick mode is not changed to NOHZ_MODE_INACTIVE when the cpu is offline. This patch fixes this bug. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-21 09:54:06 +02:00
John Stultz	2d42244ae7	clocksource: introduce CLOCK_MONOTONIC_RAW In talking with Josip Loncaric, and his work on clock synchronization (see btime.sf.net), he mentioned that for really close synchronization, it is useful to have access to "hardware time", that is a notion of time that is not in any way adjusted by the clock slewing done to keep close time sync. Part of the issue is if we are using the kernel's ntp adjusted representation of time in order to measure how we should correct time, we can run into what Paul McKenney aptly described as "Painting a road using the lines we're painting as the guide". I had been thinking of a similar problem, and was trying to come up with a way to give users access to a purely hardware based time representation that avoided users having to know the underlying frequency and mask values needed to deal with the wide variety of possible underlying hardware counters. My solution is to introduce CLOCK_MONOTONIC_RAW. This exposes a nanosecond based time value, that increments starting at bootup and has no frequency adjustments made to it what so ever. The time is accessed from userspace via the posix_clock_gettime() syscall, passing CLOCK_MONOTONIC_RAW as the clock_id. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-21 09:50:24 +02:00
Roman Zippel	9a055117d3	clocksource: introduce clocksource_forward_now() To keep the raw monotonic patch simple first introduce clocksource_forward_now(), which takes care of the offset since the last update_wall_time() call and adds it to the clock, so there is no need anymore to deal with it explicitly at various places, which need to make significant changes to the clock. This is also gets rid of the timekeeping_suspend_nsecs, instead of waiting until resume, the value is accumulated during suspend. In the end there is only a single user of __get_nsec_offset() left, so I integrated it back to getnstimeofday(). Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-21 09:50:24 +02:00
John Stultz	1aa5dfb751	clocksource: keep track of original clocksource frequency The clocksource frequency is represented by clocksource->mult/2^(clocksource->shift). Currently, when NTP makes adjustments to the clock frequency, they are made directly to the mult value. This has the drawback that once changed, we cannot know what the orignal mult value was, or how much adjustment has been applied. This property causes problems in calculating proper ntp intervals when switching back and forth between clocksources. This patch separates the current mult value into a mult and mult_orig pair. The mult_orig value stays constant, while the ntp clocksource adjustments are done only to the mult value. This allows for correct ntp interval calculation and additionally lays the groundwork for a new notion of time, what I'm calling the monotonic-raw time, which is introduced in a following patch. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-21 09:50:23 +02:00
Randy Dunlap	01dcb0443e	rcu: fix synchronize_rcu() so that kernel-doc works Fix RCU's synchronize_rcu() so that it looks like a C function, enabling it to be recognized as a function with kernel-doc annotation. Warning(linux-2.6.26-git11//kernel/rcupdate.c:81): No description found for parameter 'synchronize_rcu' Warning(linux-2.6.26-git11//kernel/rcupdate.c:81): No description found for parameter 'call_rcu' [akpm@linux-foundation.org: fix comment] Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-21 09:31:44 +02:00
Peter Zijlstra	efc2dead2c	sched: enable LB_BIAS by default Yanmin reported a significant regression on his 16-core machine due to: commit `93b75217df` Author: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Fri Jun 27 13:41:33 2008 +0200 Flip back to the old behaviour. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-21 08:18:02 +02:00
Ken Chen	2d70b68d42	fix setpriority(PRIO_PGRP) thread iterator breakage When user calls sys_setpriority(PRIO_PGRP ...) on a NPTL style multi-LWP process, only the task leader of the process is affected, all other sibling LWP threads didn't receive the setting. The problem was that the iterator used in sys_setpriority() only iteartes over one task for each process, ignoring all other sibling thread. Introduce a new macro do_each_pid_thread / while_each_pid_thread to walk each thread of a process. Convert 4 call sites in {set/get}priority and ioprio_{set/get}. Signed-off-by: Ken Chen <kenchen@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-20 15:40:32 -07:00
Jiri Kosina	1fa63a817d	printk: robustify printk, update comment Remove the comment describing the possibility of printk() deadlocking on runqueue lock. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-20 15:39:59 +02:00
Peter Zijlstra	fa33507a22	printk: robustify printk, fix #2 Dmitry Adamushko reported: > [*] btw., with DEBUG being enabled, pr_debug() generates [1] when > debug_smp_processor_id() is used (CONFIG_DEBUG_PREEMPT). > > the problem seems to be caused by the following commit: > commit `b845b517b5` > Author: Peter Zijlstra <a.p.zijlstra@chello.nl> > Date: Fri Aug 8 21:47:09 2008 +0200 > > printk: robustify printk > > > wake_up_klogd() -> __get_cpu_var() -> smp_processor_id() > > and that's being called from release_console_sem() which is, in turn, > said to be "may be called from any context" [2] > > and in this case, it seems to be called from some non-preemptible > context (although, it can't be printk()... > although, I haven't looked carefully yet). > > Provided [2], __get_cpu_var() is perhaps not the right solution there. > > > [1] > > [ 7697.942005] BUG: using smp_processor_id() in preemptible [00000000] code: syslogd/3542 > [ 7697.942005] caller is wake_up_klogd+0x1b/0x50 > [ 7697.942005] Pid: 3542, comm: syslogd Not tainted 2.6.27-rc3-tip-git #2 > [ 7697.942005] Call Trace: > [ 7697.942005] [<ffffffff8036b398>] debug_smp_processor_id+0xe8/0xf0 > [ 7697.942005] [<ffffffff80239d3b>] wake_up_klogd+0x1b/0x50 > [ 7697.942005] [<ffffffff8023a047>] release_console_sem+0x1e7/0x200 > [ 7697.942005] [<ffffffff803c0f17>] do_con_write+0xb7/0x1f30 > [ 7697.942005] [<ffffffff8020d920>] ? show_trace+0x10/0x20 > [ 7697.942005] [<ffffffff8020dc42>] ? dump_stack+0x72/0x80 > [ 7697.942005] [<ffffffff8036392d>] ? __ratelimit+0xbd/0xe0 > [ 7697.942005] [<ffffffff8036b398>] ? debug_smp_processor_id+0xe8/0xf0 > [ 7697.942005] [<ffffffff80239d3b>] ? wake_up_klogd+0x1b/0x50 > [ 7697.942005] [<ffffffff8023a047>] ? release_console_sem+0x1e7/0x200 > [ 7697.942005] [<ffffffff803c2de9>] con_write+0x19/0x30 > [ 7697.942005] [<ffffffff803b37b6>] write_chan+0x276/0x3c0 > [ 7697.942005] [<ffffffff80232b20>] ? default_wake_function+0x0/0x10 > [ 7697.942005] [<ffffffff804cb872>] ? _spin_lock_irqsave+0x22/0x50 > [ 7697.942005] [<ffffffff803b1334>] tty_write+0x194/0x260 > [ 7697.942005] [<ffffffff803b3540>] ? write_chan+0x0/0x3c0 > [ 7697.942005] [<ffffffff803b14a4>] redirected_tty_write+0xa4/0xb0 > [ 7697.942005] [<ffffffff803b1400>] ? redirected_tty_write+0x0/0xb0 > [ 7697.942005] [<ffffffff802a88c2>] do_loop_readv_writev+0x52/0x80 > [ 7697.942005] [<ffffffff802a939d>] do_readv_writev+0x1bd/0x1d0 > [ 7697.942005] [<ffffffff802a93e9>] vfs_writev+0x39/0x60 > [ 7697.942005] [<ffffffff802a9870>] sys_writev+0x50/0x90 > [ 7697.942005] [<ffffffff8020bb3b>] system_call_fastpath+0x16/0x1b Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reported-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-20 12:22:37 +02:00
Roland McGrath	1b04624f93	tracehook: fix SA_NOCLDWAIT I outwitted myself again in commit `2b2a1ff64a`, and broke the SA_NOCLDWAIT behavior so it leaks zombies. This fixes it. Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Roland McGrath <roland@redhat.com>	2008-08-19 20:37:07 -07:00
Peter Zijlstra	9a7e0b180d	sched: rt-bandwidth fixes The last patch allows sysctl_sched_rt_runtime to disable bandwidth accounting for the group scheduler - however it doesn't deal with sched_setscheduler(), which will keep tasks out of groups that have no assigned runtime. If we relax this, we get into the situation where RT tasks can get into a group when we disable bandwidth control, and then starve them by enabling it again. Rework the schedulability code to check for this condition and fail to turn on bandwidth control with -EBUSY when this situation is found. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-19 13:10:12 +02:00
Peter Zijlstra	eb755805f2	sched: extract walk_tg_tree() Extract walk_tg_tree() and make it a little more generic so we can use it in the schedulablity test. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-19 13:10:11 +02:00
Peter Zijlstra	0b148fa048	sched: rt-bandwidth group disable fixes More extensive disable of bandwidth control. It allows sysctl_sched_rt_runtime to disable full group bandwidth control. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-19 13:10:10 +02:00
Peter Zijlstra	6f0d5c390e	sched: rt-bandwidth accounting fix It fixes an accounting bug where we would continue accumulating runtime even though the bandwidth control is disabled. This would lead to very long throttle periods once bandwidth control gets turned on again. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-19 13:10:09 +02:00
Peter Zijlstra	af4491e516	sched: rt-bandwidth for user grouping interface rt_runtime is a signed value Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-19 13:10:09 +02:00
Hiroshi Shimamoto	0c925d7923	rcuclassic: fix compilation NG fix: CC kernel/rcuclassic.o kernel/rcuclassic.c: In function '__rcu_process_callbacks': kernel/rcuclassic.c:561: error: 'flags' undeclared (first use in this function) kernel/rcuclassic.c:561: error: (Each undeclared identifier is reported only once kernel/rcuclassic.c:561: error: for each function it appears in.) Declare missing variable flags. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-19 11:03:33 +02:00
Paul E. McKenney	eff9b713ee	rcu: fix locking cleanup fallout Given that the rcp->lock is now acquired from call_rcu(), which can be invoked from irq-disable regions, all acquisitions need to disable irqs. The following patch fixes this. Although I don't have any reason to believe that this is the cause of Yinghai's oops, it does need to be fixed. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-19 04:15:36 +02:00
Paul E. McKenney	ded00a56e9	rcu: remove redundant ACCESS_ONCE definition from rcupreempt.c Remove the redundant definition of ACCESS_ONCE() from rcupreempt.c in favor of the one in compiler.h. Also merge the comment header from rcupreempt.c's definition into that in compiler.h. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-18 09:45:22 +02:00
Dmitry Baryshkov	6951b12a0f	lockdep: fix spurious 'inconsistent lock state' warning Since `f82b217e35` lockdep can output spurious warnings related to hwirqs due to hardirq_off shrinkage from int to bit-sized flag. Guard it with double negation to fix the warning. Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-18 09:42:31 +02:00
Paul E. McKenney	cd95851785	rcu: fix classic RCU locking cleanup lockdep problem On Fri, Aug 15, 2008 at 04:24:30PM +0200, Ingo Molnar wrote: > > Paul, > > one of your two recent RCU patches caused this lockdep splat in -tip > testing: > > -------------------> > Brought up 2 CPUs > Total of 2 processors activated (6850.87 BogoMIPS). > PM: Adding info for No Bus:platform > khelper used greatest stack depth: 3124 bytes left > > ================================= > [ INFO: inconsistent lock state ] > 2.6.27-rc3-tip #1 > --------------------------------- > inconsistent {softirq-on-W} -> {in-softirq-W} usage. > ksoftirqd/0/4 [HC0[0]:SC1[1]:HE1:SE0] takes: > (&rcu_ctrlblk.lock){-+..}, at: [<c016d91c>] __rcu_process_callbacks+0x1ac/0x1f0 > {softirq-on-W} state was registered at: > [<c01528e4>] __lock_acquire+0x3f4/0x5b0 > [<c0152b29>] lock_acquire+0x89/0xc0 > [<c076142b>] _spin_lock+0x3b/0x70 > [<c016d649>] rcu_init_percpu_data+0x29/0x80 > [<c075e43f>] rcu_cpu_notify+0xaf/0xd0 > [<c076458d>] notifier_call_chain+0x2d/0x60 > [<c0145ede>] __raw_notifier_call_chain+0x1e/0x30 > [<c075db29>] _cpu_up+0x79/0x110 > [<c075dc0d>] cpu_up+0x4d/0x70 > [<c0a769e1>] kernel_init+0xb1/0x200 > [<c01048a3>] kernel_thread_helper+0x7/0x10 > [<ffffffff>] 0xffffffff > irq event stamp: 14 > hardirqs last enabled at (14): [<c01534db>] trace_hardirqs_on+0xb/0x10 > hardirqs last disabled at (13): [<c014dbeb>] trace_hardirqs_off+0xb/0x10 > softirqs last enabled at (0): [<c012b186>] copy_process+0x276/0x1190 > softirqs last disabled at (11): [<c0105c0a>] call_on_stack+0x1a/0x30 > > other info that might help us debug this: > no locks held by ksoftirqd/0/4. > > stack backtrace: > Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.27-rc3-tip #1 > [<c01504dc>] print_usage_bug+0x16c/0x1b0 > [<c0152455>] mark_lock+0xa75/0xb10 > [<c0108b75>] ? sched_clock+0x15/0x30 > [<c015289d>] __lock_acquire+0x3ad/0x5b0 > [<c0152b29>] lock_acquire+0x89/0xc0 > [<c016d91c>] ? __rcu_process_callbacks+0x1ac/0x1f0 > [<c076142b>] _spin_lock+0x3b/0x70 > [<c016d91c>] ? __rcu_process_callbacks+0x1ac/0x1f0 > [<c016d91c>] __rcu_process_callbacks+0x1ac/0x1f0 > [<c016d986>] rcu_process_callbacks+0x26/0x50 > [<c0132305>] __do_softirq+0x95/0x120 > [<c0132270>] ? __do_softirq+0x0/0x120 > [<c0105c0a>] call_on_stack+0x1a/0x30 > [<c0132426>] ? ksoftirqd+0x96/0x110 > [<c0132390>] ? ksoftirqd+0x0/0x110 > [<c01411f7>] ? kthread+0x47/0x80 > [<c01411b0>] ? kthread+0x0/0x80 > [<c01048a3>] ? kernel_thread_helper+0x7/0x10 > ======================= > calling init_cpufreq_transition_notifier_list+0x0/0x20 > initcall init_cpufreq_transition_notifier_list+0x0/0x20 returned 0 after 0 msecs > calling net_ns_init+0x0/0x190 > net_namespace: 676 bytes > initcall net_ns_init+0x0/0x190 returned 0 after 0 msecs > calling cpufreq_tsc+0x0/0x20 > initcall cpufreq_tsc+0x0/0x20 returned 0 after 0 msecs > calling reboot_init+0x0/0x20 > initcall reboot_init+0x0/0x20 returned 0 after 0 msecs > calling print_banner+0x0/0x10 > Booting paravirtualized kernel on bare hardware > > <----------------------- > > my guess is on: > > commit `1f7b94cd3d` > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Date: Tue Aug 5 09:21:44 2008 -0700 > > rcu: classic RCU locking and memory-barrier cleanups > > Ingo Fixes a problem detected by lockdep in which rcu->lock was acquired both in irq context and in process context, but without disabling from process context. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-17 17:38:01 +02:00
Linus Torvalds	406703f8de	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: fix build if CONFIG_PROVE_LOCKING not defined lockdep: use WARN() in kernel/lockdep.c lockdep: spin_lock_nest_lock(), checkpatch fixes lockdep: build fix	2008-08-16 17:16:07 -07:00
Linus Torvalds	c100548d46	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: scale sysctl_sched_shares_ratelimit with nr_cpus sched: fix rt-bandwidth hotplug race sched: fix the race between walk_tg_tree and sched_create_group	2008-08-16 17:15:32 -07:00
Linus Torvalds	71ef2a46fc	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: security: Fix setting of PF_SUPERPRIV by __capable()	2008-08-15 15:32:13 -07:00
Stephen Hemminger	df60a84418	lockdep: fix build if CONFIG_PROVE_LOCKING not defined If CONFIG_PROVE_LOCKING not defined, then no dependency information is available. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-15 19:22:04 +02:00
Peter Zijlstra	55cd53404c	sched: scale sysctl_sched_shares_ratelimit with nr_cpus David reported that his Niagra spend a little too much time in tg_shares_up(), which considering he has a large cpu count makes sense. So scale the ratelimit value with the number of cpus like we do for other controls as well. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-15 18:25:07 +02:00
Steven Rostedt	5802294f1b	rcu: trace fix possible mem-leak In the initialization of the RCU trace module, if rcupreempt_debugfs_init() fails, we never free the the trace buffer. This patch frees the trace buffer in case the debugfs fails. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-15 17:54:40 +02:00
Dave Chinner	be4de35263	completions: uninline try_wait_for_completion and completion_done m68k fails to build with these functions inlined in completion.h. Move them out of line into sched.c and export them to avoid this problem. Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:44 -07:00
Andrew Morton	8c5a1cf0ad	kexec: use a mutex for locking rather than xchg() Functionally the same, but more conventional. Cc: Huang Ying <ying.huang@intel.com> Tested-by: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:43 -07:00
Huang Ying	3122c33119	kexec jump: fix for ftrace Ftrace depends on some processor state that we destroyed during kexec and restored by restore_processor_state(). So save_processor_state() and restore_processor_state() are moved into machine_kexec() and ftrace is restored after restore_processor_state(). Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:43 -07:00
Huang Ying	73bd9c72a2	kexec jump: in sync with hibernation implementation Add device_pm_lock() and device_pm_unlock() in kernel_kexec() in sync with current hibernation implementation. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:42 -07:00
Huang Ying	ca195b7f6d	kexec jump: remove duplication of kexec_restart_prepare() Call kernel_restart_prepare() in kernel_kexec() instead of duplicating the code. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Pavel Machek <pavel@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:42 -07:00
Huang Ying	163f6876f5	kexec jump: rename KEXEC_CONTROL_CODE_SIZE to KEXEC_CONTROL_PAGE_SIZE Rename KEXEC_CONTROL_CODE_SIZE to KEXEC_CONTROL_PAGE_SIZE, because control page is used for not only code on some platform. For example in kexec jump, it is used for data and stack too. [akpm@linux-foundation.org: unbreak powerpc and arm, finish conversion] Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:42 -07:00
Huang Ying	7ade3fcc1f	kexec jump: clean up #ifdef and comments Move if (kexec_image->preserve_context) { ... } into #ifdef CONFIG_KEXEC_JUMP to make code looks cleaner. Fix no longer correct comments of kernel_kexec(). Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:42 -07:00
Huang Ying	4cd69b986e	kexec: fix compilation warning on xchg(&kexec_lock, 0) in kernel_kexec() kernel/kexec.c: In function 'kernel_kexec': kernel/kexec.c:1506: warning: value computed is not used Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-15 08:35:42 -07:00
Paul E. McKenney	1f7b94cd3d	rcu: classic RCU locking and memory-barrier cleanups This patch simplifies the locking and memory-barrier usage in the Classic RCU grace-period-detection mechanism, incorporating Lai Jiangshan's feedback from the earlier version (http://lkml.org/lkml/2008/8/1/400 and http://lkml.org/lkml/2008/8/3/43). Passed 10 hours of rcutorture concurrent with CPUs being put online and taken offline on a 128-hardware-thread Power machine. My apologies to whoever in the Eastern Hemisphere was planning to use this machine over the Western Hemisphere night, but it was sitting idle and... So this is ready for tip/core/rcu. This patch is in preparation for moving to a hierarchical algorithm to allow the very large SMP machines -- requested by some people at OLS, and there seem to have been a few recent patches in the 4096-CPU direction as well. The general idea is to move to a much more conservative concurrency design, then apply a hierarchy to reduce contention on the global lock by a few orders of magnitude (larger machines would see greater reductions). The reason for taking a conservative approach is that this code isn't on any fast path. Prototype in progress. This patch is against the linux-tip git tree (tip/core/rcu). If you wish to test this against 2.6.26, use the following set of patches: http://www.rdrop.com/users/paulmck/patches/2.6.26-ljsimp-1.patch http://www.rdrop.com/users/paulmck/patches/2.6.26-ljsimpfix-3.patch The first patch combines commits `5127bed588` and `3cac97cbb1` from Lai Jiangshan <laijs@cn.fujitsu.com>, and the second patch contains my changes. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-15 16:08:47 +02:00
Paul E. McKenney	293a17ebc9	rcu: prevent console flood when one CPU sees another AWOL via RCU One small change needed to keep from flooding the console when one CPU notices that another is AWOL. Unless I am missing something subtle. Otherwise the cleanups look good! Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-15 15:08:58 +02:00
Peter Zijlstra	f1679d0848	sched: fix rt-bandwidth hotplug race When we hot-unplug a cpu and rebuild the sched-domain, all cpus will be detatched. Alex observed the case where a runqueue was stealing bandwidth from an already disabled runqueue to satisfy its own needs. Stop this by skipping over already disabled runqueues. Reported-by: Alex Nixon <alex.nixon@citrix.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Alex Nixon <alex.nixon@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-14 15:50:58 +02:00
David Howells	5cd9c58fbe	security: Fix setting of PF_SUPERPRIV by __capable() Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags the target process if that is not the current process and it is trying to change its own flags in a different way at the same time. __capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried. This patch further splits security_ptrace() in two: (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent. (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child. In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV. Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable(). Of the places that were using __capable(): (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability(). (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used. (3) cap_safe_nice() only ever saw current, so now uses capable(). (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests have been switched and capable() is used instead. (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating. (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged. I've tested this with the LTP SELinux and syscalls testscripts. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>	2008-08-14 22:59:43 +10:00
Ingo Molnar	51ca3c6791	Merge branch 'linus' into x86/core Conflicts: arch/x86/kernel/genapic_64.c include/asm-x86/kvm_host.h Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-14 14:58:01 +02:00
Max Krasnyansky	cf417141cb	sched, cpuset: rework sched domains and CPU hotplug handling (v4) This is an updated version of my previous cpuset patch on top of the latest mainline git. The patch fixes CPU hotplug handling issues in the current cpusets code. Namely circular locking in rebuild_sched_domains() and unsafe access to the cpu_online_map in the cpuset cpu hotplug handler. This version includes changes suggested by Paul Jackson (naming, comments, style, etc). I also got rid of the separate workqueue thread because it is now safe to call get_online_cpus() from workqueue callbacks. Here are some more details: rebuild_sched_domains() is the only way to rebuild sched domains correctly based on the current cpuset settings. What this means is that we need to be able to call it from different contexts, like cpu hotplug for example. Also latest scheduler code in -tip now calls rebuild_sched_domains() directly from functions like arch_reinit_sched_domains(). In order to support that properly we need to rework cpuset locking rules to avoid circular dependencies, which is what this patch does. New lock nesting rules are explained in the comments. We can now safely call rebuild_sched_domains() from virtually any context. The only requirement is that it needs to be called under get_online_cpus(). This allows cpu hotplug handlers and the scheduler to call rebuild_sched_domains() directly. The rest of the cpuset code now offloads sched domains rebuilds to a workqueue (async_rebuild_sched_domains()). This version of the patch addresses comments from the previous review. I fixed all miss-formated comments and trailing spaces. I also factored out the code that builds domain masks and split up CPU and memory hotplug handling. This was needed to simplify locking, to avoid unsafe access to the cpu_online_map from mem hotplug handler, and in general to make things cleaner. The patch passes moderate testing (building kernel with -j 16, creating & removing domains and bringing cpus off/online at the same time) on the quad-core2 based machine. It passes lockdep checks, even with preemptable RCU enabled. This time I also tested in with suspend/resume path and everything is working as expected. Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Acked-by: Paul Jackson <pj@sgi.com> Cc: menage@google.com Cc: a.p.zijlstra@chello.nl Cc: vegard.nossum@gmail.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-14 11:23:51 +02:00
Zhang, Yanmin	09f2724a78	sched: fix the race between walk_tg_tree and sched_create_group With 2.6.27-rc3, I hit a kernel panic when running volanoMark on my new x86_64 machine. I also hit it with other 2.6.27-rc kernels. See below log. Basically, function walk_tg_tree and sched_create_group have a race between accessing and initiating tg->children. Below patch fixes it by moving tg->children initiation to the front of linking tg->siblings to parent->children. {----------------panic log------------} BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: [<ffffffff802292ab>] walk_tg_tree+0x45/0x7f PGD 1be1c4067 PUD 1bdd8d067 PMD 0 Oops: 0000 [1] SMP CPU 11 Modules linked in: igb Pid: 22979, comm: java Not tainted 2.6.27-rc3 #1 RIP: 0010:[<ffffffff802292ab>] [<ffffffff802292ab>] walk_tg_tree+0x45/0x7f RSP: 0018:ffff8801bfbbbd18 EFLAGS: 00010083 RAX: 0000000000000000 RBX: ffff8800be0dce40 RCX: ffffffffffffffc0 RDX: ffff880102c43740 RSI: 0000000000000000 RDI: ffff8800be0dce40 RBP: ffff8801bfbbbd48 R08: ffff8800ba437bc8 R09: 0000000000001f40 R10: ffff8801be812100 R11: ffffffff805fdf44 R12: ffff880102c43740 R13: 0000000000000000 R14: ffffffff8022cf0f R15: ffffffff8022749f FS: 00000000568ac950(0063) GS:ffff8801bfa26d00(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000001bd848000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process java (pid: 22979, threadinfo ffff8801b145a000, task ffff8801bf18e450) Stack: 0000000000000001 ffff8800ba5c8d60 0000000000000001 0000000000000001 ffff8800bad1ccb8 0000000000000000 ffff8801bfbbbd98 ffffffff8022ed37 0000000000000001 0000000000000286 ffff8801bd5ee180 ffff8800ba437bc8 Call Trace: <IRQ> [<ffffffff8022ed37>] try_to_wake_up+0x71/0x24c [<ffffffff80247177>] autoremove_wake_function+0x9/0x2e [<ffffffff80228039>] ? __wake_up_common+0x46/0x76 [<ffffffff802296d5>] __wake_up+0x38/0x4f [<ffffffff806169cc>] tcp_v4_rcv+0x380/0x62e Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-14 10:58:48 +02:00
Arjan van de Ven	2df8b1d656	lockdep: use WARN() in kernel/lockdep.c Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes part of the warning section for better reporting/collection. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-08-13 19:06:46 +02:00
Andrew Morton	c72f4573a5	lockdep: spin_lock_nest_lock(), checkpatch fixes fix: WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable #46: FILE: kernel/spinlock.c:326: +EXPORT_SYMBOL(_spin_lock_nest_lock); total: 0 errors, 1 warnings, 26 lines checked Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-13 13:56:51 +02:00
Ingo Molnar	73909f7a66	Merge commit 'v2.6.27-rc3' into core/urgent	2008-08-13 13:56:44 +02:00
Ingo Molnar	d6672c5018	lockdep: build fix fix: kernel/built-in.o: In function `lockdep_stats_show': lockdep_proc.c:(.text+0x3cb2f): undefined reference to `lockdep_count_forward_deps' kernel/built-in.o: In function `l_show': lockdep_proc.c:(.text+0x3d02b): undefined reference to `lockdep_count_forward_deps' lockdep_proc.c:(.text+0x3d047): undefined reference to `lockdep_count_backward_deps' Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-13 12:55:10 +02:00
Alexey Dobriyan	f18e439d10	genirq: switch /proc/irq//smp_affinity et al to seqfiles Switch /proc/irq//smp_affinity , /proc/irq/default_smp_affinity to seq_files. cat(1) reads with 1024 chunks by default, with high enough NR_CPUS, there will be -EINVAL. As side effect, there are now two less users of the ->read_proc interface. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Paul Jackson <pj@sgi.com> Cc: Mike Travis <travis@sgi.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:30 -07:00
Heiko Carstens	3ee1062b4e	cpu hotplug: s390 doesn't support additional_cpus anymore. s390 doesn't support the additional_cpus kernel parameter anymore since a long time. So we better update the code and documentation to reflect that. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-12 16:07:28 -07:00
Linus Torvalds	96348852cf	Merge branch 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask(), fix	2008-08-12 08:49:53 -07:00
Linus Torvalds	1c89ac5501	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: fix spinlock recursion in hvc_console stop_machine: remove unused variable modules: extend initcall_debug functionality to the module loader export virtio_rng.h lguest: use get_user_pages_fast() instead of get_user_pages() mm: Make generic weak get_user_pages_fast and EXPORT_GPL it lguest: don't set MAC address for guest unless specified	2008-08-12 08:40:19 -07:00
Nick Piggin	c2fc11985d	generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask(), fix > > Nick Piggin (1): > > generic-ipi: fix stack and rcu interaction bug in > > smp_call_function_mask() > > I'm still not 100% sure that I have this patch right... I might have seen > a lockup trace implicating the smp call function path... which may have > been due to some other problem or a different bug in the new call function > code, but if some more people can take a look at it before merging? OK indeed it did have a couple of bugs. Firstly, I wasn't freeing the data properly in the alloc && wait case. Secondly, I wasn't resetting CSD_FLAG_WAIT in the for each cpu loop (so only the first CPU would wait). After those fixes, the patch boots and runs with the kmalloc commented out (so it always executes the slowpath). Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-12 11:21:27 +02:00
Li Zefan	ed6d68763b	stop_machine: remove unused variable Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-08-12 17:52:55 +10:00
Arjan van de Ven	59f9415ffb	modules: extend initcall_debug functionality to the module loader The kernel has this really nice facility where if you put "initcall_debug" on the kernel commandline, it'll print which function it's going to execute just before calling an initcall, and then after the call completes it will 1) print if it had an error code 2) checks for a few simple bugs (like leaving irqs off) and 3) print how long the init call took in milliseconds. While trying to optimize the boot speed of my laptop, I have been loving number 3 to figure out what to optimize... ... and then I wished that the same thing was done for module loading. This patch makes the module loader use this exact same functionality; it's a logical extension in my view (since modules are just sort of late binding initcalls anyway) and so far I've found it quite useful in finding where things are too slow in my boot. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-08-12 17:52:54 +10:00
Linus Torvalds	1ea2950884	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, cpu hotplug: fix set_cpus_allowed() use in hotplug callbacks sched: fix mysql+oltp regression sched_clock: delay using sched_clock() sched clock: couple local and remote clocks sched clock: simplify __update_sched_clock() sched: eliminate scd->prev_raw sched clock: clean up sched_clock_cpu() sched clock: revert various sched_clock() changes sched: move sched_clock before first use sched: test runtime rather than period in global_rt_runtime() sched: fix SCHED_HRTICK dependency sched: fix warning in hrtick_start_fair()	2008-08-11 16:46:31 -07:00
Linus Torvalds	67a077dca4	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-timers: fix posix_timer_event() vs dequeue_signal() race posix-timers: do_schedule_next_timer: fix the setting of ->si_overrun	2008-08-11 16:46:11 -07:00
Linus Torvalds	9b4d0bab32	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: fix debug_lock_alloc lockdep: increase MAX_LOCKDEP_KEYS generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask() lockdep: fix overflow in the hlock shrinkage code lockdep: rename map_[acquire\|release]() => lock_map_[acquire\|release]() lockdep: handle chains involving classes defined in modules mm: fix mm_take_all_locks() locking order lockdep: annotate mm_take_all_locks() lockdep: spin_lock_nest_lock() lockdep: lock protection locks lockdep: map_acquire lockdep: shrink held_lock structure lockdep: re-annotate scheduler runqueues lockdep: lock_set_subclass - reset a held lock's subclass lockdep: change scheduler annotation debug_locks: set oops_in_progress if we will log messages. lockdep: fix combinatorial explosion in lock subgraph traversal	2008-08-11 16:45:46 -07:00
Ingo Molnar	23a0ee908c	Merge branch 'core/locking' into core/urgent	2008-08-12 00:11:49 +02:00
Ingo Molnar	e26b33e955	Merge branch 'sched/clock' into sched/urgent	2008-08-12 00:07:02 +02:00
Peter Zijlstra	0f2bc27be2	lockdep: fix debug_lock_alloc When we enable DEBUG_LOCK_ALLOC but do not enable PROVE_LOCKING and or LOCK_STAT, lock_alloc() and lock_release() turn into nops, even though we should be doing hlock checking (check=1). This causes a false warning and a lockdep self-disable. Rectify this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 22:45:51 +02:00
Dmitry Adamushko	279ef6bbb8	sched, cpu hotplug: fix set_cpus_allowed() use in hotplug callbacks Mark Langsdorf reported: > One of my co-workers noticed that the powernow-k8 > driver no longer restarts when a CPU core is > hot-disabled and then hot-enabled on AMD quad-core > systems. > > The following comands work fine on 2.6.26 and fail > on 2.6.27-rc1: > > echo 0 > /sys/devices/system/cpu/cpu3/online > echo 1 > /sys/devices/system/cpu/cpu3/online > find /sys -name cpufreq > > For 2.6.26, the find will return a cpufreq > directory for each processor. In 2.6.27-rc1, > the cpu3 directory is missing. > > After digging through the code, the following > logic is failing when the core is hot-enabled > at runtime. The code works during the boot > sequence. > > cpumask_t = current->cpus_allowed; > set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu)); > if (smp_processor_id() != cpu) > return -ENODEV; So set the CPU active before calling the CPU_ONLINE notifier chain, there are a handful of notifiers that use set_cpus_allowed(). This fix also solves the problem with x86-microcode. I've sent alternative patches for microcode, but as this "rely on set_cpus_allowed_ptr() being workable in cpu-hotplug(CPU_ONLINE, ...)" assumption seems to be more broad than what we thought, perhaps this fix should be applied. With this patch we define that by the moment CPU_ONLINE is being sent, a 'cpu' is online and ready for tasks to be migrated onto it. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Reported-by: Mark Langsdorf <mark.langsdorf@amd.com> Tested-by: Mark Langsdorf <mark.langsdorf@amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 16:32:41 +02:00
Nick Piggin	cc7a486cac	generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask() * Venki Pallipadi <venkatesh.pallipadi@intel.com> wrote: > Found a OOPS on a big SMP box during an overnight reboot test with > upstream git. > > Suresh and I looked at the oops and looks like the root cause is in > generic_smp_call_function_interrupt() and smp_call_function_mask() with > wait parameter. > > The actual oops looked like > > [ 11.277260] BUG: unable to handle kernel paging request at ffff8802ffffffff > [ 11.277815] IP: [<ffff8802ffffffff>] 0xffff8802ffffffff > [ 11.278155] PGD 202063 PUD 0 > [ 11.278576] Oops: 0010 [1] SMP > [ 11.279006] CPU 5 > [ 11.279336] Modules linked in: > [ 11.279752] Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00020-g685d87f #290 > [ 11.280039] RIP: 0010:[<ffff8802ffffffff>] [<ffff8802ffffffff>] 0xffff8802ffffffff > [ 11.280692] RSP: 0018:ffff88027f1f7f70 EFLAGS: 00010086 > [ 11.280976] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000 > [ 11.281264] RDX: 0000000000004f4e RSI: 0000000000000001 RDI: 0000000000000000 > [ 11.281624] RBP: ffff88027f1f7f98 R08: 0000000000000001 R09: ffffffff802509af > [ 11.281925] R10: ffff8800280c2780 R11: 0000000000000000 R12: ffff88027f097d48 > [ 11.282214] R13: ffff88027f097d70 R14: 0000000000000005 R15: ffff88027e571000 > [ 11.282502] FS: 0000000000000000(0000) GS:ffff88027f1c3340(0000) knlGS:0000000000000000 > [ 11.283096] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > [ 11.283382] CR2: ffff8802ffffffff CR3: 0000000000201000 CR4: 00000000000006e0 > [ 11.283760] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 11.284048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 11.284337] Process swapper (pid: 0, threadinfo ffff88027f1f2000, task ffff88027f1f0640) > [ 11.284936] Stack: ffffffff80250963 0000000000000212 0000000000ee8c78 0000000000ee8a66 > [ 11.285802] ffff88027e571550 ffff88027f1f7fa8 ffffffff8021adb5 ffff88027f1f3e40 > [ 11.286599] ffffffff8020bdd6 ffff88027f1f3e40 <EOI> ffff88027f1f3ef8 0000000000000000 > [ 11.287120] Call Trace: > [ 11.287768] <IRQ> [<ffffffff80250963>] ? generic_smp_call_function_interrupt+0x61/0x12c > [ 11.288354] [<ffffffff8021adb5>] smp_call_function_interrupt+0x17/0x27 > [ 11.288744] [<ffffffff8020bdd6>] call_function_interrupt+0x66/0x70 > [ 11.289030] <EOI> [<ffffffff8024ab3b>] ? clockevents_notify+0x19/0x73 > [ 11.289380] [<ffffffff803b9b75>] ? acpi_idle_enter_simple+0x18b/0x1fa > [ 11.289760] [<ffffffff803b9b6b>] ? acpi_idle_enter_simple+0x181/0x1fa > [ 11.290051] [<ffffffff8053aeca>] ? cpuidle_idle_call+0x70/0xa2 > [ 11.290338] [<ffffffff80209f61>] ? cpu_idle+0x5f/0x7d > [ 11.290723] [<ffffffff8060224a>] ? start_secondary+0x14d/0x152 > [ 11.291010] > [ 11.291287] > [ 11.291654] Code: Bad RIP value. > [ 11.292041] RIP [<ffff8802ffffffff>] 0xffff8802ffffffff > [ 11.292380] RSP <ffff88027f1f7f70> > [ 11.292741] CR2: ffff8802ffffffff > [ 11.310951] ---[ end trace 137c54d525305f1c ]--- > > The problem is with the following sequence of events: > > - CPU A calls smp_call_function_mask() for CPU B with wait parameter > - CPU A sets up the call_function_data on the stack and does an rcu add to > call_function_queue > - CPU A waits until the WAIT flag is cleared > - CPU B gets the call function interrupt and starts going through the > call_function_queue > - CPU C also gets some other call function interrupt and starts going through > the call_function_queue > - CPU C, which is also going through the call_function_queue, starts referencing > CPU A's stack, as that element is still in call_function_queue > - CPU B finishes the function call that CPU A set up and as there are no other > references to it, rcu deletes the call_function_data (which was from CPU A > stack) > - CPU B sees the wait flag and just clears the flag (no call_rcu to free) > - CPU A which was waiting on the flag continues executing and the stack > contents change > > - CPU C is still in rcu_read section accessing the CPU A's stack sees > inconsistent call_funation_data and can try to execute > function with some random pointer, causing stack corruption for A > (by clearing the bits in mask field) and oops. Nice debugging work. I'd suggest something like the attached (boot tested) patch as the simple fix for now. I expect the benefits from the less synchronized, multiple-in-flight-data global queue will still outweigh the costs of dynamic allocations. But if worst comes to worst then we just go back to a globally synchronous one-at-a-time implementation, but that would be pretty sad! Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 15:21:28 +02:00
Mike Galbraith	77ae651347	sched: fix mysql+oltp regression Defer commit `6d299f1b53` to the next release. Testing of the tip/sched/clock tree revealed a mysql+oltp regression which bisection eventually traced back to this commit in mainline. Pertinent test results: Three run sysbench averages, throughput units in read/write requests/sec. clients 1 2 4 8 16 32 64 `6e0534f` 9646 17876 34774 33868 32230 30767 29441 2.6.26.1 9112 17936 34652 33383 31929 30665 29232 `6d299f1` 9112 14637 28370 33339 32038 30762 29204 Note: subsequent commits hide the majority of this regression until you apply the clock fixes, at which time it reemerges at full magnitude. We cannot see anything bad about the change itself so we defer it to the next release until this problem is fully analysed. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 14:49:29 +02:00
Peter Zijlstra	b845b517b5	printk: robustify printk Avoid deadlocks against rq->lock and xtime_lock by deferring the klogd wakeup by polling from the timer tick. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 13:46:53 +02:00
Ingo Molnar	251a169c69	Merge branch 'linus' into sched/urgent	2008-08-11 13:40:56 +02:00
Ingo Molnar	78635fc739	rcu, debug: detect stalled grace periods, cleanups small cleanups. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 13:35:18 +02:00
Paul E. McKenney	67182ae1c4	rcu, debug: detect stalled grace periods this is a diagnostic patch for Classic RCU. The approach is to record a timestamp at the beginning of the grace period (in rcu_start_batch()), then have rcu_check_callbacks() complain if: 1. it is running on a CPU that has holding up grace periods for a long time (say one second). This will identify the culprit assuming that the culprit has not disabled hardware irqs, instruction execution, or some such. 2. it is running on a CPU that is not holding up grace periods, but grace periods have been held up for an even longer time (say two seconds). It is enabled via the default-off CONFIG_DEBUG_RCU_STALL kernel parameter. Rather than exponential backoff, it backs off to once per 30 seconds. My feeling upon thinking on it was that if you have stalled RCU grace periods for that long, a few extra printk() messages are probably the least of your worries... Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: David Witbrodt <dawitbro@sbcglobal.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 13:35:18 +02:00
Ingo Molnar	c4c0c56a7a	Merge branch 'linus' into core/rcu	2008-08-11 13:27:47 +02:00
Ingo Molnar	3295f0ef9f	lockdep: rename map_[acquire\|release]() => lock_map_[acquire\|release]() the names were too generic: drivers/uio/uio.c:87: error: expected identifier or '(' before 'do' drivers/uio/uio.c:87: error: expected identifier or '(' before 'while' drivers/uio/uio.c:113: error: 'map_release' undeclared here (not in a function) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 10:30:30 +02:00
Rabin Vincent	8bfe0298f7	lockdep: handle chains involving classes defined in modules Solve this by marking the classes as unused and not printing information about the unused classes. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 09:30:26 +02:00
Peter Zijlstra	b7d39aff91	lockdep: spin_lock_nest_lock() Expose the new lock protection lock. This can be used to annotate places where we take multiple locks of the same class and avoid deadlocks by always taking another (top-level) lock first. NOTE: we're still bound to the MAX_LOCK_DEPTH (48) limit. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 09:30:24 +02:00
Peter Zijlstra	7531e2f34d	lockdep: lock protection locks On Fri, 2008-08-01 at 16:26 -0700, Linus Torvalds wrote: > On Fri, 1 Aug 2008, David Miller wrote: > > > > Taking more than a few locks of the same class at once is bad > > news and it's better to find an alternative method. > > It's not always wrong. > > If you can guarantee that anybody that takes more than one lock of a > particular class will always take a single top-level lock _first_, then > that's all good. You can obviously screw up and take the same lock _twice_ > (which will deadlock), but at least you cannot get into ABBA situations. > > So maybe the right thing to do is to just teach lockdep about "lock > protection locks". That would have solved the multi-queue issues for > networking too - all the actual network drivers would still have taken > just their single queue lock, but the one case that needs to take all of > them would have taken a separate top-level lock first. > > Never mind that the multi-queue locks were always taken in the same order: > it's never wrong to just have some top-level serialization, and anybody > who needs to take <n> locks might as well do <n+1>, because they sure as > hell aren't going to be on _any_ fastpaths. > > So the simplest solution really sounds like just teaching lockdep about > that one special case. It's not "nesting" exactly, although it's obviously > related to it. Do as Linus suggested. The lock protection lock is called nest_lock. Note that we still have the MAX_LOCK_DEPTH (48) limit to consider, so anything that spills that it still up shit creek. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 09:30:24 +02:00
Peter Zijlstra	4f3e7524b2	lockdep: map_acquire Most the free-standing lock_acquire() usages look remarkably similar, sweep them into a new helper. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 09:30:23 +02:00
Dave Jones	f82b217e35	lockdep: shrink held_lock structure struct held_lock { u64 prev_chain_key; /* 0 8 / struct lock_class class; /* 8 8 / long unsigned int acquire_ip; / 16 8 / struct lockdep_map instance; /* 24 8 / int irq_context; / 32 4 / int trylock; / 36 4 / int read; / 40 4 / int check; / 44 4 / int hardirqs_off; / 48 4 / / size: 56, cachelines: 1 / / padding: 4 / / last cacheline: 56 bytes / }; struct held_lock { u64 prev_chain_key; / 0 8 / long unsigned int acquire_ip; / 8 8 / struct lockdep_map instance; /* 16 8 / unsigned int class_idx:11; / 24:21 4 / unsigned int irq_context:2; / 24:19 4 / unsigned int trylock:1; / 24:18 4 / unsigned int read:2; / 24:16 4 / unsigned int check:2; / 24:14 4 / unsigned int hardirqs_off:1; / 24:13 4 / / size: 32, cachelines: 1 / / padding: 4 / / bit_padding: 13 bits / / last cacheline: 32 bytes */ }; [mingo@elte.hu: shrunk hlock->class too] [peterz@infradead.org: fixup bit sizes] Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-08-11 09:30:23 +02:00
Peter Zijlstra	1b12bbc747	lockdep: re-annotate scheduler runqueues Instead of using a per-rq lock class, use the regular nesting operations. However, take extra care with double_lock_balance() as it can release the already held rq->lock (and therefore change its nesting class). So what can happen is: spin_lock(rq->lock); // this rq subclass 0 double_lock_balance(rq, other_rq); // release rq // acquire other_rq->lock subclass 0 // acquire rq->lock subclass 1 spin_unlock(other_rq->lock); leaving you with rq->lock in subclass 1 So a subsequent double_lock_balance() call can try to nest a subclass 1 lock while already holding a subclass 1 lock. Fix this by introducing double_unlock_balance() which releases the other rq's lock, but also re-sets the subclass for this rq's lock to 0. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 09:30:22 +02:00
Peter Zijlstra	64aa348edc	lockdep: lock_set_subclass - reset a held lock's subclass this can be used to reset a held lock's subclass, for arbitrary-depth iterated data structures such as trees or lists which have per-node locks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 09:30:21 +02:00
Ingo Molnar	cf206bffbb	Merge branch 'linus' into sched/clock	2008-08-11 08:59:21 +02:00
Peter Zijlstra	c1955a3d47	sched_clock: delay using sched_clock() Some arch's can't handle sched_clock() being called too early - delay this until sched_clock_init() has been called. Reported-by: Bill Gatliff <bgat@billgatliff.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Nishanth Aravamudan <nacc@us.ibm.com> CC: Russell King - ARM Linux <linux@arm.linux.org.uk> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-11 08:59:03 +02:00
Dmitry Baryshkov	cb3952bf78	DMA: make dma-coherent.c documentation kdoc-friendly Spotted by Randy. Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2008-08-07 06:52:01 -07:00
Richard Hughes	bf1db69fbf	pm_qos: spelling fixes A documentation cleanup patch. With a minor tweak to clarify units for kbs. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: mark gross <mgross@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-05 14:33:50 -07:00
Jan Beulich	d2dc1f4adb	dma: fix order calculation in dma_mark_declared_memory_occupied() get_order() takes byte-sized input, not a page-granular one. Irrespective of this fix I'm inclined to believe that this doesn't work right anyway - bitmap_allocate_region() has an implicit assumption of 'pos' being suitable for 'order', which this function doesn't seem to enforce (and since it's being called with a byte-granular value there's no reason to believe that the callers would make sure device_addr is passed accordingly - it's also not documented that way). Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dmitry Baryshkov <dbaryshkov@gmail.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-05 14:33:49 -07:00
David Brownell	c69ad71bcd	genirq: better warning on irqchip->set_type() failure While I'm glad to finally see the hole fixed whereby passing an invalid IRQ trigger type to request_irq() would be ignored, the current diagnostic isn't quite useful. Fixed by also listing the trigger type which was rejected. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-05 14:33:47 -07:00
Oleg Nesterov	5b2becc8cf	semaphore: __down_common: use signal_pending_state() Change __down_common() to use signal_pending_state() instead of open coding. The changes in kernel/semaphore.o are just artifacts, the state checks are optimized away. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-05 14:33:47 -07:00
Tom Zanussi	3219445033	relay: fix "full buffer with exactly full last subbuffer" accounting problem In relay's current read implementation, if the buffer is completely full but hasn't triggered the buffer-full condition (i.e. the last write didn't cross the subbuffer boundary) and the last subbuffer is exactly full, the subbuffer accounting code erroneously finds nothing available. This patch fixes the problem. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Andrea Righi <righi.andrea@gmail.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-05 14:33:46 -07:00
Linus Torvalds	b13ad6f47c	Merge branch 'audit.b56' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b56' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: Re: [PATCH] Fix the kernel panic of audit_filter_task when key field is set	2008-08-04 17:21:38 -07:00
Jeremy Fitzhardinge	725aad24c3	__sched_setscheduler: don't do any policy checks when not "user" The "user" parameter to __sched_setscheduler indicates whether the change is being done on behalf of a user process or not. If not, we shouldn't apply any permissions checks, so don't call security_task_setscheduler(). Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Tested-by: Steve Wise <swise@opengridcomputing.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-04 17:16:20 -07:00
zhangxiliang	1a61c88def	Re: [PATCH] Fix the kernel panic of audit_filter_task when key field is set Sorry, I miss a blank between if and "(". And I add "unlikely" to check "ctx" in audit_match_perm() and audit_match_filetype(). This is a new patch for it. Signed-off-by: Zhang Xiliang <zhangxiliang@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-04 06:13:50 -04:00
Roland McGrath	5c7edcd7ee	tracehook: fix exit_signal=0 case My commit `2b2a1ff64a` introduced a regression (sorry about that) for the odd case of exit_signal=0 (e.g. clone_flags=0). This is not a normal use, but it's used by a case in the glibc test suite. Dying with exit_signal=0 sends no signal, but it's supposed to wake up a parent's blocked wait*() calls (unlike the delayed_group_leader case). This fixes tracehook_notify_death() and its caller to distinguish a "signal 0" wakeup from the delayed_group_leader case (with no wakeup). Signed-off-by: Roland McGrath <roland@redhat.com> Tested-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-08-01 12:01:11 -07:00
Linus Torvalds	5adf2b03d9	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: locking: fix mutex @key parameter kernel-doc notation	2008-08-01 11:52:39 -07:00
Linus Torvalds	31582b094d	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb: fix gdb serial thread queries kgdb: fix kgdb_validate_break_address to perform a mem write kgdb: remove the requirement for CONFIG_FRAME_POINTER	2008-08-01 11:45:09 -07:00
zhangxiliang	20c6aaa39a	[PATCH] Fix the bug of using AUDIT_STATUS_RATE_LIMIT when set fail, no error output. When the "status_get->mask" is "AUDIT_STATUS_RATE_LIMIT \|\| AUDIT_STATUS_BACKLOG_LIMIT". If "audit_set_rate_limit" fails and "audit_set_backlog_limit" succeeds, the "err" value will be greater than or equal to 0. It will miss the failure of rate set. Signed-off-by: Zhang Xiliang <zhangxiliang@cn.fujitsu.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 12:15:16 -04:00
zhangxiliang	980dfb0db3	[PATCH] Fix the kernel panic of audit_filter_task when key field is set When calling audit_filter_task(), it calls audit_filter_rules() with audit_context is NULL. If the key field is set, the result in audit_filter_rules() will be set to 1 and ctx->filterkey will be set to key. But the ctx is NULL in this condition, so kernel will panic. Signed-off-by: Zhang Xiliang <zhangxiliang@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 12:15:03 -04:00
zhangxiliang	036bbf76ad	Re: [PATCH] the loginuid field should be output in all AUDIT_CONFIG_CHANGE audit messages > shouldn't these be using the "audit_get_loginuid(current)" and if we > are going to output loginuid we also should be outputting sessionid Thanks for your detailed explanation. I have made a new patch for outputing "loginuid" and "sessionid" by audit_get_loginuid(current) and audit_get_sessionid(current). If there are some deficiencies, please give me your indication. Signed-off-by: Zhang Xiliang <zhangxiliang@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 12:15:03 -04:00
Vesa-Matti J Kari	1d6c9649e2	kernel/audit.c control character detection is off-by-one Hello, According to my understanding there is an off-by-one bug in the function: audit_string_contains_control() in: kernel/audit.c Patch is included. I do not know from how many places the function is called from, but for example, SELinux Access Vector Cache tries to log untrusted filenames via call path: avc_audit() audit_log_untrustedstring() audit_log_n_untrustedstring() audit_string_contains_control() If audit_string_contains_control() detects control characters, then the string is hex-encoded. But the hex=0x7f dec=127, DEL-character, is not detected. I guess this could have at least some minor security implications, since a user can create a filename with 0x7f in it, causing logged filename to possibly look different when someone reads it on the terminal. Signed-off-by: Vesa-Matti Kari <vmkari@cc.helsinki.fi> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 12:05:35 -04:00
Eric Paris	ee1d315663	[PATCH] Audit: Collect signal info when SIGUSR2 is sent to auditd Makes the kernel audit subsystem collect information about the sending process when that process sends SIGUSR2 to the userspace audit daemon. SIGUSR2 is a new interesting signal to auditd telling auditd that it should try to start logging to disk again and the error condition which caused it to stop logging to disk (usually out of space) has been rectified. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 12:05:32 -04:00
Jason Wessel	25fc999913	kgdb: fix gdb serial thread queries The command "info threads" did not work correctly with kgdb. It would result in a silent kernel hang if used. This patach addresses several problems. - Fix use of deprecated NR_CPUS - Fix kgdb to not walk linearly through the pid space - Correctly implement shadow pids - Change the threads per query to a #define - Fix kgdb_hex2long to work with negated values The threads 0 and -1 are reserved to represent the current task. That means that CPU 0 will start with a shadow thread id of -2, and CPU 1 will have a shadow thread id of -3, etc... From the debugger you can switch to a shadow thread to see what one of the other cpus was doing, however it is not possible to execute run control operations on any other cpu execept the cpu executing the kgdb_handle_exception(). Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2008-08-01 08:39:35 -05:00
Jason Wessel	a9b60bf4c2	kgdb: fix kgdb_validate_break_address to perform a mem write A regression to the kgdb core was found in the case of using the CONFIG_DEBUG_RODATA kernel option. When this option is on, a breakpoint cannot be written into any readonly memory page. When an external debugger requests a breakpoint to get set, the kgdb_validate_break_address() was only checking to see if the address to place the breakpoint was readable and lacked a write check. This patch changes the validate routine to try reading (via the breakpoint set request) and also to try immediately writing the break point. If either fails, an error is correctly returned and the debugger behaves correctly. Then an end user can make the descision to use hardware breakpoints. Also update the documentation to reflect that using CONFIG_DEBUG_RODATA will inhibit the use of software breakpoints. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2008-08-01 08:39:34 -05:00
Peter Zijlstra	5e710e37bd	lockdep: change scheduler annotation While thinking about David's graph walk lockdep patch it _finally_ dawned on me that there is no reason we have a lock class per cpu ... Sorry for being dense :-/ The below changes the annotation from a lock class per cpu, to a single nested lock, as the scheduler never holds more that 2 rq locks at a time anyway. If there was code requiring holding all rq locks this would not work and the original annotation would be the only option, but that not being the case, this is a much lighter one. Compiles and boots on a 2-way x86_64. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-08-01 10:46:48 +02:00
David Miller	419ca3f135	lockdep: fix combinatorial explosion in lock subgraph traversal When we traverse the graph, either forwards or backwards, we are interested in whether a certain property exists somewhere in a node reachable in the graph. Therefore it is never necessary to traverse through a node more than once to get a correct answer to the given query. Take advantage of this property using a global ID counter so that we need not clear all the markers in all the lock_class entries before doing a traversal. A new ID is choosen when we start to traverse, and we continue through a lock_class only if it's ID hasn't been marked with the new value yet. This short-circuiting is essential especially for high CPU count systems. The scheduler has a runqueue per cpu, and needs to take two runqueue locks at a time, which leads to long chains of backwards and forwards subgraphs from these runqueue lock nodes. Without the short-circuit implemented here, a graph traversal on a runqueue lock can take up to (1 << (N - 1)) checks on a system with N cpus. For anything more than 16 cpus or so, lockdep will eventually bring the machine to a complete standstill. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-31 18:38:28 +02:00
Ingo Molnar	6679ce6e5f	Merge branch 'linus' into sched/urgent	2008-07-31 18:34:22 +02:00
Ingo Molnar	4a273f209c	sched clock: couple local and remote clocks When taking the time of a remote CPU, use the opportunity to couple (sync) the clocks to each other. (in a monotonic way) Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de>	2008-07-31 17:21:01 +02:00
Ingo Molnar	56b906126d	sched clock: simplify __update_sched_clock() - return the current clock instead of letting callers fetch it from scd->clock Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de>	2008-07-31 17:20:55 +02:00
Ingo Molnar	18e4e36c66	sched: eliminate scd->prev_raw eliminate prev_raw and use tick_raw instead. It's enough to base the current time on the scheduler tick timestamp alone - the monotonicity and maximum checks will prevent any damage. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de>	2008-07-31 17:20:49 +02:00
Ingo Molnar	50526968e9	sched clock: clean up sched_clock_cpu() - simplify the remote clock rebasing Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de>	2008-07-31 17:20:42 +02:00
Ingo Molnar	e4e4e534fa	sched clock: revert various sched_clock() changes Found an interactivity problem on a quad core test-system - simple CPU loops would occasionally delay the system un an unacceptable way. After much debugging with Peter Zijlstra it turned out that the problem is caused by the string of sched_clock() changes - they caused the CPU clock to jump backwards a bit - which confuses the scheduler arithmetics. (which is unsigned for performance reasons) So revert: # `c300ba2`: sched_clock: and multiplier for TSC to gtod drift # `c0c8773`: sched_clock: only update deltas with local reads. # `af52a90`: sched_clock: stop maximum check on NO HZ # `f7cce27`: sched_clock: widen the max and min time This solves the interactivity problems. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de>	2008-07-31 17:20:29 +02:00
Ingo Molnar	15dd859cac	Merge commit 'v2.6.27-rc1' into x86/core Conflicts: include/asm-x86/dma-mapping.h include/asm-x86/namei.h include/asm-x86/uaccess.h Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-30 19:33:48 +02:00
Andi Kleen	f718cd4add	sched: make scheduler sysfs attributes sysdev class devices They are really class devices, but were incorrectly declared. This leads to crashes with the recent changes that makes non normal sysdevs use a different prototype. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pierre Ossman <drzeus-list@drzeus.cx> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:47 -07:00
Oleg Nesterov	6af8bf3d86	workqueues: add comments to __create_workqueue_key() Dmitry Adamushko pointed out that the error handling in __create_workqueue_key() is not clear, add the comment. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:47 -07:00
Uwe Kleine-König	641de9d8f5	printk: fix comment for printk ratelimiting The comment assumed the burst to be one and the ratelimit used to be named printk_ratelimit_jiffies. Signed-off-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:45 -07:00
Mathieu Desnoyers	5def9a3a22	markers: fix markers read barrier for multiple probes Paul pointed out two incorrect read barriers in the marker handler code in the path where multiple probes are connected. Those are ordering reads of "ptype" (single or multi probe marker), "multi" array pointer, and "multi" array data access. It should be ordered like this : read ptype smp_rmb() read multi array pointer smp_read_barrier_depends() access data referenced by multi array pointer The code with a single probe connected (optimized case, does not have to allocate an array) has correct memory ordering. It applies to kernel 2.6.26.x, 2.6.25.x and linux-next. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:45 -07:00
Li Zefan	aeed682421	cpuset: clean up cpuset hierarchy traversal code Use cpuset.stack_list rather than kfifo, so we avoid memory allocation for kfifo. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:44 -07:00
Li Zefan	93a6557558	cpuset: fix wrong calculation of relax domain level When multiple cpusets are overlapping in their 'cpus' and hence they form a single sched domain, the largest sched_relax_domain_level among those should be used. But when top_cpuset's sched_load_balance is set, its sched_relax_domain_level is used regardless other sub-cpusets'. This patch fixes it by walking the cpuset hierarchy to find the largest sched_relax_domain_level. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:44 -07:00
Lai Jiangshan	f5393693e9	cpuset: speed up sched domain partition All child cpusets contain a subset of the parent's cpus, so we can skip them when partitioning sched domains. This decreases 'csa' greately for cpusets with multi-level hierarchy. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:44 -07:00
Li Zefan	8d1e6266f5	cpuset: a bit cleanup for scan_for_empty_cpusets() clean up hierarchy traversal code Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Cliff Wickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:44 -07:00
Li Zefan	55b6fd0162	cgroup: uninline cgroup_has_css_refs() It's not small enough, and has 2 call sites. text data bss dec hex filename 12813 1676 4832 19321 4b79 cgroup.o.orig 12775 1676 4832 19283 4b53 cgroup.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:44 -07:00
Li Zefan	36553434f4	cgroup: remove duplicate code in allocate_cg_link() - just call free_cg_links() in allocate_cg_links() - the list will get initialized in allocate_cg_links(), so don't init it twice Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:44 -07:00
Li Zefan	5a3eb9f6b7	cgroup: fix possible memory leak There's a leak if copy_from_user() returns failure. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:44 -07:00
Magnus Damm	1a4e564b7d	resource: add resource_size() Avoid one-off errors by introducing a resource_size() function. Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: Ben Dooks <ben-linux@fluff.org> Cc: Jean Delvare <khali@linux-fr.org> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-30 09:41:43 -07:00
Ingo Molnar	39675e89fb	Merge branch 'sched/urgent' into sched/clock	2008-07-30 10:38:30 +02:00
Linus Torvalds	1d9b9f6a53	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (21 commits) x86/PCI: use dev_printk when possible PCI: add D3 power state avoidance quirk PCI: fix bogus "'device' may be used uninitialized" warning in pci_slot PCI: add an option to allow ASPM enabled forcibly PCI: disable ASPM on pre-1.1 PCIe devices PCI: disable ASPM per ACPI FADT setting PCI MSI: Don't disable MSIs if the mask bit isn't supported PCI: handle 64-bit resources better on 32-bit machines PCI: rewrite PCI BAR reading code PCI: document pci_target_state PCI hotplug: fix typo in pcie hotplug output x86 gart: replace to_pages macro with iommu_num_pages x86, AMD IOMMU: replace to_pages macro with iommu_num_pages iommu: add iommu_num_pages helper function dma-coherent: add documentation to new interfaces Cris: convert to using generic dma-coherent mem allocator Sh: use generic per-device coherent dma allocator ARM: support generic per-device coherent dma mem Generic dma-coherent: fix DMA_MEMORY_EXCLUSIVE x86: use generic per-device dma coherent allocator ...	2008-07-28 18:14:24 -07:00
Andrea Arcangeli	cddb8a5c14	mmu-notifiers: core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) mm_take_all_locks() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken too. 2) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_take_all_locks may be interrupted by a signal and return -EINTR. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm kvm_arch_create_vm(void) { struct kvm kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. The patch also adds a few needed but missing includes that would prevent kernel to compile after these changes on non-x86 archs (x86 didn't need them by luck). [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix mm/filemap_xip.c build] [akpm@linux-foundation.org: fix mm/mmu_notifier.c build] Signed-off-by: Andrea Arcangeli <andrea@qumranet.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kanoj Sarcar <kanojsarcar@yahoo.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Avi Kivity <avi@qumranet.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Chris Wright <chrisw@redhat.com> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Eric Dumazet <dada1@cosmosbay.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Izik Eidus <izike@qumranet.com> Cc: Anthony Liguori <aliguori@us.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-28 16:30:21 -07:00
Ingo Molnar	cb28a1bbdb	Merge branch 'linus' into core/generic-dma-coherent Conflicts: arch/x86/Kconfig Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-29 00:07:55 +02:00
Ingo Molnar	9e3ee1c39c	Merge branch 'linus' into cpus4096 Conflicts: kernel/stop_machine.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-28 23:32:00 +02:00
Linus Torvalds	e56b3bc794	cpu masks: optimize and clean up cpumask_of_cpu() Clean up and optimize cpumask_of_cpu(), by sharing all the zero words. Instead of stupidly generating all possible i=0...NR_CPUS 2^i patterns creating a huge array of constant bitmasks, realize that the zero words can be shared. In other words, on a 64-bit architecture, we only ever need 64 of these arrays - with a different bit set in one single world (with enough zero words around it so that we can create any bitmask by just offsetting in that big array). And then we just put enough zeroes around it that we can point every single cpumask to be one of those things. So when we have 4k CPU's, instead of having 4k arrays (of 4k bits each, with one bit set in each array - 2MB memory total), we have exactly 64 arrays instead, each 8k bits in size (64kB total). And then we just point cpumask(n) to the right position (which we can calculate dynamically). Once we have the right arrays, getting "cpumask(n)" ends up being: static inline const cpumask_t get_cpu_mask(unsigned int cpu) { const unsigned long p = cpu_bit_bitmap[1 + cpu % BITS_PER_LONG]; p -= cpu / BITS_PER_LONG; return (const cpumask_t *)p; } This brings other advantages and simplifications as well: - we are not wasting memory that is just filled with a single bit in various different places - we don't need all those games to re-create the arrays in some dense format, because they're already going to be dense enough. if we compile a kernel for up to 4k CPU's, "wasting" that 64kB of memory is a non-issue (especially since by doing this "overlapping" trick we probably get better cache behaviour anyway). [ mingo@elte.hu: Converted Linus's mails into a commit. See: http://lkml.org/lkml/2008/7/27/156 http://lkml.org/lkml/2008/7/28/320 Also applied a family filter - which also has the side-effect of leaving out the bits where Linus calls me an idio... Oh, never mind ;-) ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-28 22:20:41 +02:00
Ingo Molnar	414f746d23	Merge branch 'linus' into cpus4096	2008-07-28 21:14:43 +02:00
Randy Dunlap	0e241ffd30	locking: fix mutex @key parameter kernel-doc notation Fix @key parameter to mutex_init() and one of its callers. Warning(linux-2.6.26-git11//drivers/base/class.c:210): No description found for parameter 'key' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-28 18:12:36 +02:00
Linus Torvalds	37eaf8c746	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: stop_machine: fix up ftrace.c stop_machine: Wean existing callers off stop_machine_run() stop_machine(): stop_machine_run() changed to use cpu mask Hotplug CPU: don't check cpu_online after take_cpu_down Simplify stop_machine stop_machine: add ALL_CPUS option module: fix build warning with !CONFIG_KALLSYMS	2008-07-28 08:37:46 -07:00
Hugh Dickins	2c3d103ba9	sched: move sched_clock before first use Move sched_clock() up to stop warning: weak declaration of `sched_clock' after first use results in unspecified behavior (if -fno-unit-at-a-time). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Mike Travis <travis@sgi.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Linuxppc-dev@ozlabs.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-28 16:35:03 +02:00
roel kluin	e26873bb10	sched: test runtime rather than period in global_rt_runtime() Test runtime rather than period Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-28 15:57:24 +02:00
OGAWA Hirofumi	94f5655988	sched: fix SCHED_HRTICK dependency Currently, it seems SCHED_HRTICK allowed for !SMP. But, it seems to have no dependency of it. Fix it. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-28 14:37:38 +02:00
Peter Zijlstra	157124c11f	sched: fix warning in hrtick_start_fair() Benjamin Herrenschmidt reported: > I get that on ppc64 ... > > In file included from kernel/sched.c:1595: > kernel/sched_fair.c: In function ‘hrtick_start_fair’: > kernel/sched_fair.c:902: warning: comparison of distinct pointer types lacks a cast > > Probably harmless but annoying. s64 delta = slice - ran; --> delta = max(10000LL, delta); Probably ppc64's s64 is long vs long long.. I think hpa was looking at sanitizing all these 64bit types across the architectures. Use max_t with an explicit type meanwhile. Reported-by: Benjamin Herrenschmid <benh@kernel.crashing.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-28 12:01:58 +02:00
Rusty Russell	784e2d7600	stop_machine: fix up ftrace.c Simple conversion. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Abhishek Sagar <sagar.abhishek@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org>	2008-07-28 12:16:31 +10:00
Rusty Russell	9b1a4d3837	stop_machine: Wean existing callers off stop_machine_run() Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-07-28 12:16:31 +10:00
Rusty Russell	eeec4fad96	stop_machine(): stop_machine_run() changed to use cpu mask Instead of a "cpu" arg with magic values NR_CPUS (any cpu) and ~0 (all cpus), pass a cpumask_t. Allow NULL for the common case (where we don't care which CPU the function is run on): temporary cpumask_t's are usually considered bad for stack space. This deprecates stop_machine_run, to be removed soon when all the callers are dead. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-07-28 12:16:30 +10:00
Rusty Russell	0432158758	Hotplug CPU: don't check cpu_online after take_cpu_down Akinobu points out that if take_cpu_down() succeeds, the cpu must be offline. Remove the cpu_online() check, and put a BUG_ON(). Quoting Akinobu Mita: Actually the cpu_online() check was necessary before appling this stop_machine: simplify patch. With old __stop_machine_run(), __stop_machine_run() could succeed (return !IS_ERR(p) value) even if take_cpu_down() returned non-zero value. The return value of take_cpu_down() was obtained through kthread_stop().. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: "Akinobu Mita" <akinobu.mita@gmail.com>	2008-07-28 12:16:29 +10:00
Rusty Russell	ffdb5976c4	Simplify stop_machine stop_machine creates a kthread which creates kernel threads. We can create those threads directly and simplify things a little. Some care must be taken with CPU hotunplug, which has special needs, but that code seems more robust than it was in the past. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>	2008-07-28 12:16:29 +10:00
Jason Baron	5c2aed6225	stop_machine: add ALL_CPUS option -allow stop_mahcine_run() to call a function on all cpus. Calling stop_machine_run() with a 'ALL_CPUS' invokes this new behavior. stop_machine_run() proceeds as normal until the calling cpu has invoked 'fn'. Then, we tell all the other cpus to call 'fn'. Signed-off-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> CC: Adrian Bunk <bunk@stusta.de> CC: Andi Kleen <andi@firstfloor.org> CC: Alexey Dobriyan <adobriyan@gmail.com> CC: Christoph Hellwig <hch@infradead.org> CC: mingo@elte.hu CC: akpm@osdl.org	2008-07-28 12:16:28 +10:00
WANG Cong	15bba37d62	module: fix build warning with !CONFIG_KALLSYMS This patch fixed the warning: CC kernel/module.o /home/wangcong/Projects/linux-2.6/kernel/module.c:332: warning: ‘lookup_symbol’ defined but not used Signed-off-by: WANG Cong <wangcong@zeuux.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2008-07-28 12:16:28 +10:00
Andrea Righi	940389b8af	task IO accounting: move all IO statistics in struct task_io_accounting Simplify the code of include/linux/task_io_accounting.h. It is also more reasonable to have all the task i/o-related statistics in a single struct (task_io_accounting). Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-27 16:12:28 -07:00
Ingo Molnar	2106b531ea	Merge branch 'timers/urgent' of ssh://master.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into timers/urgent	2008-07-27 23:15:26 +02:00
Andrea Righi	5995477ab7	task IO accounting: improve code readability Put all i/o statistics in struct proc_io_accounting and use inline functions to initialize and increment statistics, removing a lot of single variable assignments. This also reduces the kernel size as following (with CONFIG_TASK_XACCT=y and CONFIG_TASK_IO_ACCOUNTING=y). text data bss dec hex filename 11651 0 0 11651 2d83 kernel/exit.o.before 11619 0 0 11619 2d63 kernel/exit.o.after 10886 132 136 11154 2b92 kernel/fork.o.before 10758 132 136 11026 2b12 kernel/fork.o.after 3082029 807968 4818600 8708597 84e1f5 vmlinux.o.before 3081869 807968 4818600 8708437 84e155 vmlinux.o.after Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-27 09:58:20 -07:00
Andrea Righi	605ccb73f6	tracing: remove unused variable Remove the following warning with CONFIG_TRACING=y: kernel/trace/trace.c: In function ‘s_next’: kernel/trace/trace.c:1186: warning: unused variable ‘last_ent’ Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-27 09:58:20 -07:00
Al Viro	bfbcf03479	lost sysctl fix try_attach() should walk into the matching subdirectory, not the first one... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Valdis.Kletnieks@vt.edu Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-27 09:45:34 -07:00
Al Viro	3f8206d496	[PATCH] get rid of indirect users of namei.h fs.h needs path.h, not namei.h; nfs_fs.h doesn't need it at all. Several places in the tree needed direct include. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:42 -04:00
Al Viro	7f2da1e7d0	[PATCH] kill altroot long overdue... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:20 -04:00
Al Viro	e6305c43ed	[PATCH] sanitize ->permission() prototype * kill nameidata * argument; map the 3 bits in ->flags anybody cares about to new MAY_... ones and pass with the mask. * kill redundant gfs2_iop_permission() * sanitize ecryptfs_permission() * fix remaining places where ->permission() instances might barf on new MAY_... found in mask. The obvious next target in that direction is permission(9) folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:14 -04:00
Al Viro	9043476f72	[PATCH] sanitize proc_sysctl * keep references to ctl_table_head and ctl_table in /proc/sys inodes * grab the former during operations, use the latter for access to entry if that succeeds * have ->d_compare() check if table should be seen for one who does lookup; that allows us to avoid flipping inodes - if we have the same name resolve to different things, we'll just keep several dentries and ->d_compare() will reject the wrong ones. * have ->lookup() and ->readdir() scan the table of our inode first, then walk all ctl_table_header and scan ->attached_by for those that are attached to our directory. * implement ->getattr(). * get rid of insane amounts of tree-walking * get rid of the need to know dentry in ->permission() and of the contortions induced by that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:12 -04:00
Al Viro	ae7edecc9b	[PATCH] sysctl: keep track of tree relationships In a sense, that's the heart of the series. It's based on the following property of the trees we are actually asked to add: they can be split into stem that is already covered by registered trees and crown that is entirely new. IOW, if a/b and a/c/d are introduced by our tree, then a/c is also introduced by it. That allows to associate tree and table entry with each node in the union; while directory nodes might be covered by many trees, only one will cover the node by its crown. And that will allow much saner logics for /proc/sys in the next patches. This patch introduces the data structures needed to keep track of that. When adding a sysctl table, we find a "parent" one. Which is to say, find the deepest node on its stem that already is present in one of the tables from our table set or its ancestor sets. That table will be our parent and that node in it - attachment point. Add our table to list anchored in parent, have it refer the parent and contents of attachment point. Also remember where its crown lives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:11 -04:00
Al Viro	f7e6ced406	[PATCH] allow delayed freeing of ctl_table_header Refcount the sucker; instead of freeing it by the end of unregistration just drop the refcount and free only when it hits zero. Make sure that we _always_ make ->unregistering non-NULL in start_unregistering(). That allows anybody to get a reference to such puppy, preventing its freeing and reuse. It does not block unregistration. Anybody who holds such a reference can * try to grab a "use" reference (ctl_head_grab()); that will succeeds if and only if it hadn't entered unregistration yet. If it succeeds, we can use it in all normal ways until we release the "use" reference (with ctl_head_finish()). Note that this relies on having ->unregistering become non-NULL in all cases when one starts to unregister the sucker. * keep pointers to ctl_table entries; they can be freed if the entire thing is unregistered. However, if ctl_head_grab() succeeds, we know that unregistration had not happened (and will not happen until ctl_head_finish()) and such pointers can be used safely. IOW, now we can have inodes under /proc/sys keep references to ctl_table entries, protecting them with references to ctl_table_header and grabbing the latter for the duration of operations that require access to ctl_table. That won't cause deadlocks, since unregistration will not be stopped by mere keeping a reference to ctl_table_header. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:09 -04:00
Al Viro	734550921e	[PATCH] beginning of sysctl cleanup - ctl_table_set New object: set of sysctls [currently - root and per-net-ns]. Contains: pointer to parent set, list of tables and "should I see this set?" method (->is_seen(set)). Current lists of tables are subsumed by that; net-ns contains such a beast. ->lookup() for ctl_table_root returns pointer to ctl_table_set instead of that to ->list of that ctl_table_set. [folded compile fixes by rdd for configs without sysctl] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:08 -04:00
Linus Torvalds	a048d3aff8	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ftrace: fix modular build ftrace: disable tracing on acpi idle calls ftrace: remove latency-tracer leftover ftrace: only trace preempt off with preempt tracer ftrace: fix `4d3702b6` (post-v2.6.26): WARNING: at kernel/lockdep.c:2731 check_flags (ftrace)	2008-07-26 13:25:47 -07:00
Adrian Bunk	96930a6365	make cgroup_seqfile_release() static cgroup_seqfile_release() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:11 -07:00
Roland McGrath	85ba2d862e	tracehook: wait_task_inactive This extends wait_task_inactive() with a new argument so it can be used in a "soft" mode where it will check for the task changing state unexpectedly and back off. There is no change to existing callers. This lays the groundwork to allow robust, noninvasive tracing that can try to sample a blocked thread but back off safely if it wakes up. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:09 -07:00
Roland McGrath	b787f7ba67	tracehook: force signal_pending() This defines a new hook tracehook_force_sigpending() that lets tracing code decide to force TIF_SIGPENDING on in recalc_sigpending(). This is not used yet, so it compiles away to nothing for now. It lays the groundwork for new tracing code that can interrupt a task synthetically without actually sending a signal. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:09 -07:00
Roland McGrath	2b2a1ff64a	tracehook: death This moves the ptrace logic in task death (exit_notify) into tracehook.h inlines. Some code is rearranged slightly to make things nicer. There is no change, only cleanup. There is one hook called with the tasklist_lock write-locked, as ptrace needs. There is also a new hook called after exit_state changes and without locks. This is a better place for tracing work to be in the future, since it doesn't delay the whole system with locking. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:09 -07:00
Roland McGrath	fa00b80b3c	tracehook: job control This defines the tracehook_notify_jctl() hook to formalize the ptrace effects on the job control notifications. There is no change, only cleanup. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:09 -07:00
Roland McGrath	7bcf6a2ca5	tracehook: get_signal_to_deliver This defines the tracehook_get_signal() hook to allow tracing code to slip in before normal signal dequeuing. This lays the groundwork for new tracing features that can inject synthetic signals outside the normal queue or control the disposition of delivered signals. The calling convention lets tracehook_get_signal() decide both exactly what will happen and what signal number to report in the handler/exit. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:09 -07:00
Roland McGrath	445a91d2fe	tracehook: tracehook_consider_fatal_signal This defines tracehook_consider_fatal_signal() has a fine-grained hook for deciding to skip the special cases for a fatal signal, as ptrace does. There is no change, only cleanup. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:09 -07:00
Roland McGrath	35de254dc6	tracehook: tracehook_consider_ignored_signal This defines tracehook_consider_ignored_signal() has a fine-grained hook for deciding to prevent the normal short-circuit of sending an ignored signal, as ptrace does. There is no change, only cleanup. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:09 -07:00
Roland McGrath	dae33574dc	tracehook: release_task This moves the ptrace-related logic from release_task into tracehook.h and ptrace.h inlines. It provides clean hooks both before and after locking tasklist_lock, for future tracing logic to do more cleanup without the lock. This also changes release_task() itself in the rare "zap_leader" case to set the leader to EXIT_DEAD before iterating. This maintains the invariant that release_task() only ever handles a task in EXIT_DEAD. This is a common-sense invariant that is already always true except in this one arcane case of zombie leader whose parent ignores SIGCHLD. This change is harmless and only costs one store in this one rare case. It keeps the expected state more consisently sane, which is nicer when debugging weirdness in release_task(). It also lets some future code in the tracehook entry points rely on this invariant for bookkeeping. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:08 -07:00
Roland McGrath	daded34be9	tracehook: vfork-done This moves the PTRACE_EVENT_VFORK_DONE tracing into a tracehook.h inline, tracehook_report_vfork_done(). The change has no effect, just clean-up. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:08 -07:00
Roland McGrath	09a05394fe	tracehook: clone This moves all the ptrace initialization and tracing logic for task creation into tracehook.h and ptrace.h inlines. It reorganizes the code slightly, but should not change any behavior. There are four tracehook entry points, at each important stage of task creation. This keeps the interface from the core fork.c code fairly clean, while supporting the complex setup required for ptrace or something like it. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:08 -07:00
Roland McGrath	30199f5a46	tracehook: exit This moves the PTRACE_EVENT_EXIT tracing into a tracehook.h inline, tracehook_report_exec(). The change has no effect, just clean-up. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:08 -07:00
Roland McGrath	ff1188646c	tracehook: unexport ptrace_notify The ptrace_notify() function should not be called by any modules. It was only ever exported to be called by binfmt exec functions. But that is no longer necessary since fs/exec.c deals with that generically now. There should be no calls to ptrace_notify() from outside the core kernel. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:08 -07:00
Arjan van de Ven	261c40c119	use WARN() in kernel/irq/chip.c Use WARN() instead of a printk+WARN_ON() pair; this way the message becomes part of the warning section for better reporting/collection. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:07 -07:00
Arjan van de Ven	b8c512f619	Use WARN() in kernel/irq/manage.c Replace a printk+WARN_ON() by a WARN(); this increases the chance of the string making it into the bugreport (ie: it goes inside the ---[ cut here ]--- section) Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:07 -07:00
Alexey Dobriyan	51cc50685a	SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:07 -07:00
Eduard - Gabriel Munteanu	20d8b67c06	relay: add buffer-only channels; useful for early logging Allows one to create and use a channel with no associated files. Files can be initialized later. This is useful in scenarios such as logging in early code, before VFS is up. Therefore, such channels can be created and used as soon as kmem_cache_init() completed. This is needed by kmemtrace to do tracing in early kernel code. [kosaki.motohiro@jp.fujitsu.com: build fix] Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:04 -07:00
Eduard - Gabriel Munteanu	7babe8db99	Full conversion to early_initcall() interface, remove old interface A previous patch added the early_initcall(), to allow a cleaner hooking of pre-SMP initcalls. Now we remove the older interface, converting all existing users to the new one. [akpm@linux-foundation.org: cleanups] [akpm@linux-foundation.org: build fix] [kosaki.motohiro@jp.fujitsu.com: warning fix] [kosaki.motohiro@jp.fujitsu.com: warning fix] Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:04 -07:00
Huang Ying	89081d17f7	kexec jump: save/restore device state This patch implements devices state save/restore before after kexec. This patch together with features in kexec_jump patch can be used for following: - A simple hibernation implementation without ACPI support. You can kexec a hibernating kernel, save the memory image of original system and shutdown the system. When resuming, you restore the memory image of original system via ordinary kexec load then jump back. - Kernel/system debug through making system snapshot. You can make system snapshot, jump back, do some thing and make another system snapshot. - Cooperative multi-kernel/system. With kexec jump, you can switch between several kernels/systems quickly without boot process except the first time. This appears like swap a whole kernel/system out/in. - A general method to call program in physical mode (paging turning off). This can be used to invoke BIOS code under Linux. The following user-space tools can be used with kexec jump: - kexec-tools needs to be patched to support kexec jump. The patches and the precompiled kexec can be download from the following URL: source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2 patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2 binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10 - makedumpfile with patches are used as memory image saving tool, it can exclude free pages from original kernel memory image file. The patches and the precompiled makedumpfile can be download from the following URL: source: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile-src_cvs_kh10.tar.bz2 patches: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile-patches_cvs_kh10.tar.bz2 binary: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile_cvs_kh10 - An initramfs image can be used as the root file system of kexeced kernel. An initramfs image built with "BuildRoot" can be downloaded from the following URL: initramfs image: http://khibernation.sourceforge.net/download/release_v10/initramfs/rootfs_cvs_kh10.gz All user space tools above are included in the initramfs image. Usage example of simple hibernation: 1. Compile and install patched kernel with following options selected: CONFIG_X86_32=y CONFIG_RELOCATABLE=y CONFIG_KEXEC=y CONFIG_CRASH_DUMP=y CONFIG_PM=y CONFIG_HIBERNATION=y CONFIG_KEXEC_JUMP=y 2. Build an initramfs image contains kexec-tool and makedumpfile, or download the pre-built initramfs image, called rootfs.gz in following text. 3. Prepare a partition to save memory image of original kernel, called hibernating partition in following text. 4. Boot kernel compiled in step 1 (kernel A). 5. In the kernel A, load kernel compiled in step 1 (kernel B) with /sbin/kexec. The shell command line can be as follow: /sbin/kexec --load-preserve-context /boot/bzImage --mem-min=0x100000 --mem-max=0xffffff --initrd=rootfs.gz 6. Boot the kernel B with following shell command line: /sbin/kexec -e 7. The kernel B will boot as normal kexec. In kernel B the memory image of kernel A can be saved into hibernating partition as follow: jump_back_entry=`cat /proc/cmdline \| tr ' ' '\n' \| grep kexec_jump_back_entry \| cut -d '='` echo $jump_back_entry > kexec_jump_back_entry cp /proc/vmcore dump.elf Then you can shutdown the machine as normal. 8. Boot kernel compiled in step 1 (kernel C). Use the rootfs.gz as root file system. 9. In kernel C, load the memory image of kernel A as follow: /sbin/kexec -l --args-none --entry=`cat kexec_jump_back_entry` dump.elf 10. Jump back to the kernel A as follow: /sbin/kexec -e Then, kernel A is resumed. Implementation point: To support jumping between two kernels, before jumping to (executing) the new kernel and jumping back to the original kernel, the devices are put into quiescent state, and the state of devices and CPU is saved. After jumping back from kexeced kernel and jumping to the new kernel, the state of devices and CPU are restored accordingly. The devices/CPU state save/restore code of software suspend is called to implement corresponding function. Known issues: - Because the segment number supported by sys_kexec_load is limited, hibernation image with many segments may not be load. This is planned to be eliminated by adding a new flag to sys_kexec_load to make a image can be loaded with multiple sys_kexec_load invoking. Now, only the i386 architecture is supported. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:04 -07:00
Huang Ying	3ab8352137	kexec jump This patch provides an enhancement to kexec/kdump. It implements the following features: - Backup/restore memory used by the original kernel before/after kexec. - Save/restore CPU state before/after kexec. The features of this patch can be used as a general method to call program in physical mode (paging turning off). This can be used to call BIOS code under Linux. kexec-tools needs to be patched to support kexec jump. The patches and the precompiled kexec can be download from the following URL: source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2 patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2 binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10 Usage example of calling some physical mode code and return: 1. Compile and install patched kernel with following options selected: CONFIG_X86_32=y CONFIG_KEXEC=y CONFIG_PM=y CONFIG_KEXEC_JUMP=y 2. Build patched kexec-tool or download the pre-built one. 3. Build some physical mode executable named such as "phy_mode" 4. Boot kernel compiled in step 1. 5. Load physical mode executable with /sbin/kexec. The shell command line can be as follow: /sbin/kexec --load-preserve-context --args-none phy_mode 6. Call physical mode executable with following shell command line: /sbin/kexec -e Implementation point: To support jumping without reserving memory. One shadow backup page (source page) is allocated for each page used by kexeced code image (destination page). When do kexec_load, the image of kexeced code is loaded into source pages, and before executing, the destination pages and the source pages are swapped, so the contents of destination pages are backupped. Before jumping to the kexeced code image and after jumping back to the original kernel, the destination pages and the source pages are swapped too. C ABI (calling convention) is used as communication protocol between kernel and called code. A flag named KEXEC_PRESERVE_CONTEXT for sys_kexec_load is added to indicate that the loaded kernel image is used for jumping back. Now, only the i386 architecture is supported. Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:04 -07:00
WANG Cong	7fccf03265	kernel/kexec.c: make 'kimage_terminate' void Since kimage_terminate() always returns 0, make it void. Signed-off-by: WANG Cong <wangcong@zeuux.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:04 -07:00
David Brownell	a2e2e3577c	pm selftest: rtc paranoia Cope with a quirk of some RTCs (notably ACPI ones) which aren't guaranteed to implement oneshot behavior when they woke the system from sleeep: forcibly disable the alarm, just in case. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:02 -07:00
Ingo Molnar	5a7a201c51	cpumask: export cpumask_of_cpu_map fix: ERROR: "cpumask_of_cpu_map" [drivers/acpi/processor.ko] undefined! ERROR: "cpumask_of_cpu_map" [arch/x86/kernel/microcode.ko] undefined! ERROR: "cpumask_of_cpu_map" [arch/x86/kernel/cpu/cpufreq/speedstep-ich.ko] undefined! ERROR: "cpumask_of_cpu_map" [arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko] undefined! Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-26 16:48:59 +02:00
Mike Travis	0bc3cc03fa	cpumask: change cpumask_of_cpu_ptr to use new cpumask_of_cpu * Replace previous instances of the cpumask_of_cpu_ptr* macros with a the new (lvalue capable) generic cpumask_of_cpu(). Signed-off-by: Mike Travis <travis@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-26 16:40:33 +02:00
Mike Travis	6524d938b3	cpumask: put cpumask_of_cpu_map in the initdata section * Create the cpumask_of_cpu_map statically in the init data section using NR_CPUS but replace it during boot up with one sized by nr_cpu_ids (num possible cpus). Signed-off-by: Mike Travis <travis@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-26 16:40:33 +02:00
Mike Travis	b8d317d10c	cpumask: make cpumask_of_cpu_map generic If an arch doesn't define cpumask_of_cpu_map, create a generic statically-initialized one for them. This allows removal of the buggy cpumask_of_cpu() macro (&cpumask_of_cpu() gives address of out-of-scope var). An arch with NR_CPUS of 4096 probably wants to allocate this itself based on the actual number of CPUs, since otherwise they're using 2MB of rodata (1024 cpus means 128k). That's what CONFIG_HAVE_CPUMASK_OF_CPU_MAP is for (only x86/64 does so at the moment). In future as we support more CPUs, we'll need to resort to a get_cpu_map()/put_cpu_map() allocation scheme. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-26 16:40:32 +02:00
Ingo Molnar	6dec3a10a7	Merge branch 'x86/x2apic' into x86/core Conflicts: include/asm-x86/i8259.h include/asm-x86/msidef.h Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-26 16:29:23 +02:00
Ingo Molnar	1fe371044b	ftrace: fix modular build fix: ERROR: "start_critical_timings" [drivers/acpi/processor.ko] undefined! ERROR: "stop_critical_timings" [drivers/acpi/processor.ko] undefined! Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-26 15:08:22 +02:00
Ingo Molnar	9b81361631	signalfd: fix undefined reference to `compat_sys_signalfd4' when !CONFIG_SIGNALFD fix: arch/x86/ia32/built-in.o: In function `ia32_sys_call_table': (.rodata+0xa38): undefined reference to `compat_sys_signalfd4' on !CONFIG_SIGNALFD. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 11:35:41 -07:00
Vegard Nossum	b81f3ea92b	taskstats: remove initialization of static per-cpu variable Cc: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Keika Kobayashi	016ae219b9	per-task-delay-accounting: update taskstats for memory reclaim delay Add members for memory reclaim delay to taskstats, and accumulate them in __delayacct_add_tsk() . Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Keika Kobayashi	873b477177	per-task-delay-accounting: add memory reclaim delay Sometimes, application responses become bad under heavy memory load. Applications take a bit time to reclaim memory. The statistics, how long memory reclaim takes, will be useful to measure memory usage. This patch adds accounting memory reclaim to per-task-delay-accounting for accounting the time of do_try_to_free_pages(). <i.e> - When System is under low memory load, memory reclaim may not occur. $ free total used free shared buffers cached Mem: 8197800 1577300 6620500 0 4808 1516724 -/+ buffers/cache: 55768 8142032 Swap: 16386292 0 16386292 $ vmstat 1 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 0 5069748 10612 3014060 0 0 0 0 3 26 0 0 100 0 0 0 0 5069748 10612 3014060 0 0 0 0 4 22 0 0 100 0 0 0 0 5069748 10612 3014060 0 0 0 0 3 18 0 0 100 0 Measure the time of tar command. $ ls -s test.dat 1501472 test.dat $ time tar cvf test.tar test.dat real 0m13.388s user 0m0.116s sys 0m5.304s $ ./delayget -d -p <pid> CPU count real total virtual total delay total 428 5528345500 5477116080 62749891 IO count delay total 338 8078977189 SWAP count delay total 0 0 RECLAIM count delay total 0 0 - When system is under heavy memory load memory reclaim may occur. $ vmstat 1 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 7159032 49724 1812 3012 0 0 0 0 3 24 0 0 100 0 0 0 7159032 49724 1812 3012 0 0 0 0 4 24 0 0 100 0 0 0 7159032 49848 1812 3012 0 0 0 0 3 22 0 0 100 0 In this case, one process uses more 8G memory by execution of malloc() and memset(). $ time tar cvf test.tar test.dat real 1m38.563s <- increased by 85 sec user 0m0.140s sys 0m7.060s $ ./delayget -d -p <pid> CPU count real total virtual total delay total 9021 7140446250 7315277975 923201824 IO count delay total 8965 90466349669 SWAP count delay total 3 21036367 RECLAIM count delay total 740 61011951153 In the later case, the value of RECLAIM is increasing. So, taskstats can show how much memory reclaim influences TAT. Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujistu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
David Howells	3e85ba034d	tsacct: fix bacct_add_tsk()'s use of do_div() Fix bacct_add_tsk()'s use of do_div() on an s64 by making ac_etime a u64 instead and dividing that. Possibly this should be guarded lest the interval calculation turn up negative, but the possible negativity of the result of the division is cast away, and it shouldn't end up negative anyway. This was introduced by patch `f3cef7a994`. Signed-off-by: David Howells <dhowells@redhat.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Andrea Righi	297c5d9263	task IO accounting: provide distinct tgid/tid I/O statistics Report per-thread I/O statistics in /proc/pid/task/tid/io and aggregate parent I/O statistics in /proc/pid/io. This approach follows the same model used to account per-process and per-thread CPU times. As a practial application, this allows for example to quickly find the top I/O consumer when a process spawns many child threads that perform the actual I/O work, because the aggregated I/O statistics can always be found in /proc/pid/io. [ Oleg Nesterov points out that we should check that the task is still alive before we iterate over the threads, but also says that we can do that fixup on top of this later. - Linus ] Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrea Righi <righi.andrea@gmail.com> Cc: Matt Heaton <matt@hostmonster.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Acked-by-with-comments: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Pavel Emelyanov	0c18d7a5df	bsdacct: fix and add comments around acct_process() Fix the one describing what this function is and add one more - about locking absence around pid namespaces loop. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Pavel Emelyanov	7d1e13505b	bsdacct: account dying tasks in all relevant namespaces This just makes the acct_proces walk the pid namespaces from current up to the top and account a task in each with the accounting turned on. ns->parent access if safe lockless, since current it still alive and holds its namespace, which in turn holds its parent. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Pavel Emelyanov	b5a7174875	bsdacct: turn acct off for all pidns-s on umount time All the bsd_acct_strcts with opened accounting are linked into a global list. So, the acct_auto_close(_mnt) walks one and drops the accounting for each. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Pavel Emelyanov	0b6b030fc3	bsdacct: switch from global bsd_acct_struct instance to per-pidns one Allocate the structure on the first call to sys_acct(). After this each namespace, that ordered the accounting, will live with this structure till its own death. Two notes - routines, that close the accounting on fs umount time use the init_pid_ns's acct by now; - accounting routine accounts to dying task's namespace (also by now). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Pavel Emelyanov	6248b1b342	bsdacct: make internal code work with passed bsd_acct_struct, not global This adds the appropriate pointer to all the internal (i.e. static) functions that work with global acct instance. API calls pass a global instance to them (while we still have such). Mostly this is a s/acct_globals./acct->/ over the file. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Pavel Emelyanov	a75d979765	bsdacct: turn the acct_lock from on-the-struct to global Don't use per-bsd-acct-struct lock, but work with a global one. This lock is taken for short periods, so it doesn't seem it'll become a bottleneck, but it will allow us to easily avoid many locking difficulties in the future. So this is a mostly s/acct_globals.lock/acct_lock/ over the file. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:47 -07:00
Pavel Emelyanov	e59a04a7aa	bsdacct: make check timer accept a bsd_acct_struct argument We're going to have many bsd_acct_struct instances, not just one, so the timer (currently working with a global one) has to know which one to work with. Use a handy setup_timer macro for it (thanks to Oleg for one). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:46 -07:00
Pavel Emelyanov	1c552858ac	bsdacct: "truthify" a comment near acct_process The acct_process does not accept any arguments actually. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:46 -07:00
Pavel Emelyanov	84406c153a	pidns: use kzalloc when allocating new pid_namespace struct It makes many fields initialization implicit helping in auto-setting #ifdef-ed fields (bsd-acct related pointer will be such). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:46 -07:00
Pavel Emelyanov	081e4c8a75	bsdacct: rename acct_gbls to bsd_acct_struct After I fixed access to task->tgid in kernel/acct.c, Oleg pointed out some bad side effects with this accounting vs pid namespaces interaction. I.e. when some task in pid namespace sets this accounting up, this blocks all the others from doing the same. Restricting this to init namespace only could help, but didn't look a graceful solution. So here is the approach to make this accounting work with pid namespaces properly. The idea is simple - when a task dies it accounts itself in each namespace it is visible from and which set the accounting up. For example here are the commands run and the output of lastcomm from init and sub namespaces: init_ns# accton pacct sub_ns# accton pacct (this is a different file - sub ns is run in a chroot-ed environment) init_ns# cat /dev/null sub_ns# ls /dev/null init_ns# accton sub_ns# accton sub_ns# lastcomm -f pacct ls 0 [136,0] 0.00 secs Thu May 15 10:30 accton 0 [136,0] 0.00 secs Thu May 15 10:30 init_ns# lastcomm -f pacct accton root pts/0 0.00 secs Thu May 15 14:30 << got from sub cat root pts/1 0.00 secs Thu May 15 14:30 ls root pts/0 0.00 secs Thu May 15 14:30 << got from sub accton root pts/1 0.00 secs Thu May 15 14:30 That was the summary, the details are in patches. This patch: It will be visible in pid_namespace.h file, so fix its name to look better outside the acct.c file. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:46 -07:00
Jonathan Lim	49b5cf3472	accounting: account for user time when updating memory integrals Adapt acct_update_integrals() to include user time when calculating the time difference. The units of acct_rss_mem1 and acct_vm_mem1 are also changed from pages-jiffies to pages-usecs to avoid calling jiffies_to_usecs() in xacct_add_tsk() which might overflow. Signed-off-by: Jonathan Lim <jlim@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:46 -07:00
Adrian Bunk	7394f0f6c0	unexport uts_sem With the removal of the Solaris binary emulation the export of uts_sem became unused. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:45 -07:00
Harvey Harrison	a89cc1959d	markers: fix sparse integer as NULL pointer warning kernel/trace/trace_sysprof.c:164:20: warning: Using plain integer as NULL pointer Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:45 -07:00
Mathieu Desnoyers	28325df0d9	markers: use rcu_barrier_sched() and call_rcu_sched() rcu_barrier_sched() and call_rcu_sched() were introduced in 2.6.26 for the Markers. Change the marker code to use them. It can be seen as a fix since the marker code was using an ugly, temporary, #ifdef hack to work around CONFIG_PREEMPT_RCU. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Paul McKenney <paulmck@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:45 -07:00
Pavel Emelyanov	e49859e71e	pidns: remove now unused find_pid function. This one had the only users so far - the kill_proc, which is removed, so drop this (invalid in namespaced world) call too. And of course - erase all references on it from comments. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:45 -07:00
Pavel Emelyanov	19b0cfcca4	pidns: remove now unused kill_proc function This function operated on a pid_t to kill a task, which is no longer valid in a containerized system. It has finally lost all its users and we can safely remove it from the tree. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:45 -07:00
Alexey Dobriyan	99541c23cd	sysctl: check for bogus modes Catch, e. g., 644/0644 typo. Signed-off-by: Alexey Dobriyan <adobriyan@parallels.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:45 -07:00
David Sterba	339caf2a22	proc: misplaced export of find_get_pid Move EXPORT_SYMBOL right after the func Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:45 -07:00
Oleg Nesterov	8448502cfc	workqueues: do CPU_UP_CANCELED if CPU_UP_PREPARE fails The bug was pointed out by Akinobu Mita <akinobu.mita@gmail.com>, and this patch is based on his original patch. workqueue_cpu_callback(CPU_UP_PREPARE) expects that if it returns NOTIFY_BAD, _cpu_up() will send CPU_UP_CANCELED then. However, this is not true since "cpu hotplug: cpu: deliver CPU_UP_CANCELED only to NOTIFY_OKed callbacks with CPU_UP_PREPARE" commit: `a0d8cdb652` The callback which has returned NOTIFY_BAD will not receive CPU_UP_CANCELED. Change the code to fulfil the CPU_UP_CANCELED logic if CPU_UP_PREPARE fails. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:41 -07:00
Oleg Nesterov	8de6d308ba	workqueues: schedule_on_each_cpu() can use schedule_work_on() schedule_on_each_cpu() can use schedule_work_on() to avoid the code duplication. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	ef1ca236b8	workqueues: queue_work() can use queue_work_on() queue_work() can use queue_work_on() to avoid the code duplication. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	a67da70dc0	workqueues: lockdep annotations for flush_work() Add lockdep annotations to flush_work() and update the comment. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jarek Poplawski <jarkao2@o2.pl> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	3da1c84c00	workqueues: make get_online_cpus() useable for work->func() workqueue_cpu_callback(CPU_DEAD) flushes cwq->thread under cpu_maps_update_begin(). This means that the multithreaded workqueues can't use get_online_cpus() due to the possible deadlock, very bad and very old problem. Introduce the new state, CPU_POST_DEAD, which is called after cpu_hotplug_done() but before cpu_maps_update_done(). Change workqueue_cpu_callback() to use CPU_POST_DEAD instead of CPU_DEAD. This means that create/destroy functions can't rely on get_online_cpus() any longer and should take cpu_add_remove_lock instead. [akpm@linux-foundation.org: fix CONFIG_SMP=n] Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Gautham R Shenoy <ego@in.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	8616a89ab7	workqueues: schedule_on_each_cpu: use flush_work() Change schedule_on_each_cpu() to use flush_work() instead of flush_workqueue(), this way we don't wait for other work_struct's which can be queued meanwhile. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jarek Poplawski <jarkao2@gmail.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	db70089722	workqueues: implement flush_work() Most of users of flush_workqueue() can be changed to use cancel_work_sync(), but sometimes we really need to wait for the completion and cancelling is not an option. schedule_on_each_cpu() is good example. Add the new helper, flush_work(work), which waits for the completion of the specific work_struct. More precisely, it "flushes" the result of of the last queue_work() which is visible to the caller. For example, this code queue_work(wq, work); /* WINDOW / queue_work(wq, work); flush_work(work); doesn't necessary work "as expected". What can happen in the WINDOW above is - wq starts the execution of work->func() - the caller migrates to another CPU now, after the 2nd queue_work() this work is active on the previous CPU, and at the same time it is queued on another. In this case flush_work(work) may return before the first work->func() completes. It is trivial to add another helper int flush_work_sync(struct work_struct work) { return flush_work(work) \|\| wait_on_work(work); } which works "more correctly", but it has to iterate over all CPUs and thus it much slower than flush_work(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Max Krasnyansky <maxk@qualcomm.com> Acked-by: Jarek Poplawski <jarkao2@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	1a4d9b0aa0	workqueues: insert_work: use "list_head " instead of "int tail" insert_work() inserts the new work_struct before or after cwq->worklist, depending on the "int tail" parameter. Change it to accept "list_head " instead, this shrinks .text a bit and allows us to insert the barrier after specific work_struct. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Jarek Poplawski <jarkao2@gmail.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	a94e2d408e	coredump: kill mm->core_done Now that we have core_state->dumper list we can use it to wake up the sub-threads waiting for the coredump completion. This uglifies the code and .text grows by 47 bytes, but otoh mm_struct lessens by sizeof(struct completion). Also, with this change we can decouple exit_mm() from the coredumping code. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	b564daf806	coredump: construct the list of coredumping threads at startup time binfmt->core_dump() has to iterate over the all threads in system in order to find the coredumping threads and construct the list using the GFP_ATOMIC allocations. With this patch each thread allocates the list node on exit_mm()'s stack and adds itself to the list. This allows us to do further changes: - simplify ->core_dump() - change exit_mm() to clear ->mm first, then wait for ->core_done. this makes the coredumping process visible to oom_kill - kill mm->core_done Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:40 -07:00
Oleg Nesterov	c5f1cc8c18	coredump: turn core_state->nr_threads into atomic_t Turn core_state->nr_threads into atomic_t and kill now unneeded down_write(&mm->mmap_sem) in exit_mm(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	999d9fc167	coredump: move mm->core_waiters into struct core_state Move mm->core_waiters into "struct core_state" allocated on stack. This shrinks mm_struct a little bit and allows further changes. This patch mostly does s/core_waiters/core_state. The only essential change is that coredump_wait() must clear mm->core_state before return. The coredump_wait()'s path is uglified and .text grows by 30 bytes, this is fixed by the next patch. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	32ecb1f26d	coredump: turn mm->core_startup_done into the pointer to struct core_state mm->core_startup_done points to "struct completion startup_done" allocated on the coredump_wait()'s stack. Introduce the new structure, core_state, which holds this "struct completion". This way we can add more info visible to the threads participating in coredump without enlarging mm_struct. No changes in affected .o files. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	246bb0b1de	kill PF_BORROWED_MM in favour of PF_KTHREAD Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users. No functional changes yet. But this allows us to do further fixes/cleanups. oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the kthreads, this is wrong because of use_mm(). The problem with PF_BORROWED_MM is that we need task_lock() to avoid races. With this patch we can check PF_KTHREAD directly, or use a simple lockless helper: /* The result must not be dereferenced !!! / struct mm_struct __get_task_mm(struct task_struct *tsk) { if (tsk->flags & PF_KTHREAD) return NULL; return tsk->mm; } Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel thread without PF_BORROWED_MM. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	7b34e4283c	introduce PF_KTHREAD flag Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set by INIT_TASK() and copied to the forked childs (we could set it in kthreadd() along with PF_NOFREEZE instead). daemonize() was changed as well. In that case testing of PF_KTHREAD is racy, but daemonize() is hopeless anyway. This flag is cleared in do_execve(), before search_binary_handler(). Probably not the best place, we can do this in exec_mmap() or in start_thread(), or clear it along with PF_FORKNOEXEC. But I think this doesn't matter in practice, and if do_execve() fails kthread should die soon. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	3d749b9e67	ptrace: simplify ptrace_stop()->sigkill_pending() path 1. SIGKILL can't be blocked, remove this check from sigkill_pending(). 2. When ptrace_stop() sees sigkill_pending() == T, it can just return. Kill "int killed" and simplify the code. This also is more correct, the tracer shouldn't see us in TASK_TRACED if we are not going to stop. I strongly believe this code needs further changes. We should do the "was this task killed" check unconditionally, currently it depends on arch_ptrace_stop_needed(). On the other hand, sigkill_pending() isn't very clever. If the task was killed tkill(SIGKILL), the signal can be already dequeued if the caller is do_exit(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Gustavo Fernando Padovan	bc64efd220	kernel/signal.c: change vars pid and tgid types to pid_t Change the type of pid and tgid variables from int to the POSIX type pid_t. Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Michael Kerrisk	d8878ba3f0	signals: make siginfo_t si_utime + si_sstime report times in USER_HZ, not HZ In the switch to configurable HZ in 2.6, the treatment of the si_utime and si_stime fields that are exposed to userland via the siginfo structure looks to have been botched. As things stand, these fields report times in units of HZ, so that userland gets information that varies depending on the HZ that the kernel was configured with. This patch changes the reported values to use USER_HZ units. Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:39 -07:00
Oleg Nesterov	2b201a9edd	signals: do_signal_stop: kill the SIGNAL_UNKILLABLE check `fae5fa44f1` changed do_signal_stop() to check SIGNAL_UNKILLABLE, this wasn't needed. If signal_group_exit() == F, the signal sent to SIGNAL_UNKILLABLE task must be already filtered out by the caller, get_signal_to_deliver(). And if signal_group_exit() == T we are not going to stop. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Oleg Nesterov	92413d771e	signals: dequeue_signal: don't check SIGNAL_GROUP_EXIT when setting SIGNAL_STOP_DEQUEUED dequeue_signal() checks SIGNAL_GROUP_EXIT before setting SIGNAL_STOP_DEQUEUED. This was added by `788e05a67c` a long ago to avoid the coredump/SIGSTOP race. Since then the related code was changed, and now this subtle check is both incomplete and unneeded at the same time. It is incomplete because nowadays exec() doesn't set SIGNAL_GROUP_EXIT, so in fact we should check signal_group_exit() to avoid a similar race. Fortunately, we doesn't need the check at all. The only function which relies on SIGNAL_STOP_DEQUEUED is do_signal_stop(), and it ignores this flag if signal_group_exit() == T, this covers the SIGNAL_GROUP_EXIT case. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Oleg Nesterov	3854a77182	__exit_signal: don't take rcu lock There is no reason for rcu_read_lock() in __exit_signal(). tsk->sighand can only be changed if tsk does exec, obviously this is not possible. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Oleg Nesterov	100360f030	signals: change collect_signal() to return void With the recent changes collect_signal() always returns true. Change it to return void and update the single caller. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Oleg Nesterov	d443420761	signals: collect_signal: simplify the "still_pending" logic Factor out sigdelset() calls and remove the "still_pending" variable. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Oleg Nesterov	6715ca451c	signals: collect_signal: remove the unneeded sigismember() check collect_signal() checks sigismember(&list->signal, sig), this is not needed. This "sig" was just found by next_signal(), so it must be valid. We have a (completely broken) call to ->notifier in between, but it must not play with sigpending->signal bits or unlock ->siglock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Oleg Nesterov	96347e7759	posix timers: release_posix_timer: kill the bogus put_task_struct(->it_process); release_posix_timer() can't be called with ->it_process != NULL. Once sys_timer_create() sets ->it_process it must not call release_posix_timer(), otherwise we can race with another thread doing sys_timer_delete(), this timer is visible to idr_find() and unlocked. The same is true for two other callers (actually, for any possible caller), sys_timer_delete() and itimer_delete(). They must clear ->it_process before unlock_timer() + release_posix_timer(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Oleg Nesterov	4b7a130426	posix timers: timer_delete: remove the bogus "->it_process != NULL" check sys_timer_delete() and itimer_delete() check "timer->it_process != NULL", this looks completely bogus. ->it_process == NULL means that this timer is already under destruction or it is not fully initialized, this must not happen. sys_timer_delete: the timer is locked, and lock_timer() can't succeed if ->it_process == NULL. itimer_delete: it is called by exit_itimers() when there are no other threads which can play with signal_struct->posix_timers. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Cc: john stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Lai Jiangshan	da5ef6bb96	cpuset: two minor code-cleanups In cpuset_update_task_memory_state() local variable struct task_struct tsk = current; And local variable tsk is used 14 times and statement task_cs(tsk) is used twice in this function. So using task_cs(tsk) instead of task_cs(current) is better for readability. And "(struct cgroup_scanner )&scan" is not good for readability also. (and "container_of" is used in cpuset_do_move_task(), not "(cpuset_hotplug_scanner *)scan") Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Lai Jiangshan	0241248377	cpuset: code-cleanup for started_after cgroup(cgroup_scan_tasks) will initialize heap->gt for us. This patch removes started_after() and its helper-function. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Lai Jiangshan	489a5393a2	cpuset: don't pass empty cpumasks to partition_sched_domains() I create lots of empty cpusets(empty cpumasks) and turn off the "sched_load_balance" in top cpuset. I found that all these empty cpumasks are passed to partition_sched_domains() in rebuild_sched_domains(), it's very time-consuming for partition_sched_domains() and it's not need. It also reduce memory consumed and some works in rebuild_sched_domains() too. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Li Zefan	c372e817af	cpuset: avoid unnecessary sched domains rebuilding When changing 'sched_relax_domain_level', don't rebuild sched domains if 'cpus' is empty or 'sched_load_balance' is not set. Also make the comments of rebuild_sched_domains() more readable. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Miao Xie	f9b4fb8dab	cpusets: update task's cpus_allowed and mems_allowed after CPU/NODE offline/online The bug is that a task may run on the cpu/node which is not in its cpuset.cpus/ cpuset.mems. It can be reproduced by the following commands: ----------------------------------- # mkdir /dev/cpuset # mount -t cpuset xxx /dev/cpuset # mkdir /dev/cpuset/0 # echo 0-1 > /dev/cpuset/0/cpus # echo 0 > /dev/cpuset/0/mems # echo $$ > /dev/cpuset/0/tasks # echo 0 > /sys/devices/system/cpu/cpu1/online # echo 1 > /sys/devices/system/cpu/cpu1/online ----------------------------------- There is only CPU0 in cpuset.cpus, but the task in this cpuset runs on both CPU0 and CPU1. It is because the task's cpu_allowed didn't get updated after we did CPU offline/online manipulation. Similar for mem_allowed. This patch fixes this bug expect for root cpuset. Because there is a problem about root cpuset, in that whether it is necessary to update all the tasks in root cpuset or not after cpu/node offline/online. If updating, some kernel threads which is bound into a specified cpu will be unbound. If not updating, there is a bug in root cpuset. This bug is also caused by offline/online manipulation. For example, there is a dual-cpu machine. we create a sub cpuset in root cpuset and assign 1 to its cpus. And then we attach some tasks into this sub cpuset. After this, we offline CPU1. Now, the tasks in this new cpuset are moved into root cpuset automatically because there is no cpu in sub cpuset. Then we online CPU1, we find all the tasks which doesn't belong to root cpuset originally just run on CPU0. Maybe we need to add a flag in the task_struct to mark which task can't be unbound? Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Paul Jackson <pj@sgi.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:38 -07:00
Miao Xie	0b2f630a28	cpusets: restructure the function update_cpumask() and update_nodemask() Extract two functions from update_cpumask() and update_nodemask().They will be used later for updating tasks' cpus_allowed and mems_allowed after CPU/NODE offline/online. [lizf@cn.fujitsu.com: build fix] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:37 -07:00
Serge E. Hallyn	e885dcde75	cgroup_clone: use pid of newly created task for new cgroup cgroup_clone creates a new cgroup with the pid of the task. This works correctly for unshare, but for clone cgroup_clone is called from copy_namespaces inside copy_process, which happens before the new pid is created. As a result, the new cgroup was created with current's pid. This patch: 1. Moves the call inside copy_process to after the new pid is created 2. Passes the struct pid into ns_cgroup_clone (as it is not yet attached to the task) 3. Passes a name from ns_cgroup_clone() into cgroup_clone() so as to keep cgroup_clone() itself simpler 4. Uses pid_vnr() to get the process id value, so that the pid used to name the new cgroup is always the pid as it would be known to the task which did the cloning or unsharing. I think that is the most intuitive thing to do. This way, task t1 does clone(CLONE_NEWPID) to get t2, which does clone(CLONE_NEWPID) to get t3, then the cgroup for t3 will be named for the pid by which t2 knows t3. (Thanks to Dan Smith for finding the main bug) Changelog: June 11: Incorporate Paul Menage's feedback: don't pass NULL to ns_cgroup_clone from unshare, and reduce patch size by using 'nodename' in cgroup_clone. June 10: Original version [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Serge Hallyn <serge@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Tested-by: Dan Smith <danms@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:37 -07:00
Paul Menage	856c13aa1f	cgroup files: convert res_counter_write() to be a cgroups write_string() handler Currently res_counter_write() is a raw file handler even though it's ultimately taking a number, since in some cases it wants to pre-process the string when converting it to a number. This patch converts res_counter_write() from a raw file handler to a write_string() handler; this allows some of the boilerplate copying/locking/checking to be removed, and simplies the cleanup path, since these functions are now performed by the cgroups framework. [lizf@cn.fujitsu.com: build fix] Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:36 -07:00
Paul Menage	e371239532	cgroup files: remove cpuset_common_file_write() This patch tweaks the signatures of the update_cpumask() and update_nodemask() functions so that they can be called directly as handlers for the new cgroups write_string() method. This allows cpuset_common_file_write() to be removed. Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:36 -07:00
Paul Menage	af351026aa	cgroup files: turn attach_task_by_pid directly into a cgroup write handler This patch changes attach_task_by_pid() to take a u64 rather than a string; as a result it can be called directly as a control groups write_u64 handler, and cgroup_common_file_write() can be removed. Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:36 -07:00
Paul Menage	6379c10615	cgroup files: move notify_on_release file to separate write handler This patch moves the write handler for the cgroups notify_on_release file into a separate handler. This handler requires no cgroups locking since it relies on atomic bitops for synchronization. Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:36 -07:00
Paul Menage	84eea84288	cgroups: misc cleanups to write_string patchset This patch contains cleanups suggested by reviewers for the recent write_string() patchset: - pair cgroup_lock_live_group() with cgroup_unlock() in cgroup.c for clarity, rather than directly unlocking cgroup_mutex. - make the return type of cgroup_lock_live_group() a bool - use a #define'd constant for the local buffer size in read/write functions Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Paul Menage	e788e066c6	cgroup files: move the release_agent file to use typed handlers Adds cgroup_release_agent_write() and cgroup_release_agent_show() methods to handle writing/reading the path to a cgroup hierarchy's release agent. As a result, cgroup_common_file_read() is now unnecessary. As part of the change, a previously-tolerated race in cgroup_release_agent() is avoided by copying the current release_agent_path prior to calling call_usermode_helper(). Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Paul Menage	db3b14978a	cgroup files: add write_string cgroup control file method This patch adds a write_string() method for cgroups control files. The semantics are that a buffer is copied from userspace to kernelspace and the handler function invoked on that buffer. The buffer is guaranteed to be nul-terminated, and no longer than max_write_len (defaulting to 64 bytes if unspecified). Later patches will convert existing raw file write handlers in control group subsystems to use this method. Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@in.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Li Zefan	8947f9d5b3	cgroups: annotate two variables with __read_mostly - need_forkexit_callback will be read only after system boot. - use_task_css_set_links will be read only after it's set. And these 2 variables are checked when a new process is forked. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
KOSAKI Motohiro	71cbb949d1	cgroup: list_for_each cleanup -------------------------- while() { list_entry(); ... } -------------------------- is equivalent to following code. -------------------------- list_for_each_entry(){ ... } -------------------------- later can review easily more. this patch is just clean up. it doesn't have any behavor change. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Li Zefan	7e9abd89cb	cgroup: use read lock to guard find_existing_css_set() The function does not modify anything (except the temporary css template), so it's sufficient to hold read lock. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:35 -07:00
Abhishek Sagar	8b6dd98682	kprobes: remove redundant config check I noticed that there's a CONFIG_KPROBES check inside kernel/kprobes.c, which is redundant. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:30 -07:00
Srinivasa D S	ef53d9c5e4	kprobes: improve kretprobe scalability with hashed locking Currently list of kretprobe instances are stored in kretprobe object (as used_instances,free_instances) and in kretprobe hash table. We have one global kretprobe lock to serialise the access to these lists. This causes only one kretprobe handler to execute at a time. Hence affects system performance, particularly on SMP systems and when return probe is set on lot of functions (like on all systemcalls). Solution proposed here gives fine-grain locks that performs better on SMP system compared to present kretprobe implementation. Solution: 1) Instead of having one global lock to protect kretprobe instances present in kretprobe object and kretprobe hash table. We will have two locks, one lock for protecting kretprobe hash table and another lock for kretporbe object. 2) We hold lock present in kretprobe object while we modify kretprobe instance in kretprobe object and we hold per-hash-list lock while modifying kretprobe instances present in that hash list. To prevent deadlock, we never grab a per-hash-list lock while holding a kretprobe lock. 3) We can remove used_instances from struct kretprobe, as we can track used instances of kretprobe instances using kretprobe hash table. Time duration for kernel compilation ("make -j 8") on a 8-way ppc64 system with return probes set on all systemcalls looks like this. cacheline non-cacheline Un-patched kernel aligned patch aligned patch =============================================================================== real 9m46.784s 9m54.412s 10m2.450s user 40m5.715s 40m7.142s 40m4.273s sys 2m57.754s 2m58.583s 3m17.430s =========================================================== Time duration for kernel compilation ("make -j 8) on the same system, when kernel is not probed. ========================= real 9m26.389s user 40m8.775s sys 2m7.283s ========================= Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com> Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:30 -07:00
Dave Young	717115e1a5	printk ratelimiting rewrite All ratelimit user use same jiffies and burst params, so some messages (callbacks) will be lost. For example: a call printk_ratelimit(5 * HZ, 1) b call printk_ratelimit(5 * HZ, 1) before the 5*HZ timeout of a, then b will will be supressed. - rewrite __ratelimit, and use a ratelimit_state as parameter. Thanks for hints from andrew. - Add WARN_ON_RATELIMIT, update rcupreempt.h - remove __printk_ratelimit - use __ratelimit in net_ratelimit Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:29 -07:00
Arjan van de Ven	7a2c477069	kernel/irq/manage.c: replace a printk + WARN_ON() to a WARN() Replace a printk+WARN_ON() by a WARN(); this increases the chance of the string making it into the bugreport (ie: it goes inside the ---[ cut here ]--- section) Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:29 -07:00
Arjan van de Ven	a8f18b909c	Add a WARN() macro; this is WARN_ON() + printk arguments Add a WARN() macro that acts like WARN_ON(), with the added feature that it takes a printk like argument that is printed as part of the warning message. [akpm@linux-foundation.org: fix printk arguments] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Greg KH <greg@kroah.com> Cc: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:29 -07:00
FUJITA Tomonori	b69c49b784	clean up duplicated alloc/free_thread_info We duplicate alloc/free_thread_info defines on many platforms (the majority uses __get_free_pages/free_pages). This patch defines common defines and removes these duplicated defines. __HAVE_ARCH_THREAD_INFO_ALLOCATOR is introduced for platforms that do something different. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:28 -07:00
KOSAKI Motohiro	ac331d158e	call_usermodehelper(): increase reliability Presently call_usermodehelper_setup() uses GFP_ATOMIC. but it can return NULL _very_ easily. GFP_ATOMIC is needed only when we can't sleep. and, GFP_KERNEL is robust and better. thus, I add gfp_mask argument to call_usermodehelper_setup(). So, its callers pass the gfp_t as below: call_usermodehelper() and call_usermodehelper_keys(): depend on 'wait' argument. call_usermodehelper_pipe(): always GFP_KERNEL because always run under process context. orderly_poweroff(): pass to GFP_ATOMIC because may run under interrupt context. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: "Paul Menage" <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:28 -07:00
Adrian Bunk	b03f6489f9	build kernel/profile.o only when requested Build kernel/profile.o only if CONFIG_PROFILING is enabled. This makes CONFIG_PROFILING=n kernels smaller. As a bonus, some profile_tick() calls and one branch from schedule() are now eliminated with CONFIG_PROFILING=n (but I doubt these are measurable effects). This patch changes the effects of CONFIG_PROFILING=n, but I don't think having more than two choices would be the better choice. This patch also adds the name of the first parameter to the prototypes of profile_{hits,tick}() since I anyway had to add them for the dummy functions. Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:27 -07:00
Vegard Nossum	2fc9c4e18f	kallsyms: fix potential overflow in binary search This will probably never trigger... but it won't hurt to be careful. http://googleresearch.blogspot.com/2006/06/extra-extra-read-all-about-it-nearly.html Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: Joshua Bloch <jjb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:27 -07:00
Wang Chen	5df439ef06	flag parameters: fix compile error of sys_epoll_create1 GEN .version CHK include/linux/compile.h UPD include/linux/compile.h CC init/version.o LD init/built-in.o LD vmlinux arch/x86/kernel/built-in.o: In function `sys_call_table': (.rodata+0x8a4): undefined reference to `sys_epoll_create1' make: *** [vmlinux] Error 1 Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:26 -07:00
Ingo Molnar	10a010f695	Merge branch 'linus' into x86/x2apic Conflicts: drivers/pci/dmar.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-25 13:08:16 +02:00
Linus Torvalds	ecc8b655b3	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: nohz: adjust tick_nohz_stop_sched_tick() call of s390 as well nohz: prevent tick stop outside of the idle loop	2008-07-24 12:55:01 -07:00
Linus Torvalds	2528ce3237	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: arch/mips/kernel/stacktrace.c: Heiko can't type kthread: reduce stack pressure in create_kthread and kthreadd fix core/stacktrace changes on avr32, mips, sh	2008-07-24 12:54:26 -07:00
Linus Torvalds	8ffa5b6596	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: clean up compiler warning sched: fix hrtick & generic-ipi dependency	2008-07-24 12:53:51 -07:00
Ulrich Drepper	be61a86d72	flag parameters: NONBLOCK in pipe This patch adds O_NONBLOCK support to pipe2. It is minimally more involved than the patches for eventfd et.al but still trivial. The interfaces of the create_write_pipe and create_read_pipe helper functions were changed and the one other caller as well. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_pipe2 # ifdef __x86_64__ # define __NR_pipe2 293 # elif defined __i386__ # define __NR_pipe2 331 # else # error "need __NR_pipe2" # endif #endif int main (void) { int fds[2]; if (syscall (__NR_pipe2, fds, 0) == -1) { puts ("pipe2(0) failed"); return 1; } for (int i = 0; i < 2; ++i) { int fl = fcntl (fds[i], F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { printf ("pipe2(0) set non-blocking mode for fds[%d]\n", i); return 1; } close (fds[i]); } if (syscall (__NR_pipe2, fds, O_NONBLOCK) == -1) { puts ("pipe2(O_NONBLOCK) failed"); return 1; } for (int i = 0; i < 2; ++i) { int fl = fcntl (fds[i], F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { printf ("pipe2(O_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i); return 1; } close (fds[i]); } puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	4006553b06	flag parameters: inotify_init This patch introduces the new syscall inotify_init1 (note: the 1 stands for the one parameter the syscall takes, as opposed to no parameter before). The values accepted for this parameter are function-specific and defined in the inotify.h header. Here the values must match the O_* flags, though. In this patch CLOEXEC support is introduced. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_CLOEXEC O_CLOEXEC int main (void) { int fd; fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("inotify_init1(0) set close-on-exit"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_CLOEXEC); if (fd == -1) { puts ("inotify_init1(IN_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	b087498eb5	flag parameters: eventfd This patch adds the new eventfd2 syscall. It extends the old eventfd syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is EFD_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name EFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_eventfd2 # ifdef __x86_64__ # define __NR_eventfd2 290 # elif defined __i386__ # define __NR_eventfd2 328 # else # error "need __NR_eventfd2" # endif #endif #define EFD_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_eventfd2, 1, 0); if (fd == -1) { puts ("eventfd2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("eventfd2(0) sets close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC); if (fd == -1) { puts ("eventfd2(EFD_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	9deb27baed	flag parameters: signalfd This patch adds the new signalfd4 syscall. It extends the old signalfd syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is SFD_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name SFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_signalfd4 # ifdef __x86_64__ # define __NR_signalfd4 289 # elif defined __i386__ # define __NR_signalfd4 327 # else # error "need __NR_signalfd4" # endif #endif #define SFD_CLOEXEC O_CLOEXEC int main (void) { sigset_t ss; sigemptyset (&ss); sigaddset (&ss, SIGUSR1); int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0); if (fd == -1) { puts ("signalfd4(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("signalfd4(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_CLOEXEC); if (fd == -1) { puts ("signalfd4(SFD_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("signalfd4(SFD_CLOEXEC) does not set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	aaca0bdca5	flag parameters: paccept This patch is by far the most complex in the series. It adds a new syscall paccept. This syscall differs from accept in that it adds (at the userlevel) two additional parameters: - a signal mask - a flags value The flags parameter can be used to set flag like SOCK_CLOEXEC. This is imlpemented here as well. Some people argued that this is a property which should be inherited from the file desriptor for the server but this is against POSIX. Additionally, we really want the signal mask parameter as well (similar to pselect, ppoll, etc). So an interface change in inevitable. The flag value is the same as for socket and socketpair. I think diverging here will only create confusion. Similar to the filesystem interfaces where the use of the O_* constants differs, it is acceptable here. The signal mask is handled as for pselect etc. The mask is temporarily installed for the thread and removed before the call returns. I modeled the code after pselect. If there is a problem it's likely also in pselect. For architectures which use socketcall I maintained this interface instead of adding a system call. The symmetry shouldn't be broken. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <errno.h> #include <fcntl.h> #include <pthread.h> #include <signal.h> #include <stdio.h> #include <unistd.h> #include <netinet/in.h> #include <sys/socket.h> #include <sys/syscall.h> #ifndef __NR_paccept # ifdef __x86_64__ # define __NR_paccept 288 # elif defined __i386__ # define SYS_PACCEPT 18 # define USE_SOCKETCALL 1 # else # error "need __NR_paccept" # endif #endif #ifdef USE_SOCKETCALL # define paccept(fd, addr, addrlen, mask, flags) \ ({ long args[6] = { \ (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \ syscall (__NR_socketcall, SYS_PACCEPT, args); }) #else # define paccept(fd, addr, addrlen, mask, flags) \ syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags) #endif #define PORT 57392 #define SOCK_CLOEXEC O_CLOEXEC static pthread_barrier_t b; static void * tf (void arg) { pthread_barrier_wait (&b); int s = socket (AF_INET, SOCK_STREAM, 0); struct sockaddr_in sin; sin.sin_family = AF_INET; sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK); sin.sin_port = htons (PORT); connect (s, (const struct sockaddr ) &sin, sizeof (sin)); close (s); pthread_barrier_wait (&b); s = socket (AF_INET, SOCK_STREAM, 0); sin.sin_port = htons (PORT); connect (s, (const struct sockaddr ) &sin, sizeof (sin)); close (s); pthread_barrier_wait (&b); pthread_barrier_wait (&b); sleep (2); pthread_kill ((pthread_t) arg, SIGUSR1); return NULL; } static void handler (int s) { } int main (void) { pthread_barrier_init (&b, NULL, 2); struct sockaddr_in sin; pthread_t th; if (pthread_create (&th, NULL, tf, (void ) pthread_self ()) != 0) { puts ("pthread_create failed"); return 1; } int s = socket (AF_INET, SOCK_STREAM, 0); int reuse = 1; setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse)); sin.sin_family = AF_INET; sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK); sin.sin_port = htons (PORT); bind (s, (struct sockaddr *) &sin, sizeof (sin)); listen (s, SOMAXCONN); pthread_barrier_wait (&b); int s2 = paccept (s, NULL, 0, NULL, 0); if (s2 < 0) { puts ("paccept(0) failed"); return 1; } int coe = fcntl (s2, F_GETFD); if (coe & FD_CLOEXEC) { puts ("paccept(0) set close-on-exec-flag"); return 1; } close (s2); pthread_barrier_wait (&b); s2 = paccept (s, NULL, 0, NULL, SOCK_CLOEXEC); if (s2 < 0) { puts ("paccept(SOCK_CLOEXEC) failed"); return 1; } coe = fcntl (s2, F_GETFD); if ((coe & FD_CLOEXEC) == 0) { puts ("paccept(SOCK_CLOEXEC) does not set close-on-exec flag"); return 1; } close (s2); pthread_barrier_wait (&b); struct sigaction sa; sa.sa_handler = handler; sa.sa_flags = 0; sigemptyset (&sa.sa_mask); sigaction (SIGUSR1, &sa, NULL); sigset_t ss; pthread_sigmask (SIG_SETMASK, NULL, &ss); sigaddset (&ss, SIGUSR1); pthread_sigmask (SIG_SETMASK, &ss, NULL); sigdelset (&ss, SIGUSR1); alarm (4); pthread_barrier_wait (&b); errno = 0 ; s2 = paccept (s, NULL, 0, &ss, 0); if (s2 != -1 \|\| errno != EINTR) { puts ("paccept did not fail with EINTR"); return 1; } close (s); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: make it compile] [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Roland McGrath <roland@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Uwe Kleine-König	82736f4d1d	generic irqs: handle failure of irqchip->set_type in setup_irq set_type returns an int indicating success or failure, but up to now setup_irq ignores that. In my case this resulted in a machine hang: gpio-keys requested IRQF_TRIGGER_RISING \| IRQF_TRIGGER_FALLING, but arm/ns9xxx can only trigger on one direction so set_type didn't touch the configuration which happens do default to a level sensitiveness and returned -EINVAL. setup_irq ignored that and unmasked the irq. This resulted in an endless triggering of the gpio-key interrupt service routine which effectively killed the machine. With this patch applied setup_irq propagates the error to the caller. Note that before in the case chip && !chip->set_type && !chip->name a NULL pointer was feed to printk. This is fixed, too. Signed-off-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:24 -07:00
David Howells	f0af566da6	pm: fix try_to_freeze_tasks()'s use of do_div() Fix try_to_freeze_tasks()'s use of do_div() on an s64 by making elapsed_csecs64 a u64 instead and dividing that. Possibly this should be guarded lest the interval calculation turn up negative, but the possible negativity of the result of the division is cast away anyway. This was introduced by patch `438e2ce68d`. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:24 -07:00
Zhang Rui	2f15fc4bdf	pm: schedule sysrq poweroff on boot cpu schedule sysrq poweroff on boot cpu. sysrq poweroff needs to disable nonboot cpus, and we need to run this on boot cpu to avoid any recursion. http://bugzilla.kernel.org/show_bug.cgi?id=10897 [kosaki.motohiro@jp.fujitsu.com: build fix] Signed-off-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Rus <harbour@sfinx.od.ua> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:24 -07:00
Zhang Rui	c1a220e7ac	pm: introduce new interfaces schedule_work_on() and queue_work_on() This interface allows adding a job on a specific cpu. Although a work struct on a cpu will be scheduled to other cpu if the cpu dies, there is a recursion if a work task tries to offline the cpu it's running on. we need to schedule the task to a specific cpu in this case. http://bugzilla.kernel.org/show_bug.cgi?id=10897 [oleg@tv-sign.ru: cleanups] Signed-off-by: Zhang Rui <rui.zhang@intel.com> Tested-by: Rus <harbour@sfinx.od.ua> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:23 -07:00
Akinobu Mita	0d83304c7e	pm: hibernation: simplify memory bitmap This patch simplifies the memory bitmap manipulations. - remove the member size in struct bm_block It is not necessary for struct bm_block to have the number of bit chunks that can be calculated by using end_pfn and start_pfn. - use find_next_bit() for memory_bm_next_pfn No need to invent the bitmap library only for the memory bitmap. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:23 -07:00
David Brownell	77437fd4e6	pm: boot time suspend selftest Boot-time test for system suspend states (STR or standby). The generic RTC framework triggers wakeup alarms, which are used to exit those states. - Measures some aspects of suspend time ... this uses "jiffies" until someone converts it to use a timebase that works properly even while timer IRQs are disabled. - Triggered by a command line parameter. By default nothing even vaguely troublesome will happen, but "test_suspend=mem" will give you a brief STR test during system boot. (Or you may need to use "test_suspend=standby" instead, if your hardware needs that.) This isn't without problems. It fires early enough during boot that for example both PCMCIA and MMC stacks have misbehaved. The workaround in those cases was to boot without such media cards inserted. [matthltc@us.ibm.com: fix compile failure in boot time suspend selftest] Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Machek <pavel@suse.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:22 -07:00
Pavel Machek	0d63081d41	swsusp: provide users with a hint about the no_console_suspend option Tell the user about the no_console_suspend option, so that we don't have to tell each bug reporter personally. [akpm@linux-foundation.org: clarify the text a little] Signed-off-by: Pavel Machek <pavel@suse.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:22 -07:00
Andrew G. Morgan	ab763c7112	security: filesystem capabilities refactor kernel code To date, we've tried hard to confine filesystem support for capabilities to the security modules. This has left a lot of the code in kernel/capability.c in a state where it looks like it supports something that filesystem support for capabilities actually suppresses when the LSM security/commmoncap.c code runs. What is left is a lot of code that uses sub-optimal locking in the main kernel With this change we refactor the main kernel code and make it explicit which locks are needed and that the only remaining kernel races in this area are associated with non-filesystem capability code. Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:22 -07:00

... 12 13 14 15 16 ...

5759 Commits