linux

Author	SHA1	Message	Date
Dan Carpenter	24a461d537	trace: strlen() return doesn't account for the NULL We need to add one to the strlen() return because of the NULL character. The type->name here generally comes from the kernel and I don't think any of them come close to being MAX_TRACER_SIZE (100) characters long so this is basically a cleanup. Signed-off-by: Dan Carpenter <error27@gmail.com> LKML-Reference: <20100710100644.GV19184@bicker> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-22 14:56:41 -04:00
Jason Wessel	edd63cb6b9	sysrq,kdb: Use __handle_sysrq() for kdb's sysrq function The kdb code should not toggle the sysrq state in case an end user wants to try and resume the normal kernel execution. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>	2010-07-21 19:27:07 -05:00
Jason Wessel	b0679c63db	debug_core,kdb: fix kgdb_connected bit set in the wrong place Immediately following an exit from the kdb shell the kgdb_connected variable should be set to zero, unless there are breakpoints planted. If the kgdb_connected variable is not zeroed out with kdb, it is impossible to turn off kdb. This patch is merely a work around for now, the real fix will check for the breakpoints. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:07 -05:00
Jason Wessel	9e8b624fca	Fix merge regression from external kdb to upstream kdb In the process of merging kdb to the mainline, the kdb lsmod command stopped printing the base load address of kernel modules. This is needed for using kdb in conjunction with external tools such as gdb. Simply restore the functionality by adding a kdb_printf for the base load address of the kernel modules. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:06 -05:00
Jason Wessel	fb82c0ff27	repair gdbstub to match the gdbserial protocol specification The gdbserial protocol handler should return an empty packet instead of an error string when ever it responds to a command it does not implement. The problem cases come from a debugger client sending qTBuffer, qTStatus, qSearch, qSupported. The incorrect response from the gdbstub leads the debugger clients to not function correctly. Recent versions of gdb will not detach correctly as a result of this behavior. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>	2010-07-21 19:27:05 -05:00
Martin Hicks	1396a21ba0	kdb: break out of kdb_ll() when command is terminated Without this patch the "ll" linked-list traversal command won't terminate when you hit q/Q. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:05 -05:00
Josh Hunt	eebef74695	sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug The sched_child_runs_first value in /proc/sched_debug is currently displayed using a macro meant to split ns time values. This patch uses the correct macro to display it as a plain decimal value. Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: peterz@infradead.org Cc: juhlenko@akamai.com Cc: efault@gmx.de LKML-Reference: <1279567876-25418-1-git-send-email-johunt@akamai.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:46:12 +02:00
Ingo Molnar	dca45ad8af	Merge branch 'linus' into sched/core Merge reason: Move from the -rc3 to the almost-rc6 base. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:45:08 +02:00
Ingo Molnar	23c2875725	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-07-21 21:44:18 +02:00
Ingo Molnar	9dcdbf7a33	Merge branch 'linus' into perf/core Merge reason: Pick up the latest perf fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:43:06 +02:00
KOSAKI Motohiro	ef710e100c	tracing: Shrink max latency ringbuffer if unnecessary Documentation/trace/ftrace.txt says buffer_size_kb: This sets or displays the number of kilobytes each CPU buffer can hold. The tracer buffers are the same size for each CPU. The displayed number is the size of the CPU buffer and not total size of all buffers. The trace buffers are allocated in pages (blocks of memory that the kernel uses for allocation, usually 4 KB in size). If the last page allocated has room for more bytes than requested, the rest of the page will be used, making the actual allocation bigger than requested. ( Note, the size may not be a multiple of the page size due to buffer management overhead. ) This can only be updated when the current_tracer is set to "nop". But it's incorrect. currently total memory consumption is 'buffer_size_kb x CPUs x 2'. Why two times difference is there? because ftrace implicitly allocate the buffer for max latency too. That makes sad result when admin want to use large buffer. (If admin want full logging and makes detail analysis). example, If admin have 24 CPUs machine and write 200MB to buffer_size_kb, the system consume ~10GB memory (200MB x 24 x 2). umm.. 5GB memory waste is usually unacceptable. Fortunatelly, almost all users don't use max latency feature. The max latency buffer can be disabled easily. This patch shrink buffer size of the max latency buffer if unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> LKML-Reference: <20100701104554.DA2D.A69D9226@jp.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-21 10:20:17 -04:00
Lai Jiangshan	bc289ae98b	tracing: Reduce latency and remove percpu trace_seq __print_flags() and __print_symbolic() use percpu trace_seq: 1) Its memory is allocated at compile time, it wastes memory if we don't use tracing. 2) It is percpu data and it wastes more memory for multi-cpus system. 3) It disables preemption when it executes its core routine "trace_seq_printf(s, "%s: ", #call);" and introduces latency. So we move this trace_seq to struct trace_iterator. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4C078350.7090106@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 22:05:34 -04:00
Richard Kennedy	985023dee6	trace: Reorder struct ring_buffer_per_cpu to remove padding on 64bit Reorder structure to remove 8 bytes of padding on 64 bit builds. This shrinks the size to 128 bytes so allowing allocation from a smaller slab & needed one fewer cache lines. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> LKML-Reference: <1269516456.2054.8.camel@localhost> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 21:58:44 -04:00
Li Zefan	e870e9a124	tracing: Allow to disable cmdline recording We found that even enabling a single trace event that will rarely be triggered can add big overhead to context switch. (lmbench context switch test) ------------------------------------------------- 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ------ ------ ------ ------ ------ ------- ------- 2.19 2.3 2.21 2.56 2.13 2.54 2.07 2.39 2.51 2.35 2.75 2.27 2.81 2.24 The overhead is 6% ~ 11%. It's because when a trace event is enabled 3 tracepoints (sched_switch, sched_wakeup, sched_wakeup_new) will be activated to map pid to cmdname. We'd like to avoid this overhead, so add a trace option '(no)record-cmd' to allow to disable cmdline recording. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4C2D57F4.2050204@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 21:52:33 -04:00
Neil Horman	70d4bf6d46	drop_monitor: convert some kfree_skb call sites to consume_skb Convert a few calls from kfree_skb to consume_skb Noticed while I was working on dropwatch that I was detecting lots of internal skb drops in several places. While some are legitimate, several were not, freeing skbs that were at the end of their life, rather than being discarded due to an error. This patch converts those calls sites from using kfree_skb to consume_skb, which quiets the in-kernel drop_monitor code from detecting them as drops. Tested successfully by myself Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-20 13:28:05 -07:00
Tejun Heo	f2e005aaff	workqueue: fix mayday_mask handling on UP All cpumasks are assumed to have cpu 0 permanently set on UP, so it can't be used to signify whether there's something to be done for the CPU. workqueue was using cpumask to track which CPU requested rescuer assistance and this led rescuer thread to think there always are pending mayday requests on UP, which resulted in infinite busy loops. This patch fixes the problem by introducing mayday_mask_t and associated helpers which wrap cpumask on SMP and emulates its behavior using bitops and unsigned long on UP. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au>	2010-07-20 15:59:09 +02:00
Arnd Bergmann	b444786f1a	tracing: Use generic_file_llseek for debugfs The default for llseek will change to no_llseek, so the tracing debugfs files need to add explicit .llseek assignments. Since we're dealing with regular files from a VFS perspective, use generic_file_llseek. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1278538820-1392-10-git-send-email-arnd@arndb.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-07-20 14:31:24 +02:00
Frederic Weisbecker	eb7beb5c09	tracing: Remove special traces Special traces type was only used by sysprof. Lets remove it now that sysprof ftrace plugin has been dropped. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Soeren Sandmann <sandmann@daimi.au.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2010-07-20 14:31:07 +02:00
Frederic Weisbecker	f376bf5ffb	tracing: Remove sysprof ftrace plugin The sysprof ftrace plugin doesn't seem to be seriously used somewhere. There is a branch in the sysprof tree that makes an interface to it, but the real sysprof tool uses either its own module or perf events. Drop the sysprof ftrace plugin then, as it's mostly useless. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Soeren Sandmann <sandmann@daimi.au.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2010-07-20 14:29:46 +02:00
Tejun Heo	931ac77ef6	workqueue: fix build problem on !CONFIG_SMP Commit `f3421797` (workqueue: implement unbound workqueue) incorrectly tested CONFIG_SMP as part of a C expression in alloc/free_cwqs(). As CONFIG_SMP is not defined in UP, this breaks build. Fix it by using Found during linux-next build test. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>	2010-07-20 11:15:14 +02:00
Catalin Marinas	9078370c0d	kmemleak: Add support for NO_BOOTMEM configurations With commits `08677214` and `59be5a8e`, alloc_bootmem()/free_bootmem() and friends use the early_res functions for memory management when NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the corresponding code paths for bootmem allocations. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: stable@kernel.org	2010-07-19 11:54:15 +01:00
Pavel Machek	a2531293db	update email address pavel@suse.cz no longer works, replace it with working address. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-07-19 10:56:54 +02:00
Dan Kruchinin	5e017dc3f8	padata: Added sysfs primitives to padata subsystem Added sysfs primitives to padata subsystem. Now API user may embedded kobject each padata instance contains into any sysfs hierarchy. For now padata sysfs interface provides only two objects: serial_cpumask [RW] - cpumask for serial workers parallel_cpumask [RW] - cpumask for parallel workers Signed-off-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-19 13:50:19 +08:00
Dan Kruchinin	e15bacbebb	padata: Make two separate cpumasks The aim of this patch is to make two separate cpumasks for padata parallel and serial workers respectively. It allows user to make more thin and sophisticated configurations of padata framework. For example user may bind parallel and serial workers to non-intersecting CPU groups to gain better performance. Also each padata instance has notifiers chain for its cpumasks now. If either parallel or serial or both masks were changed all interested subsystems will get notification about that. It's especially useful if padata user uses algorithm for callback CPU selection according to serial cpumask. Signed-off-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-19 13:50:19 +08:00
Rafael J. Wysocki	ce4410116c	PM / Suspend: Fix ordering of calls in suspend error paths The ACPI suspend code calls suspend_nvs_free() at a wrong place, which may lead to a memory leak if there's an error executing acpi_pm_prepare(), because acpi_pm_finish() will not be called in that case. However, the root cause of this problem is the apparently confusing ordering of calls in suspend error paths that needs to be fixed. In addition to that, fix a typo in a label name in suspend.c. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
Rafael J. Wysocki	d074ee023f	PM / Hibernate: Fix snapshot error code path There is an inconsistency between hibernation_platform_enter() and hibernation_snapshot(), because the latter calls hibernation_ops->end() after failing hibernation_ops->begin(), while the former doesn't do that. Make hibernation_snapshot() behave in the same way as hibernation_platform_enter() in that respect. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
Rafael J. Wysocki	f6f71f1875	PM / Hibernate: Fix hibernation_platform_enter() The hibernation_platform_enter() function calls dpm_suspend_noirq() instead of dpm_resume_noirq() by mistake. Fix this. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
James Bottomley	82f682514a	pm_qos: Get rid of the allocation in pm_qos_add_request() All current users of pm_qos_add_request() have the ability to supply the memory required by the pm_qos routines, so make them do this and eliminate the kmalloc() with pm_qos_add_request(). This has the double benefit of making the call never fail and allowing it to be called from atomic context. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 02:00:34 +02:00
James Bottomley	5f279845f9	pm_qos: Reimplement using plists A lot of the pm_qos extremal value handling is really duplicating what a priority ordered list does, just in a less efficient fashion. Simply redoing the implementation in terms of a plist gets rid of a lot of this junk (although there are several other strange things that could do with tidying up, like pm_qos_request_list has to carry the pm_qos_class with every node, simply because it doesn't get passed in to pm_qos_update_request even though every caller knows full well what parameter it's updating). I think this redo is a win independent of android, so we should do something like this now. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 02:00:18 +02:00
Rafael J. Wysocki	c125e96f04	PM: Make it possible to avoid races between wakeup and system sleep One of the arguments during the suspend blockers discussion was that the mainline kernel didn't contain any mechanisms making it possible to avoid races between wakeup and system suspend. Generally, there are two problems in that area. First, if a wakeup event occurs exactly when /sys/power/state is being written to, it may be delivered to user space right before the freezer kicks in, so the user space consumer of the event may not be able to process it before the system is suspended. Second, if a wakeup event occurs after user space has been frozen, it is not generally guaranteed that the ongoing transition of the system into a sleep state will be aborted. To address these issues introduce a new global sysfs attribute, /sys/power/wakeup_count, associated with a running counter of wakeup events and three helper functions, pm_stay_awake(), pm_relax(), and pm_wakeup_event(), that may be used by kernel subsystems to control the behavior of this attribute and to request the PM core to abort system transitions into a sleep state already in progress. The /sys/power/wakeup_count file may be read from or written to by user space. Reads will always succeed (unless interrupted by a signal) and return the current value of the wakeup events counter. Writes, however, will only succeed if the written number is equal to the current value of the wakeup events counter. If a write is successful, it will cause the kernel to save the current value of the wakeup events counter and to abort the subsequent system transition into a sleep state if any wakeup events are reported after the write has returned. [The assumption is that before writing to /sys/power/state user space will first read from /sys/power/wakeup_count. Next, user space consumers of wakeup events will have a chance to acknowledge or veto the upcoming system transition to a sleep state. Finally, if the transition is allowed to proceed, /sys/power/wakeup_count will be written to and if that succeeds, /sys/power/state will be written to as well. Still, if any wakeup events are reported to the PM core by kernel subsystems after that point, the transition will be aborted.] Additionally, put a wakeup events counter into struct dev_pm_info and make these per-device wakeup event counters available via sysfs, so that it's possible to check the activity of various wakeup event sources within the kernel. To illustrate how subsystems can use pm_wakeup_event(), make the low-level PCI runtime PM wakeup-handling code use it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>	2010-07-19 01:58:48 +02:00
Cesar Eduardo Barros	9013367339	PM / Hibernate: Fix typos in comments in kernel/power/swap.c There are a few typos in kernel/power/swap.c. Fix them. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 01:58:47 +02:00
Pekka Enberg	68c38fc3cb	sched: No need for bootmem special cases As of commit `dcce284` ("mm: Extend gfp masking to the page allocator") and commit `7e85ee0` ("slab,slub: don't enable interrupts during early boot"), the slab allocator makes sure we don't attempt to sleep during boot. Therefore, remove bootmem special cases from the scheduler and use plain GFP_KERNEL instead. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1279225102-2572-1-git-send-email-penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:06:22 +02:00
Peter Zijlstra	396e894d28	sched: Revert nohz_ratelimit() for now Norbert reported that nohz_ratelimit() causes his laptop to burn about 4W (40%) extra. For now back out the change and see if we can adjust the power management code to make better decisions. Reported-by: Norbert Preining <preining@logic.at> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:05:44 +02:00
Peter Zijlstra	bbc8cb5bae	sched: Reduce update_group_power() calls Currently we update cpu_power() too often, update_group_power() only updates the local group's cpu_power but it gets called for all groups. Furthermore, CPU_NEWLY_IDLE invocations will result in all cpus calling it, even though a slow update of cpu_power is sufficient. Therefore move the update under 'idle != CPU_NEWLY_IDLE && local_group' to reduce superfluous invocations. Reported-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <1278612989.1900.176.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:05:14 +02:00
Suresh Siddha	5343bdb8fd	sched: Update rq->clock for nohz balanced cpus Suresh spotted that we don't update the rq->clock in the nohz load-balancer path. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1278626014.2834.74.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:02:08 +02:00
Jiri Slaby	c022a0acad	rlimits: implement prlimit64 syscall This patch adds the code to support the sys_prlimit64 syscall which modifies-and-returns the rlim values of a selected process atomically. The first parameter, pid, being 0 means current process. Unlike the current implementation, it is a generic interface, architecture indepentent so that we needn't handle compat stuff anymore. In the future, after glibc start to use this we can deprecate sys_setrlimit and sys_getrlimit in favor to clean up the code finally. It also adds a possibility of changing limits of other processes. We check the user's permissions to do that and if it succeeds, the new limits are propagated online. This is good for large scale applications such as SAP or databases where administrators need to change limits time by time (e.g. on crashes increase core size). And it is unacceptable to restart the service. For safety, all rlim users now either use accessors or doesn't need them due to - locking - the fact a process was just forked and nobody else knows about it yet (and nobody can't thus read/write limits) hence it is safe to modify limits now. The limitation is that we currently stay at ulong internal representation. So the rlim64_is_infinity check is used where value is compared against ULONG_MAX on 32-bit which is the maximum value there. And since internally the limits are held in struct rlimit, converters which are used before and after do_prlimit call in sys_prlimit64 are introduced. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	b95183453a	rlimits: switch more rlimit syscalls to do_prlimit After we added more generic do_prlimit, switch sys_getrlimit to that. Also switch compat handling, so we can get rid of ugly __user casts and avoid setting process' address limit to kernel data and back. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	5b41535aac	rlimits: redo do_setrlimit to more generic do_prlimit It now allows also reading of limits. I.e. all read and writes will later use this function. It takes two parameters, new and old limits which can be both NULL. If new is non-NULL, the value in it is set to rlimits. If old is non-NULL, current rlimits are stored there. If both are non-NULL, old are stored prior to setting the new ones, atomically. (Similar to sigaction.) Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	86f162f4c7	rlimits: do security check under task_lock Do security_task_setrlimit under task_lock. Other tasks may change limits under our hands while we are checking limits inside the function. From now on, they can't. Note that all the security work is done under a spinlock here now. Security hooks count with that, they are called from interrupt context (like security_task_kill) and with spinlocks already held (e.g. capable->security_capable). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: James Morris <jmorris@namei.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>	2010-07-16 09:48:47 +02:00
Jiri Slaby	1c1e618ddd	rlimits: allow setrlimit to non-current tasks Add locking to allow setrlimit accept task parameter other than current. Namely, lock tasklist_lock for read and check whether the task structure has sighand non-null. Do all the signal processing under that lock still held. There are some points: 1) security_task_setrlimit is now called with that lock held. This is not new, many security_* functions are called with this lock held already so it doesn't harm (all this security_* stuff does almost the same). 2) task->sighand->siglock (in update_rlimit_cpu) is nested in tasklist_lock. This dependence is already existing. 3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, already existing dependence. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com>	2010-07-16 09:48:47 +02:00
Jiri Slaby	7855c35da7	rlimits: split sys_setrlimit Create do_setrlimit from sys_setrlimit and declare do_setrlimit in the resource header. This is the first phase to have generic do_prlimit which allows to be called from read, write and compat rlimits code. The new do_setrlimit also accepts a task pointer to change the limits of. Currently, it cannot be other than current, but this will change with locking later. Also pass tsk->group_leader to security_task_setrlimit to check whether current is allowed to change rlimits of the process and not its arbitrary thread because it makes more sense given that rlimit are per process and not per-thread. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:46 +02:00
Oleg Nesterov	2fb9d2689a	rlimits: make sure ->rlim_max never grows in sys_setrlimit Mostly preparation for Jiri's changes, but probably makes sense anyway. sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but when it takes task_lock() old_rlim->rlim_max can be already lowered. Move this check under task_lock(). Currently this is not important, we can only race with our sub-thread, this means the application is stupid. But when we change the code to allow the update of !current task's limits, it becomes important to make sure ->rlim_max can be lowered "reliably" even if we race with the application doing sys_setrlimit(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:46 +02:00
Jiri Slaby	5ab46b345e	rlimits: add task_struct to update_rlimit_cpu Add task_struct as a parameter to update_rlimit_cpu to be able to set rlimit_cpu of different task than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: James Morris <jmorris@namei.org>	2010-07-16 09:48:45 +02:00
Jiri Slaby	8fd00b4d70	rlimits: security, add task_struct to setrlimit Add task_struct to task_setrlimit of security_operations to be able to set rlimit of task other than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Eric Paris <eparis@redhat.com> Acked-by: James Morris <jmorris@namei.org>	2010-07-16 09:48:45 +02:00
Frederic Weisbecker	5d550467b9	tracing: Remove ksym tracer The ksym (breakpoint) ftrace plugin has been superseded by perf tools that are much more poweful to use the cpu breakpoints. This tracer doesn't bring more feature. It has been deprecated for a while now, lets remove it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu>	2010-07-15 23:59:33 +02:00
Steffen Klassert	5f1a8c1bc7	padata: simplify serialization mechanism We count the number of processed objects on a percpu basis, so we need to go through all the percpu reorder queues to calculate the sequence number of the next object that needs serialization. This patch changes this to count the number of processed objects global. So we can calculate the sequence number and the percpu reorder queue of the next object that needs serialization without searching through the percpu reorder queues. This avoids some accesses to memory of foreign cpus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:30 +08:00
Steffen Klassert	83f619f3c8	padata: make padata_do_parallel to return zero on success To return -EINPROGRESS on success in padata_do_parallel was considered to be odd. This patch changes this to return zero on success. Also the only user of padata, pcrypt is adapted to convert a return of zero to -EINPROGRESS within the crypto layer. This also removes the pcrypt fallback if padata_do_parallel was called on a not running padata instance as we can't handle it anymore. This fallback was unused, so it's save to remove it. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	33e5445068	padata: Handle empty padata cpumasks This patch fixes a bug when the padata cpumask does not intersect with the active cpumask. In this case we get a division by zero in padata_alloc_pd and we end up with a useless padata instance. Padata can end up with an empty cpumask for two reasons: 1. A user removed the last cpu that belongs to the padata cpumask and the active cpumask. 2. The last cpu that belongs to the padata cpumask and the active cpumask goes offline. We introduce a function padata_validate_cpumask to check if the padata cpumask does intersect with the active cpumask. If the cpumasks do not intersect we mark the instance as invalid, so it can't be used. We do not allocate the cpumask dependend recources in this case. This fixes the division by zero and keeps the padate instance in a consistent state. It's not possible to trigger this bug by now because the only padata user, pcrypt uses always the possible cpumask. Reported-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	ee83655512	padata: Block until the instance is unused on stop This patch makes padata_stop to block until the padata instance is unused. Also we split padata_stop to a locked and a unlocked version. This is in preparation to be able to change the cpumask after a call to patata stop. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	4c87917029	padata: Check for valid padata instance on start This patch introduces the PADATA_INVALID flag which is checked on padata start. This will be used to mark a padata instance as invalid, if the padata cpumask does not intersect with the active cpumask. we change padata_start to return an error if the PADATA_INVALID is set. Also we adapt the only padata user, pcrypt to this change. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:28 +08:00
Tejun Heo	9f9c23644b	workqueue: fix locking in retry path of maybe_create_worker() maybe_create_worker() mismanaged locking when worker creation fails and it has to retry. Fix locking and simplify lock manipulation. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Yong Zhang <yong.zhang@windriver.com>	2010-07-14 11:31:20 +02:00
Tejun Heo	083b804c4d	async: use workqueue for worker pool Replace private worker pool with system_unbound_wq. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Arjan van de Ven <arjan@infradead.org>	2010-07-14 11:29:46 +02:00
Uwe Kleine-König	698f93159a	fix comment/printk typos concerning "already" Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-07-11 21:45:40 +02:00
Arnd Bergmann	5e3d20a68f	init: Remove the BKL from startup code I have shown by code review that no driver takes the BKL at init time any more, so whatever the init code was locking against is no longer there and it is now safe to remove the BKL there. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Steven Rostedt <rostedt@goodmis> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-07-09 15:40:32 +02:00
Benjamin Herrenschmidt	5f07aa7524	Merge commit 'paulus-perf/master' into next	2010-07-09 11:25:48 +10:00
Kulikov Vasiliy	eb703f9819	kernel/watchdog: Initialize 'result' Variable on the stack is not initialized to zero, do it explicitly. This bug was found by a compiler warning: kernel/watchdog.c:463: warning: 'result' may be used uninitialized in this function Signed-off-by: Kulikov Vasiliy <segooon@gmail.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1278316854-28442-1-git-send-email-segooon@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-07 08:46:42 +02:00
Ingo Molnar	8bd0e1be25	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core	2010-07-06 09:00:32 +02:00
Masami Hiramatsu	e09c8614b3	tracing/kprobes: Support "string" type Support string type tracing and printing in kprobe-tracer. This allows user to trace string data in kernel including __user data. Note that sometimes __user data may not be accessed if it is paged-out (sorry, but kprobes operation should be done in atomic, we can not wait for page-in). Commiter note: Fixed up conflicts with `b7e2ece`. Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100519195724.2885.18788.stgit@localhost6.localdomain6> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-07-05 15:54:45 -03:00
Ingo Molnar	08f8ba0799	Merge commit 'v2.6.35-rc4' into perf/core Merge reason: Pick up the latest perf fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-05 08:30:58 +02:00
Yehuda Sadeh	ff49d74ad3	module: initialize module dynamic debug later We should initialize the module dynamic debug datastructures only after determining that the module is not loaded yet. This fixes a bug that introduced in 2.6.35-rc2, where when a trying to load a module twice, we also load it's dynamic printing data twice which causes all sorts of nasty issues. Also handle the dynamic debug cleanup later on failure. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed a #ifdef) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-07-04 20:17:22 -07:00
Linus Torvalds	123f94f22e	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Cure nr_iowait_cpu() users init: Fix comment init, sched: Fix race between init and kthreadd	2010-07-02 09:52:58 -07:00
Tejun Heo	c7fc77f78f	workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead WQ_SINGLE_CPU combined with @max_active of 1 is used to achieve full ordering among works queued to a workqueue. The same can be achieved using WQ_UNBOUND as unbound workqueues always use the gcwq for WORK_CPU_UNBOUND. As @max_active is always one and benefits from cpu locality isn't accessible anyway, serving them with unbound workqueues should be fine. Drop WQ_SINGLE_CPU support and use WQ_UNBOUND instead. Note that most single thread workqueue users will be converted to use multithread or non-reentrant instead and only the ones which require strict ordering will keep using WQ_UNBOUND + @max_active of 1. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 11:00:08 +02:00
Tejun Heo	f34217977d	workqueue: implement unbound workqueue This patch implements unbound workqueue which can be specified with WQ_UNBOUND flag on creation. An unbound workqueue has the following properties. * It uses a dedicated gcwq with a pseudo CPU number WORK_CPU_UNBOUND. This gcwq is always online and disassociated. * Workers are not bound to any CPU and not concurrency managed. Works are dispatched to workers as soon as possible and the only applied limitation is @max_active. IOW, all unbound workqeueues are implicitly high priority. Unbound workqueues can be used as simple execution context provider. Contexts unbound to any cpu are served as soon as possible. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Howells <dhowells@redhat.com>	2010-07-02 11:00:02 +02:00
Tejun Heo	bdbc5dd7de	workqueue: prepare for WQ_UNBOUND implementation In preparation of WQ_UNBOUND addition, make the following changes. * Add WORK_CPU_* constants for pseudo cpu id numbers used (currently only WORK_CPU_NONE) and use them instead of NR_CPUS. This is to allow another pseudo cpu id for unbound cpu. * Reorder WQ_* flags. * Make workqueue_struct->cpu_wq a union which contains a percpu pointer, regular pointer and an unsigned long value and use kzalloc/kfree() in UP allocation path. This will be used to implement unbound workqueues which will use only one cwq on SMPs. * Move alloc_cwqs() allocation after initialization of wq fields, so that alloc_cwqs() has access to wq->flags. * Trivial relocation of wq local variables in freeze functions. These changes don't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:59:57 +02:00
Tejun Heo	d313dd85ad	workqueue: fix worker management invocation without pending works When there's no pending work to do, worker_thread() goes back to sleep after waking up without checking whether worker management is necessary. This means that idle worker exit requests can be ignored if the gcwq stays empty. Fix it by making worker_thread() always check whether worker management is necessary before going to sleep. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	a1e453d279	workqueue: fix incorrect cpu number BUG_ON() in get_work_gcwq() get_work_gcwq() was incorrectly triggering BUG_ON() if cpu number is equal to or higher than num_possible_cpus() instead of nr_cpu_ids. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	4ce48b37bf	workqueue: fix race condition in flush_workqueue() When one flusher is cascading to the next flusher, it first sets wq->first_flusher to the next one and sets up the next flush cycle. If there's nothing to do for the next cycle, it clears wq->flush_flusher and proceeds to the one after that. If the woken up flusher checks wq->first_flusher before it gets cleared, it will incorrectly assume the role of the first flusher, which triggers BUG_ON() sanity check. Fix it by checking wq->first_flusher again after grabbing the mutex. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	cb44476699	workqueue: use worker_set/clr_flags() only from worker itself worker_set/clr_flags() assume that if none of NOT_RUNNING flags is set the worker must be contributing to nr_running which is only true if the worker is actually running. As when called from self, it is guaranteed that the worker is running, those functions can be safely used from the worker itself and they aren't necessary from other places anyway. Make the following changes to fix the bug. * Make worker_set/clr_flags() whine if not called from self. * Convert all places which called those functions from other tasks to manipulate flags directly. * Make trustee_thread() directly clear nr_running after setting WORKER_ROGUE on all workers. This is the only place where nr_running manipulation is necessary outside of workers themselves. * While at it, add sanity check for nr_running in worker_enter_idle(). Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:50 +02:00
Peter Zijlstra	8c215bd389	sched: Cure nr_iowait_cpu() users Commit `0224cf4c5e` (sched: Intoduce get_cpu_iowait_time_us()) broke things by not making sure preemption was indeed disabled by the callers of nr_iowait_cpu() which took the iowait value of the current cpu. This resulted in a heap of preempt warnings. Cure this by making nr_iowait_cpu() take a cpu number and fix up the callers to pass in the right number. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Jiri Slaby <jslaby@suse.cz> Cc: linux-pm@lists.linux-foundation.org LKML-Reference: <1277968037.1868.120.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-01 09:39:48 +02:00
Ingo Molnar	0a54cec0c2	Merge branch 'linus' into core/rcu Conflicts: fs/fs-writeback.c Merge reason: Resolve the conflict Note, i picked the version from Linus's tree, which effectively reverts the fs-writeback.c bits of: `b97181f`: fs: remove all rcu head initializations, except on_stack initializations As the upstream changes to this file changed this code heavily and the first attempt to resolve the conflict resulted in a non-booting kernel. It's safer to re-try this portion of the commit cleanly. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-01 09:31:25 +02:00
Michal Hocko	7a0ea09ad5	futex: futex_find_get_task remove credentails check futex_find_get_task is currently used (through lookup_pi_state) from two contexts, futex_requeue and futex_lock_pi_atomic. None of the paths looks it needs the credentials check, though. Different (e)uids shouldn't matter at all because the only thing that is important for shared futex is the accessibility of the shared memory. The credentail check results in glibc assert failure or process hang (if glibc is compiled without assert support) for shared robust pthread mutex with priority inheritance if a process tries to lock already held lock owned by a process with a different euid: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 \|\| !robust' failed. The problem is that futex_lock_pi_atomic which is called when we try to lock already held lock checks the current holder (tid is stored in the futex value) to get the PI state. It uses lookup_pi_state which in turn gets task struct from futex_find_get_task. ESRCH is returned either when the task is not found or if credentials check fails. futex_lock_pi_atomic simply returns if it gets ESRCH. glibc code, however, doesn't expect that robust lock returns with ESRCH because it should get either success or owner died. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-30 15:43:44 -07:00
Pavan Naregundi	e05bd3367b	kexec: fix Oops in crash_shrink_memory() When crashkernel is not enabled, "echo 0 > /sys/kernel/kexec_crash_size" OOPSes the kernel in crash_shrink_memory. This happens when crash_shrink_memory tries to release the 'crashk_res' resource which are not reserved. Also value of "/sys/kernel/kexec_crash_size" shows as 1, which should be 0. This patch fixes the OOPS in crash_shrink_memory and shows "/sys/kernel/kexec_crash_size" as 0 when crash kernel memory is not reserved. Signed-off-by: Pavan Naregundi <pavan@linux.vnet.ibm.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-29 15:29:31 -07:00
Michael Neuling	2ec57d448b	sched: Fix spelling of sibling No logic changes, only spelling. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: linuxppc-dev@ozlabs.org Cc: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <15249.1277776921@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-29 10:44:29 +02:00
Tejun Heo	fb0e7beb5c	workqueue: implement cpu intensive workqueue This patch implements cpu intensive workqueue which can be specified with WQ_CPU_INTENSIVE flag on creation. Works queued to a cpu intensive workqueue don't participate in concurrency management. IOW, it doesn't contribute to gcwq->nr_running and thus doesn't delay excution of other works. Note that although cpu intensive works won't delay other works, they can be delayed by other works. Combine with WQ_HIGHPRI to avoid being delayed by other works too. As the name suggests this is useful when using workqueue for cpu intensive works. Workers executing cpu intensive works are not considered for workqueue concurrency management and left for the scheduler to manage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-06-29 10:07:15 +02:00
Tejun Heo	649027d73a	workqueue: implement high priority workqueue This patch implements high priority workqueue which can be specified with WQ_HIGHPRI flag on creation. A high priority workqueue has the following properties. * A work queued to it is queued at the head of the worklist of the respective gcwq after other highpri works, while normal works are always appended at the end. * As long as there are highpri works on gcwq->worklist, [__]need_more_worker() remains %true and process_one_work() wakes up another worker before it start executing a work. The above two properties guarantee that works queued to high priority workqueues are dispatched to workers and start execution as soon as possible regardless of the state of other works. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	dcd989cb73	workqueue: implement several utility APIs Implement the following utility APIs. workqueue_set_max_active() : adjust max_active of a wq workqueue_congested() : test whether a wq is contested work_cpu() : determine the last / current cpu of a work work_busy() : query whether a work is busy * Anton Blanchard fixed missing ret initialization in work_busy(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Anton Blanchard <anton@samba.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	d320c03830	workqueue: s/__create_workqueue()/alloc_workqueue()/, and add system workqueues This patch makes changes to make new workqueue features available to its users. * Now that workqueue is more featureful, there should be a public workqueue creation function which takes paramters to control them. Rename __create_workqueue() to alloc_workqueue() and make 0 max_active mean WQ_DFL_ACTIVE. In the long run, all create_workqueue_() will be converted over to alloc_workqueue(). To further unify access interface, rename keventd_wq to system_wq and export it. * Add system_long_wq and system_nrt_wq. The former is to host long running works separately (so that flush_scheduled_work() dosen't take so long) and the latter guarantees any queued work item is never executed in parallel by multiple CPUs. These will be used by future patches to update workqueue users. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	b71ab8c202	workqueue: increase max_active of keventd and kill current_is_keventd() Define WQ_MAX_ACTIVE and create keventd with max_active set to half of it which means that keventd now can process upto WQ_MAX_ACTIVE / 2 - 1 works concurrently. Unless some combination can result in dependency loop longer than max_active, deadlock won't happen and thus it's unnecessary to check whether current_is_keventd() before trying to schedule a work. Kill current_is_keventd(). (Lockdep annotations are broken. We need lock_map_acquire_read_norecurse()) Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com>	2010-06-29 10:07:14 +02:00
Tejun Heo	e22bee782b	workqueue: implement concurrency managed dynamic worker pool Instead of creating a worker for each cwq and putting it into the shared pool, manage per-cpu workers dynamically. Works aren't supposed to be cpu cycle hogs and maintaining just enough concurrency to prevent work processing from stalling due to lack of processing context is optimal. gcwq keeps the number of concurrent active workers to minimum but no less. As long as there's one or more running workers on the cpu, no new worker is scheduled so that works can be processed in batch as much as possible but when the last running worker blocks, gcwq immediately schedules new worker so that the cpu doesn't sit idle while there are works to be processed. gcwq always keeps at least single idle worker around. When a new worker is necessary and the worker is the last idle one, the worker assumes the role of "manager" and manages the worker pool - ie. creates another worker. Forward-progress is guaranteed by having dedicated rescue workers for workqueues which may be necessary while creating a new worker. When the manager is having problem creating a new worker, mayday timer activates and rescue workers are summoned to the cpu and execute works which might be necessary to create new workers. Trustee is expanded to serve the role of manager while a CPU is being taken down and stays down. As no new works are supposed to be queued on a dead cpu, it just needs to drain all the existing ones. Trustee continues to try to create new workers and summon rescuers as long as there are pending works. If the CPU is brought back up while the trustee is still trying to drain the gcwq from the previous offlining, the trustee will kill all idles ones and tell workers which are still busy to rebind to the cpu, and pass control over to gcwq which assumes the manager role as necessary. Concurrency managed worker pool reduces the number of workers drastically. Only workers which are necessary to keep the processing going are created and kept. Also, it reduces cache footprint by avoiding unnecessarily switching contexts between different workers. Please note that this patch does not increase max_active of any workqueue. All workqueues can still only process one work per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	d302f01782	workqueue: implement worker_{set\|clr}_flags() Implement worker_{set\|clr}_flags() to manipulate worker flags. These are currently simple wrappers but logics to track the current worker state and the current level of concurrency will be added. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	7e11629d0e	workqueue: use shared worklist and pool all workers per cpu Use gcwq->worklist instead of cwq->worklist and break the strict association between a cwq and its worker. All works queued on a cpu are queued on gcwq->worklist and processed by any available worker on the gcwq. As there no longer is strict association between a cwq and its worker, whether a work is executing can now only be determined by calling [__]find_worker_executing_work(). After this change, the only association between a cwq and its worker is that a cwq puts a worker into shared worker pool on creation and kills it on destruction. As all workqueues are still limited to max_active of one, this means that there are always at least as many workers as active works and thus there's no danger for deadlock. The break of strong association between cwqs and workers requires somewhat clumsy changes to current_is_keventd() and destroy_workqueue(). Dynamic worker pool management will remove both clumsy changes. current_is_keventd() won't be necessary at all as the only reason it exists is to avoid queueing a work from a work which will be allowed just fine. The clumsy part of destroy_workqueue() is added because a worker can only be destroyed while idle and there's no guarantee a worker is idle when its wq is going down. With dynamic pool management, workers are not associated with workqueues at all and only idle ones will be submitted to destroy_workqueue() so the code won't be necessary anymore. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	18aa9effad	workqueue: implement WQ_NON_REENTRANT With gcwq managing all the workers and work->data pointing to the last gcwq it was on, non-reentrance can be easily implemented by checking whether the work is still running on the previous gcwq on queueing. Implement it. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	7a22ad757e	workqueue: carry cpu number in work data once execution starts To implement non-reentrant workqueue, the last gcwq a work was executed on must be reliably obtainable as long as the work structure is valid even if the previous workqueue has been destroyed. To achieve this, work->data will be overloaded to carry the last cpu number once execution starts so that the previous gcwq can be located reliably. This means that cwq can't be obtained from work after execution starts but only gcwq. Implement set_work_{cwq\|cpu}(), get_work_[g]cwq() and clear_work_data() to set work data to the cpu number when starting execution, access the overloaded work data and clear it after cancellation. queue_delayed_work_on() is updated to preserve the last cpu while in-flight in timer and other callers which depended on getting cwq from work after execution starts are converted to depend on gcwq instead. * Anton Blanchard fixed compile error on powerpc due to missing linux/threads.h include. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Anton Blanchard <anton@samba.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	8cca0eea39	workqueue: add find_worker_executing_work() and track current_cwq Now that all the workers are tracked by gcwq, we can find which worker is executing a work from gcwq. Implement find_worker_executing_work() and make worker track its current_cwq so that we can find things the other way around. This will be used to implement non-reentrant wqs. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	502ca9d819	workqueue: make single thread workqueue shared worker pool friendly Reimplement st (single thread) workqueue so that it's friendly to shared worker pool. It was originally implemented by confining st workqueues to use cwq of a fixed cpu and always having a worker for the cpu. This implementation isn't very friendly to shared worker pool and suboptimal in that it ends up crossing cpu boundaries often. Reimplement st workqueue using dynamic single cpu binding and cwq->limit. WQ_SINGLE_THREAD is replaced with WQ_SINGLE_CPU. In a single cpu workqueue, at most single cwq is bound to the wq at any given time. Arbitration is done using atomic accesses to wq->single_cpu when queueing a work. Once bound, the binding stays till the workqueue is drained. Note that the binding is never broken while a workqueue is frozen. This is because idle cwqs may have works waiting in delayed_works queue while frozen. On thaw, the cwq is restarted if there are any delayed works or unbound otherwise. When combined with max_active limit of 1, single cpu workqueue has exactly the same execution properties as the original single thread workqueue while allowing sharing of per-cpu workers. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	db7bccf45c	workqueue: reimplement CPU hotplugging support using trustee Reimplement CPU hotplugging support using trustee thread. On CPU down, a trustee thread is created and each step of CPU down is executed by the trustee and workqueue_cpu_callback() simply drives and waits for trustee state transitions. CPU down operation no longer waits for works to be drained but trustee sticks around till all pending works have been completed. If CPU is brought back up while works are still draining, workqueue_cpu_callback() tells trustee to step down and tell workers to rebind to the cpu. As it's difficult to tell whether cwqs are empty if it's freezing or frozen, trustee doesn't consider draining to be complete while a gcwq is freezing or frozen (tracked by new GCWQ_FREEZING flag). Also, workers which get unbound from their cpu are marked with WORKER_ROGUE. Trustee based implementation doesn't bring any new feature at this point but it will be used to manage worker pool when dynamic shared worker pool is implemented. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	c8e55f3602	workqueue: implement worker states Implement worker states. After created, a worker is STARTED. While a worker isn't processing a work, it's IDLE and chained on gcwq->idle_list. While processing a work, a worker is BUSY and chained on gcwq->busy_hash. Also, gcwq now counts the number of all workers and idle ones. worker_thread() is restructured to reflect state transitions. cwq->more_work is removed and waking up a worker makes it check for events. A worker is killed by setting DIE flag while it's IDLE and waking it up. This gives gcwq better visibility of what's going on and allows it to find out whether a work is executing quickly which is necessary to have multiple workers processing the same cwq. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	8b03ae3cde	workqueue: introduce global cwq and unify cwq locks There is one gcwq (global cwq) per each cpu and all cwqs on an cpu point to it. A gcwq contains a lock to be used by all cwqs on the cpu and an ida to give IDs to workers belonging to the cpu. This patch introduces gcwq, moves worker_ida into gcwq and make all cwqs on the same cpu use the cpu's gcwq->lock instead of separate locks. gcwq->ida is now protected by gcwq->lock too. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	a0a1a5fd4f	workqueue: reimplement workqueue freeze using max_active Currently, workqueue freezing is implemented by marking the worker freezeable and calling try_to_freeze() from dispatch loop. Reimplement it using cwq->limit so that the workqueue is frozen instead of the worker. * workqueue_struct->saved_max_active is added which stores the specified max_active on initialization. * On freeze, all cwq->max_active's are quenched to zero. Freezing is complete when nr_active on all cwqs reach zero. * On thaw, all cwq->max_active's are restored to wq->saved_max_active and the worklist is repopulated. This new implementation allows having single shared pool of workers per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	1e19ffc63d	workqueue: implement per-cwq active work limit Add cwq->nr_active, cwq->max_active and cwq->delayed_work. nr_active counts the number of active works per cwq. A work is active if it's flushable (colored) and is on cwq's worklist. If nr_active reaches max_active, new works are queued on cwq->delayed_work and activated later as works on the cwq complete and decrement nr_active. cwq->max_active can be specified via the new @max_active parameter to __create_workqueue() and is set to 1 for all workqueues for now. As each cwq has only single worker now, this double queueing doesn't cause any behavior difference visible to its users. This will be used to reimplement freeze/thaw and implement shared worker pool. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	affee4b294	workqueue: reimplement work flushing using linked works A work is linked to the next one by having WORK_STRUCT_LINKED bit set and these links can be chained. When a linked work is dispatched to a worker, all linked works are dispatched to the worker's newly added ->scheduled queue and processed back-to-back. Currently, as there's only single worker per cwq, having linked works doesn't make any visible behavior difference. This change is to prepare for multiple shared workers per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	c34056a3fd	workqueue: introduce worker Separate out worker thread related information to struct worker from struct cpu_workqueue_struct and implement helper functions to deal with the new struct worker. The only change which is visible outside is that now workqueue worker are all named "kworker/CPUID:WORKERID" where WORKERID is allocated from per-cpu ida. This is in preparation of concurrency managed workqueue where shared multiple workers would be available per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:11 +02:00
Tejun Heo	73f53c4aa7	workqueue: reimplement workqueue flushing using color coded works Reimplement workqueue flushing using color coded works. wq has the current work color which is painted on the works being issued via cwqs. Flushing a workqueue is achieved by advancing the current work colors of cwqs and waiting for all the works which have any of the previous colors to drain. Currently there are 16 possible colors, one is reserved for no color and 15 colors are useable allowing 14 concurrent flushes. When color space gets full, flush attempts are batched up and processed together when color frees up, so even with many concurrent flushers, the new implementation won't build up huge queue of flushers which has to be processed one after another. Only works which are queued via __queue_work() are colored. Works which are directly put on queue using insert_work() use NO_COLOR and don't participate in workqueue flushing. Currently only works used for work-specific flush fall in this category. This new implementation leaves only cleanup_workqueue_thread() as the user of flush_cpu_workqueue(). Just make its users use flush_workqueue() and kthread_stop() directly and kill cleanup_workqueue_thread(). As workqueue flushing doesn't use barrier request anymore, the comment describing the complex synchronization around it in cleanup_workqueue_thread() is removed together with the function. This new implementation is to allow having and sharing multiple workers per cpu. Please note that one more bit is reserved for a future work flag by this patch. This is to avoid shifting bits and updating comments later. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:11 +02:00
Tejun Heo	0f900049cb	workqueue: update cwq alignement work->data field is used for two purposes. It points to cwq it's queued on and the lower bits are used for flags. Currently, two bits are reserved which is always safe as 4 byte alignment is guaranteed on every architecture. However, future changes will need more flag bits. On SMP, the percpu allocator is capable of honoring larger alignment (there are other users which depend on it) and larger alignment works just fine. On UP, percpu allocator is a thin wrapper around kzalloc/kfree() and don't honor alignment request. This patch introduces WORK_STRUCT_FLAG_BITS and implements alloc/free_cwqs() which guarantees max(1 << WORK_STRUCT_FLAG_BITS, __alignof__(unsigned long long) alignment both on SMP and UP. On SMP, simply wrapping percpu allocator is enough. On UP, extra space is allocated so that cwq can be aligned and the original pointer can be stored after it which is used in the free path. * Alignment problem on UP is reported by Michal Simek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Reported-by: Michal Simek <michal.simek@petalogix.com>	2010-06-29 10:07:11 +02:00
Tejun Heo	1537663f57	workqueue: kill cpu_populated_map Worker management is about to be overhauled. Simplify things by removing cpu_populated_map, creating workers for all possible cpus and making single threaded workqueues behave more like multi threaded ones. After this patch, all cwqs are always initialized, all workqueues are linked on the workqueues list and workers for all possibles cpus always exist. This also makes CPU hotplug support simpler - checking ->cpus_allowed before processing works in worker_thread() and flushing cwqs on CPU_POST_DEAD are enough. While at it, make get_cwq() always return the cwq for the specified cpu, add target_cwq() for cases where single thread distinction is necessary and drop all direct usage of per_cpu_ptr() on wq->cpu_wq. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:11 +02:00
Tejun Heo	6416669975	workqueue: temporarily remove workqueue tracing Strip tracing code from workqueue and remove workqueue tracing. This is temporary measure till concurrency managed workqueue is complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>	2010-06-29 10:07:11 +02:00
Tejun Heo	a62428c0ae	workqueue: separate out process_one_work() Separate out process_one_work() out of run_workqueue(). This patch doesn't cause any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:10 +02:00
Tejun Heo	22df02bb3f	workqueue: define masks for work flags and conditionalize STATIC flags Work flags are about to see more traditional mask handling. Define WORK_STRUCT__BIT as the bit position constant and redefine WORK_STRUCT_ as bit masks. Also, make WORK_STRUCT_STATIC_* flags conditional While at it, re-define these constants as enums and use WORK_STRUCT_STATIC instead of hard-coding 2 in WORK_DATA_STATIC_INIT(). Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:10 +02:00
Tejun Heo	97e37d7b9e	workqueue: merge feature parameters into flags Currently, __create_workqueue_key() takes @singlethread and @freezeable paramters and store them separately in workqueue_struct. Merge them into a single flags parameter and field and use WQ_FREEZEABLE and WQ_SINGLE_THREAD. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:10 +02:00
Tejun Heo	4690c4ab56	workqueue: misc/cosmetic updates Make the following updates in preparation of concurrency managed workqueue. None of these changes causes any visible behavior difference. * Add comments and adjust indentations to data structures and several functions. * Rename wq_per_cpu() to get_cwq() and swap the position of two parameters for consistency. Convert a direct per_cpu_ptr() access to wq->cpu_wq to get_cwq(). * Add work_static() and Update set_wq_data() such that it sets the flags part to WORK_STRUCT_PENDING \| WORK_STRUCT_STATIC if static \| @extra_flags. * Move santiy check on work->entry emptiness from queue_work_on() to __queue_work() which all queueing paths share. * Make __queue_work() take @cpu and @wq instead of @cwq. * Restructure flush_work() and __create_workqueue_key() to make them easier to modify. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:10 +02:00
Tejun Heo	c790bce048	workqueue: kill RT workqueue With stop_machine() converted to use cpu_stop, RT workqueue doesn't have any user left. Kill RT workqueue support. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:09 +02:00
Tejun Heo	82805ab77d	kthread: implement kthread_data() Implement kthread_data() which takes @task pointing to a kthread and returns @data specified when creating the kthread. The caller is responsible for ensuring the validity of @task when calling this function. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:09 +02:00
Tejun Heo	b56c0d8937	kthread: implement kthread_worker Implement simple work processor for kthread. This is to ease using kthread. Single thread workqueue used to be used for things like this but workqueue won't guarantee fixed kthread association anymore to enable worker sharing. This can be used in cases where specific kthread association is necessary, for example, when it should have RT priority or be assigned to certain cgroup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-06-29 10:07:09 +02:00
Steven Rostedt	a1d0ce8213	tracing: Use class->reg() for all registering of events Because kprobes and syscalls need special processing to register events, the class->reg() method was created to handle the differences. But instead of creating a default ->reg for perf and ftrace events, the code was scattered with: if (class->reg) class->reg(); else default_reg(); This is messy and can also lead to bugs. This patch cleans up this code and creates a default reg() entry for the events allowing for the code to directly call the class->reg() without the condition. Reported-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 21:13:14 -04:00
Chase Douglas	d62f85d1e2	tracing/function-graph: Use correct string size for snprintf The nsecs_str string is a local variable defined as: char nsecs_str[5]; It is possible for the snprintf call to use a size value larger than the size of the string. This should not cause a buffer overrun as it is written now due to the value for the string format "%03lu" can not be larger than 1000. However, this change makes it correct. By making the size correct we guard against potential future changes that could actually cause a buffer overrun. Signed-off-by: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1276619355-18116-1-git-send-email-chase.douglas@canonical.com> [ added 'UL' to number 8 to fix gcc warning comparing it to sizeof() ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 21:11:39 -04:00
Li Zefan	67ead0a6ce	tracing: Remove open-coded __trace_add_event_call() Let trace_module_add_events() and event_trace_init() call __trace_add_event_call(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA37E9.1020106@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:55 -04:00
Li Zefan	ffb9f99528	tracing: Remove redundant raw_init callbacks raw_init callback is optional. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA37D4.7070500@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:53 -04:00
Li Zefan	c9d932cf8a	tracing: Remove test of NULL define_fields callback Every event (or event class) has it's define_fields callback, so the test is redundant. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA37BC.8080707@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:52 -04:00
Li Zefan	8728fe501e	tracing: Don't allocate common fields for every trace events Every event has the same common fields, so it's a big waste of memory to have a copy of those fields for every event. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA3759.30105@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:46 -04:00
Li Zefan	c9642c49aa	tracing: Use a global field list for all syscall exit events All syscall exit events have the same fields. The kernel size drops 2.5K: text data bss dec hex filename 7018612 2034376 7251132 16304120 f8c7f8 vmlinux.o.orig 7018612 2031888 7251132 16301632 f8be40 vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BFA3746.8070100@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-28 17:12:44 -04:00
Thomas Gleixner	f384c954c9	Merge branch 'linus' into perf/core Reason: Further changes conflict with upstream fixes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-06-28 22:33:24 +02:00
Linus Torvalds	5904b3b81d	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix undeclared ENOSYS in include/linux/tracepoint.h perf record: prevent kill(0, SIGTERM); perf session: Remove threads from tree on PERF_RECORD_EXIT perf/tracing: Fix regression of perf losing kprobe events perf_events: Fix Intel Westmere event constraints perf record: Don't call newt functions when not initialized	2010-06-28 12:24:43 -07:00
Linus Torvalds	f3866db8f7	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Deal with desc->set_type() changing desc->chip	2010-06-28 12:23:12 -07:00
Linus Torvalds	f014d937d6	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Prevent compiler from optimising the sched_avg_update() loop sched: Fix over-scheduling bug sched: Fix PROVE_RCU vs cpu_cgroup	2010-06-28 12:18:30 -07:00
Linus Torvalds	cf91b415c8	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: nohz: Fix nohz ratelimit	2010-06-28 12:18:02 -07:00
Linus Torvalds	e6cb6281ef	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: silence PROVE_RCU in sched_fork() idr: fix RCU lockdep splat in idr_get_next() rcu: apply RCU protection to wake_affine()	2010-06-28 12:17:40 -07:00
Will Deacon	0d98bb2656	sched: Prevent compiler from optimising the sched_avg_update() loop GCC 4.4.1 on ARM has been observed to replace the while loop in sched_avg_update with a call to uldivmod, resulting in the following build failure at link-time: kernel/built-in.o: In function `sched_avg_update': kernel/sched.c:1261: undefined reference to `__aeabi_uldivmod' kernel/sched.c:1261: undefined reference to `__aeabi_uldivmod' make: *** [.tmp_vmlinux1] Error 1 This patch introduces a fake data hazard to the loop body to prevent the compiler optimising the loop away. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-25 16:11:50 +02:00
Frederic Weisbecker	45a73372ef	hw_breakpoints: Fix per task breakpoint tracking Freeing a perf event can happen in several ways. A task calls perf_event_exit_task() right before exiting. This helper will detach all the events from the task context and queue their removal through free_event() if they are child tasks. The task also loses its context reference there. Releasing the breakpoint slot from the constraint table is made from free_event() that calls release_bp_slot(). We count the number of breakpoints this task is running by looking at the task's perf_event_ctxp and iterating through its attached events. But at this time, the reference to this context has been cleaned up already. So looking at the event->ctx instead of task->perf_event_ctxp to count the remaining breakpoints should solve the problem. At least it would for child breakpoints, but not for parent ones. If the parent exits before the child, it will remove all its events from the context but free_event() will be called later, on fd release time. And checking the number of breakpoints the task has attached to its context at this time is unreliable as all events have been removed from the context. To solve this, we keep track of the list of per task breakpoints. On top of it, we maintain our array of numbers of breakpoints used by the tasks. We use the context address as a task id. So, instead of looking at the number of events attached to a context, we walk through our list of per task breakpoints and count the number of breakpoints that use the same ctx than the one to be reserved or released from the constraint table, and update the count on top of this result. In the meantime it solves a bad refcounting, it also solves a warning, reported by Paul. Badness at /home/paulus/kernel/perf/kernel/hw_breakpoint.c:114 NIP: c0000000000cb470 LR: c0000000000cb46c CTR: c00000000032d9b8 REGS: c000000118e7b570 TRAP: 0700 Not tainted (2.6.35-rc3-perf-00008-g76b0f13 ) MSR: 9000000000029032 <EE,ME,CE,IR,DR> CR: 44004424 XER: 000fffff TASK = c0000001187dcad0[3143] 'perf' THREAD: c000000118e78000 CPU: 1 GPR00: c0000000000cb46c c000000118e7b7f0 c0000000009866a0 0000000000000020 GPR04: 0000000000000000 000000000000001d 0000000000000000 0000000000000001 GPR08: c0000000009bed68 c00000000086dff8 c000000000a5bf10 0000000000000001 GPR12: 0000000024004422 c00000000ffff200 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000018 00000000101150f4 GPR20: 0000000010206b40 0000000000000000 0000000000000000 00000000101150f4 GPR24: c0000001199090c0 0000000000000001 0000000000000000 0000000000000001 GPR28: 0000000000000000 0000000000000000 c0000000008ec290 0000000000000000 NIP [c0000000000cb470] .task_bp_pinned+0x5c/0x12c LR [c0000000000cb46c] .task_bp_pinned+0x58/0x12c Call Trace: [c000000118e7b7f0] [c0000000000cb46c] .task_bp_pinned+0x58/0x12c (unreliable) [c000000118e7b8a0] [c0000000000cb584] .toggle_bp_task_slot+0x44/0xe4 [c000000118e7b940] [c0000000000cb6c8] .toggle_bp_slot+0xa4/0x164 [c000000118e7b9f0] [c0000000000cbafc] .release_bp_slot+0x44/0x6c [c000000118e7ba80] [c0000000000c4178] .bp_perf_event_destroy+0x10/0x24 [c000000118e7bb00] [c0000000000c4aec] .free_event+0x180/0x1bc [c000000118e7bbc0] [c0000000000c54c4] .perf_event_release_kernel+0x14c/0x170 Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Jason Wessel <jason.wessel@windriver.com>	2010-06-24 23:33:40 +02:00
Peter Zijlstra	8695159967	sched: silence PROVE_RCU in sched_fork() Because cgroup_fork() is ran before sched_fork() [ from copy_process() ] and the child's pid is not yet visible the child is pinned to its cgroup. Therefore we can silence this warning. A nicer solution would be moving cgroup_fork() to right after dup_task_struct() and exclude PF_STARTING from task_subsys_state(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-06-23 15:14:09 -07:00
Daniel J Blueman	f3b577dec1	rcu: apply RCU protection to wake_affine() The task_group() function returns a pointer that must be protected by either RCU, the ->alloc_lock, or the cgroup lock (see the rcu_dereference_check() in task_subsys_state(), which is invoked by task_group()). The wake_affine() function currently does none of these, which means that a concurrent update would be within its rights to free the structure returned by task_group(). Because wake_affine() uses this structure only to compute load-balancing heuristics, there is no reason to acquire either of the two locks. Therefore, this commit introduces an RCU read-side critical section that starts before the first call to task_group() and ends after the last use of the "tg" pointer returned from task_group(). Thanks to Li Zefan for pointing out the need to extend the RCU read-side critical section from that proposed by the original patch. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-06-23 06:50:44 -07:00
K.Prasad	f7136c5150	hw_breakpoints: Allow arch-specific cleanup before breakpoint unregistration Certain architectures (such as PowerPC) have a need to clean up data structures before a breakpoint is unregistered. This introduces an arch-specific hook in release_bp_slot() along with a weak definition in the form of a stub function. Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul Mackerras <paulus@samba.org>	2010-06-22 19:40:50 +10:00
Tejun Heo	0b2e918aa9	sched, cpuset: Drop __cpuexit from cpu hotplug callbacks Commit `3a101d05` (sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining) added hotplug notifiers marked with __cpuexit; however, ia64 drops text in __cpuexit during link unlike x86. This means that functions which are referenced during init but used only for cpu hot unplugging afterwards shouldn't be marked with __cpuexit. Drop __cpuexit from those functions. Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4C1FDF5B.1040301@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-22 08:07:39 +02:00
Ingo Molnar	646b1db495	Merge commit 'v2.6.35-rc3' into perf/core Merge reason: Go from -rc1 base to -rc3 base, merge in fixes.	2010-06-18 10:53:19 +02:00
Oleg Nesterov	8d1f431cbe	sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check() fastpath_timer_check()->thread_group_cputimer() is racy and unneeded. It is racy because another thread can clear ->running before thread_group_cputimer() takes cputimer->lock. In this case thread_group_cputimer() will set ->running = true again and call thread_group_cputime(). But since we do not hold tasklist or siglock, we can race with fork/exit and copy the wrong results into cputimer->cputime. It is unneeded because if ->running == true we can just use the numbers in cputimer->cputime we already have. Change fastpath_timer_check() to copy cputimer->cputime into the local variable under cputimer->lock. We do not re-check ->running under cputimer->lock, run_posix_cpu_timers() does this check later. Note: we can add more optimizations on top of this change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100611180446.GA13025@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:46:57 +02:00
Oleg Nesterov	0bdd2ed413	sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand() run_posix_cpu_timers() doesn't work if current has already passed exit_notify(). This was needed to prevent the races with do_wait(). Since `ea6d290c` ->signal is always valid and can't go away. We can remove the "tsk->exit_state == 0" in fastpath_timer_check() and convert run_posix_cpu_timers() to use lock_task_sighand(). Note: it makes sense to take group_leader's sighand instead, the sub-thread still uses CPU after release_task(). But we need more changes to do this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610231018.GA25942@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:46:57 +02:00
Oleg Nesterov	bfac700918	sched: thread_group_cputime: Simplify, document the "alive" check thread_group_cputime() looks as if it is rcu-safe, but in fact this was wrong until `ea6d290c` which pins task->signal to task_struct. It checks ->sighand != NULL under rcu, but this can't help if ->signal can go away. Fortunately the caller either holds ->siglock, or it is fastpath_timer_check() which uses current and checks exit_state == 0. - Since `ea6d290c` commit tsk->signal is stable, we can read it first and avoid the initialization from INIT_CPUTIME. - Even if tsk->signal is always valid, we still have to check it is safe to use next_thread() under rcu_read_lock(). Currently the code checks ->sighand != NULL, change it to use pid_alive() which is commonly used to ensure the task wasn't unhashed before we take rcu_read_lock(). Add the comment to explain this check. - Change the main loop to use the while_each_thread() helper. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230956.GA25921@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:46:56 +02:00
Oleg Nesterov	48286d5088	sched: Remove the obsolete exit_state/signal hacks account_group_xxx() functions check ->exit_state to ensure that current->signal is valid and can't go away. This is not needed since `ea6d290c`, task->signal is pinned to task_struct. The comment and another hack in account_group_exec_runtime() refers to task_rq_unlock_wait() which was already removed by `b7b8ff63`. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230952.GA25914@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:46:56 +02:00
Oleg Nesterov	c32b4fce79	sched: task_tick_rt: Remove the obsolete ->signal != NULL check Remove the obsolete ->signal != NULL check in watchdog(). Since `ea6d290c` ->signal can't be NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230948.GA25911@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:46:56 +02:00
Oleg Nesterov	a44702e885	sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless __sched_setscheduler() takes lock_task_sighand() to access task->signal. This is not needed since `ea6d290c`, ->signal can't go away. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100610230944.GA25903@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:46:55 +02:00
Michael Neuling	b6b1229440	sched: Fix comments to make them DocBook happy Docbook fails in sched_fair.c due to comments added in the asymmetric packing patch series. This fixes these errors. No code changes. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <24737.1276135581@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:46:55 +02:00
Michael Neuling	694f5a1112	sched: Fix fix_small_capacity The CPU power test is the wrong way around in fix_small_capacity. This was due to a small changes made in the posted patch on lkml to what was was taken upstream. This patch fixes asymmetric packing for POWER7. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <12629.1276124617@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:46:54 +02:00
Ingo Molnar	4cb6948e53	Merge commit 'v2.6.35-rc3' into sched/core Merge reason: Update to the latest -rc.	2010-06-18 10:46:35 +02:00
Alex,Shi	3c93717cfa	sched: Fix over-scheduling bug Commit `e70971591` ("sched: Optimize unused cgroup configuration") introduced an imbalanced scheduling bug. If we do not use CGROUP, function update_h_load won't update h_load. When the system has a large number of tasks far more than logical CPU number, the incorrect cfs_rq[cpu]->h_load value will cause load_balance() to pull too many tasks to the local CPU from the busiest CPU. So the busiest CPU keeps going in a round robin. That will hurt performance. The issue was found originally by a scientific calculation workload that developed by Yanmin. With that commit, the workload performance drops about 40%. CPU before after 00 : 2 : 7 01 : 1 : 7 02 : 11 : 6 03 : 12 : 7 04 : 6 : 6 05 : 11 : 7 06 : 10 : 6 07 : 12 : 7 08 : 11 : 6 09 : 12 : 6 10 : 1 : 6 11 : 1 : 6 12 : 6 : 6 13 : 2 : 6 14 : 2 : 6 15 : 1 : 6 Reviewed-by: Yanmin zhang <yanmin.zhang@intel.com> Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1276754893.9452.5442.camel@debian> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-18 10:45:25 +02:00
Peter Zijlstra	3310d4d38f	nohz: Fix nohz ratelimit Chris Wedgwood reports that `39c0cbe` (sched: Rate-limit nohz) causes a serial console regression, unresponsiveness, and indeed it does. The reason is that the nohz code is skipped even when the tick was already stopped before the nohz_ratelimit(cpu) condition changed. Move the nohz_ratelimit() check to the other conditions which prevent long idle sleeps. Reported-by: Chris Wedgwood <cw@f00f.org> Tested-by: Brian Bloniarz <bmb@athenacr.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Greg KH <gregkh@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Jef Driesen <jefdriesen@telenet.be> LKML-Reference: <1276790557.27822.516.camel@twins> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-06-17 19:37:29 +02:00
Eric W. Biederman	5c1469de75	user_ns: Introduce user_nsmap_uid and user_ns_map_gid. Define what happens when a we view a uid from one user_namespace in another user_namepece. - If the user namespaces are the same no mapping is necessary. - For most cases of difference use overflowuid and overflowgid, the uid and gid currently used for 16bit apis when we have a 32bit uid that does fit in 16bits. Effectively the situation is the same, we want to return a uid or gid that is not assigned to any user. - For the case when we happen to be mapping the uid or gid of the creator of the target user namespace use uid 0 and gid as confusing that user with root is not a problem. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-16 14:55:34 -07:00
Jiri Kosina	f1bbbb6912	Merge branch 'master' into for-next	2010-06-16 18:08:13 +02:00
Uwe Kleine-König	732bee7af3	fix typos concerning "hierarchy" Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-06-16 18:03:14 +02:00
Paul E. McKenney	a25909a4d4	lockdep: Add an in_workqueue_context() lockdep-based test function Some recent uses of RCU make use of workqueues. In these uses, execution within the context of a specific workqueue takes the place of the usual RCU read-side primitives such as rcu_read_lock(), and flushing of workqueues takes the place of the usual RCU grace-period primitives. Checking for correct use of rcu_dereference() in such cases requires a test of whether the code is executing in the context of a particular workqueue. This commit adds an in_workqueue_context() function that provides this test. This new function is only defined when lockdep is enabled, which allows it to be used as the second argument of rcu_dereference_check(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-06-14 16:37:26 -07:00
Mathieu Desnoyers	551d55a944	tree/tiny rcu: Add debug RCU head objects Helps finding racy users of call_rcu(), which results in hangs because list entries are overwritten and/or skipped. Changelog since v4: - Bissectability is now OK - Now generate a WARN_ON_ONCE() for non-initialized rcu_head passed to call_rcu(). Statically initialized objects are detected with object_is_static(). - Rename rcu_head_init_on_stack to init_rcu_head_on_stack. - Remove init_rcu_head() completely. Changelog since v3: - Include comments from Lai Jiangshan This new patch version is based on the debugobjects with the newly introduced "active state" tracker. Non-initialized entries are all considered as "statically initialized". An activation fixup (triggered by call_rcu()) takes care of performing the debug object initialization without issuing any warning. Since we cannot increase the size of struct rcu_head, I don't see much room to put an identifier for statically initialized rcu_head structures. So for now, we have to live without "activation without explicit init" detection. But the main purpose of this debug option is to detect double-activations (double call_rcu() use of a rcu_head before the callback is executed), which is correctly addressed here. This also detects potential internal RCU callback corruption, which would cause the callbacks to be executed twice. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> CC: David S. Miller <davem@davemloft.net> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> CC: akpm@linux-foundation.org CC: mingo@elte.hu CC: laijs@cn.fujitsu.com CC: dipankar@in.ibm.com CC: josh@joshtriplett.org CC: dvhltc@us.ibm.com CC: niv@us.ibm.com CC: tglx@linutronix.de CC: peterz@infradead.org CC: rostedt@goodmis.org CC: Valdis.Kletnieks@vt.edu CC: dhowells@redhat.com CC: eric.dumazet@gmail.com CC: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>	2010-06-14 16:37:26 -07:00
Tejun Heo	53c5f5ba42	Merge branch 'sched-wq' of ../wq into cmwq-base	2010-06-13 18:19:48 +02:00
Len Brown	42de5532f4	Merge branch 'bugzilla-13931-sleep-nvs' into release Conflicts: drivers/acpi/sleep.c Signed-off-by: Len Brown <len.brown@intel.com>	2010-06-12 01:15:40 -04:00
Steven Rostedt	a8fb260805	perf/tracing: Fix regression of perf losing kprobe events With the addition of the code to shrink the kernel tracepoint infrastructure, we lost kprobes being traced by perf. The reason is that I tested if the "tp_event->class->perf_probe" existed before enabling it. This prevents "ftrace only" events (like the function trace events) from being enabled by perf. Unfortunately, kprobe events do not use perf_probe. This causes kprobes to be missed by perf. To fix this, we add the test to see if "tp_event->class->reg" exists as well as perf_probe. Normal trace events have only "perf_probe" but no "reg" function, and kprobes and syscalls have the "reg" but no "perf_probe". The ftrace unique events do not have either, so this is a valid test. If a kprobe or syscall is not to be probed by perf, the "reg" function is called anyway, and will return a failure and prevent perf from probing it. Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-10 20:56:54 -04:00
Linus Torvalds	85ca7886f5	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix null pointer deref with SEND_SIG_FORCED perf: Fix signed comparison in perf_adjust_period() powerpc/oprofile: fix potential buffer overrun in op_model_cell.c perf symbols: Set the DSO long name when using symbol_conf.vmlinux_name	2010-06-10 09:30:09 -07:00
Matthew Garrett	dd4c4f17d7	suspend: Move NVS save/restore code to generic suspend functionality Saving platform non-volatile state may be required for suspend to RAM as well as hibernation. Move it to more generic code. Signed-off-by: Matthew Garrett <mjg@redhat.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Tested-by: Maxim Levitsky <maximlevitsky@gmail.com> Signed-off-by: Len Brown <len.brown@intel.com>	2010-06-10 11:02:34 -04:00
Ingo Molnar	c726b61c6a	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-06-09 18:55:57 +02:00
Li Zefan	039ca4e74a	tracing: Remove kmemtrace ftrace plugin We have been resisting new ftrace plugins and removing existing ones, and kmemtrace has been superseded by kmem trace events and perf-kmem, so we remove it. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> [ remove kmemtrace from the makefile, handle slob too ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-06-09 17:31:22 +02:00
Thomas Gleixner	4673247562	genirq: Deal with desc->set_type() changing desc->chip The set_type() function can change the chip implementation when the trigger mode changes. That might result in using an non-initialized irq chip when called from __setup_irq() or when called via set_irq_type() on an already enabled irq. The set_irq_type() function should not be called on an enabled irq, but because we forgot to put a check into it, we have a bunch of users which grew the habit of doing that and it never blew up as the function is serialized via desc->lock against all users of desc->chip and they never hit the non-initialized irq chip issue. The easy fix for the __setup_irq() issue would be to move the irq_chip_set_defaults(desc->chip) call after the trigger setting to make sure that a chip change is covered. But as we have already users, which do the type setting after request_irq(), the safe fix for now is to call irq_chip_set_defaults() from __irq_set_trigger() when desc->set_type() changed the irq chip. It needs a deeper analysis whether we should refuse to change the chip on an already enabled irq, but that'd be a large scale change to fix all the existing users. So that's neither stable nor 2.6.35 material. Reported-by: Esben Haabendal <eha@doredevelopment.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: linuxppc-dev <linuxppc-dev@ozlabs.org> Cc: stable@kernel.org	2010-06-09 17:05:08 +02:00
Peter Zijlstra	e78505958c	perf: Convert perf_event to local_t Since now all modification to event->count (and ->prev_count and ->period_left) are local to a cpu, change then to local64_t so we avoid the LOCK'ed ops. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:37 +02:00
Peter Zijlstra	a6e6dea68c	perf: Add perf_event::child_count Only child counters adding back their values into the parent counter are responsible for cross-cpu updates to event->count. So if we pull that out into a new child_count variable, we get an event->count that is only modified locally. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:37 +02:00
Peter Zijlstra	b5e58793c7	perf: Add perf_event_count() Create a helper function for those sites that want to read the event count. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:36 +02:00
Peter Zijlstra	d57e34fdd6	perf: Simplify the ring-buffer logic: make perf_buffer_alloc() do everything needed Currently there are perf_buffer_alloc() + perf_buffer_init() + some separate bits, fold it all into a single perf_buffer_alloc() and only leave the attachment to the event separate. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:35 +02:00
Peter Zijlstra	ca5135e6b4	perf: Rename perf_mmap_data to perf_buffer Rename to clarify code. s/perf_mmap_data/perf_buffer/g and selective s/data/buffer/g Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:35 +02:00
Peter Zijlstra	8d2cacbbb8	perf: Cleanup {start,commit,cancel}_txn details Clarify some of the transactional group scheduling API details and change it so that a successfull ->commit_txn also closes the transaction. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1274803086.5882.1752.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:34 +02:00
Eric B Munson	3af9e85928	perf: Add non-exec mmap() tracking Add the capacility to track data mmap()s. This can be used together with PERF_SAMPLE_ADDR for data profiling. Signed-off-by: Anton Blanchard <anton@samba.org> [Updated code for stable perf ABI] Signed-off-by: Eric B Munson <ebmunson@us.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1274193049-25997-1-git-send-email-ebmunson@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:34 +02:00
Peter Zijlstra	8ed92280be	perf, trace: Remove superfluous rcu_read_lock() __DO_TRACE() already calls the callbacks under rcu_read_lock_sched(), which is sufficient for our needs, avoid doing it again. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:33 +02:00
Peter Zijlstra	ecc55f84b2	perf, trace: Inline perf_swevent_put_recursion_context() Inline perf_swevent_put_recursion_context into perf_tp_event(), this shrinks the per trace template code footprint and saves a function call. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 11:12:33 +02:00
Michael Neuling	532cb4c401	sched: Add asymmetric group packing option for sibling domain Check to see if the group is packed in a sched doman. This is primarily intended to used at the sibling level. Some cores like POWER7 prefer to use lower numbered SMT threads. In the case of POWER7, it can move to lower SMT modes only when higher threads are idle. When in lower SMT modes, the threads will perform better since they share less core resources. Hence when we have idle threads, we want them to be the higher ones. This adds a hook into f_b_g() called check_asym_packing() to check the packing. This packing function is run on idle threads. It checks to see if the busiest CPU in this domain (core in the P7 case) has a higher CPU number than what where the packing function is being run on. If it is, calculate the imbalance and return the higher busier thread as the busiest group to f_b_g(). Here we are assuming a lower CPU number will be equivalent to a lower SMT thread number. It also creates a new SD_ASYM_PACKING flag to enable this feature at any scheduler domain level. It also creates an arch hook to enable this feature at the sibling level. The default function doesn't enable this feature. Based heavily on patch from Peter Zijlstra. Fixes from Srivatsa Vaddagiri. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20100608045702.2936CCC897@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 10:34:55 +02:00
Srivatsa Vaddagiri	9d5efe05eb	sched: Fix capacity calculations for SMT4 Handle cpu capacity being reported as 0 on cores with more number of hardware threads. For example on a Power7 core with 4 hardware threads, core power is 1177 and thus power of each hardware thread is 1177/4 = 294. This low power can lead to capacity for each hardware thread being calculated as 0, which leads to tasks bouncing within the core madly! Fix this by reporting capacity for hardware threads as 1, provided their power is not scaled down significantly because of frequency scaling or real-time tasks usage of cpu. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20100608045702.21D03CC895@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 10:34:54 +02:00
Venkatesh Pallipadi	83cd4fe27a	sched: Change nohz idle load balancing logic to push model In the new push model, all idle CPUs indeed go into nohz mode. There is still the concept of idle load balancer (performing the load balancing on behalf of all the idle cpu's in the system). Busy CPU kicks the nohz balancer when any of the nohz CPUs need idle load balancing. The kickee CPU does the idle load balancing on behalf of all idle CPUs instead of the normal idle balance. This addresses the below two problems with the current nohz ilb logic: * the idle load balancer continued to have periodic ticks during idle and wokeup frequently, even though it did not have any rebalancing to do on behalf of any of the idle CPUs. * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this periodic wakeup can result in a periodic additional interrupt on a CPU doing the timer broadcast. Also currently we are migrating the unpinned timers from an idle to the cpu doing idle load balancing (when all the cpus in the system are idle, there is no idle load balancing cpu and timers get added to the same idle cpu where the request was made. So the existing optimization works only on semi idle system). And In semi idle system, we no longer have periodic ticks on the idle load balancer CPU. Using that cpu will add more delays to the timers than intended (as that cpu's timer base may not be uptodate wrt jiffies etc). This was causing mysterious slowdowns during boot etc. For now, in the semi idle case, use the nearest busy cpu for migrating timers from an idle cpu. This is good for power-savings anyway. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 10:34:52 +02:00
Venkatesh Pallipadi	fdf3e95d39	sched: Avoid side-effect of tickless idle on update_cpu_load tickless idle has a negative side effect on update_cpu_load(), which in turn can affect load balancing behavior. update_cpu_load() is supposed to be called every tick, to keep track of various load indicies. With tickless idle, there are no scheduler ticks called on the idle CPUs. Idle CPUs may still do load balancing (with idle_load_balance CPU) using the stale cpu_load. It will also cause problems when all CPUs go idle for a while and become active again. In this case loads would not degrade as expected. This is how rq->nr_load_updates change looks like under different conditions: <cpu_num> <nr_load_updates change> All CPUS idle for 10 seconds (HZ=1000) 0 1621 10 496 11 139 12 875 13 1672 14 12 15 21 1 1472 2 2426 3 1161 4 2108 5 1525 6 701 7 249 8 766 9 1967 One CPU busy rest idle for 10 seconds 0 10003 10 601 11 95 12 966 13 1597 14 114 15 98 1 3457 2 93 3 6679 4 1425 5 1479 6 595 7 193 8 633 9 1687 All CPUs busy for 10 seconds 0 10026 10 10026 11 10026 12 10026 13 10025 14 10025 15 10025 1 10026 2 10026 3 10026 4 10026 5 10026 6 10026 7 10026 8 10026 9 10026 That is update_cpu_load works properly only when all CPUs are busy. If all are idle, all the CPUs get way lower updates. And when few CPUs are busy and rest are idle, only busy and ilb CPU does proper updates and rest of the idle CPUs will do lower updates. The patch keeps track of when a last update was done and fixes up the load avg based on current time. On one of my test system SPECjbb with warehouse 1..numcpus, patch improves throughput numbers by ~1% (average of 6 runs). On another test system (with different domain hierarchy) there is no noticable change in perf. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <AANLkTilLtDWQsAUrIxJ6s04WTgmw9GuOODc5AOrYsaR5@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 10:34:51 +02:00
Oleg Nesterov	246d86b518	sched: Simplify the reacquire_kernel_lock() logic - Contrary to what `6d558c3a` says, there is no need to reload prev = rq->curr after the context switch. You always schedule back to where you came from, prev must be equal to current even if cpu/rq was changed. - This also means reacquire_kernel_lock() can use prev instead of current. - No need to reassign switch_count if reacquire_kernel_lock() reports need_resched(), we can just move the initial assignment down, under the "need_resched_nonpreemptible:" label. - Try to update the comment after context_switch(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100519125711.GA30199@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 10:34:50 +02:00
Peter Zijlstra	c676329abb	sched_clock: Add local_clock() API and improve documentation For people who otherwise get to write: cpu_clock(smp_processor_id()), there is now: local_clock(). Also, as per suggestion from Andrew, provide some documentation on the various clock interfaces, and minimize the unsigned long long vs u64 mess. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jens Axboe <jaxboe@fusionio.com> LKML-Reference: <1275052414.1645.52.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-09 10:34:49 +02:00
Américo Wang	30dbb20e68	tracing: Remove boot tracer The boot tracer is useless. It simply logs the initcalls but in fact these initcalls are also logged through printk while using the initcall_debug kernel parameter. Nobody seem to be using it so far. Then just remove it. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Chase Douglas <chase.douglas@canonical.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <20100526105753.GA5677@cr0.nay.redhat.com> [ remove the hooks in main.c, and the headers ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-06-08 23:31:28 +02:00
Frederic Weisbecker	b0f82b81fe	perf: Drop the skip argument from perf_arch_fetch_regs_caller Drop this argument now that we always want to rewind only to the state of the first caller. It means frame pointers are not necessary anymore to reliably get the source of an event. But this also means we need this helper to be a macro now, as an inline function is not an option since we need to know when to provide a default implentation. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: David Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-06-08 23:31:27 +02:00
Frederic Weisbecker	c9cf4dbb4d	x86: Unify dumpstack.h and stacktrace.h arch/x86/include/asm/stacktrace.h and arch/x86/kernel/dumpstack.h declare headers of objects that deal with the same topic. Actually most of the files that include stacktrace.h also include dumpstack.h Although dumpstack.h seems more reserved for internals of stack traces, those are quite often needed to define specialized stack trace operations. And perf event arch headers are going to need access to such low level operations anyway. So don't continue to bother with dumpstack.h as it's not anymore about isolated deep internals. v2: fix struct stack_frame definition conflict in sysprof Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Soeren Sandmann <sandmann@daimi.au.dk>	2010-06-08 23:29:52 +02:00
Ingo Molnar	95ae3c59fa	Merge branch 'sched-wq' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into sched/core	2010-06-08 23:20:59 +02:00
Tejun Heo	21aa9af03d	sched: add hooks for workqueue Concurrency managed workqueue needs to know when workers are going to sleep and waking up. Using these two hooks, cmwq keeps track of the current concurrency level and throttles execution of new works if it's too high and wakes up another worker from the sleep hook if it becomes too low. This patch introduces PF_WQ_WORKER to identify workqueue workers and adds the following two hooks. * wq_worker_waking_up(): called when a worker is woken up. * wq_worker_sleeping(): called when a worker is going to sleep and may return a pointer to a local task which should be woken up. The returned task is woken up using try_to_wake_up_local() which is simplified ttwu which is called under rq lock and can only wake up local tasks. Both hooks are currently defined as noop in kernel/workqueue_sched.h. Later cmwq implementation will replace them with proper implementation. These hooks are hard coded as they'll always be enabled. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Ingo Molnar <mingo@elte.hu>	2010-06-08 21:40:37 +02:00
Tejun Heo	9ed3811a6c	sched: refactor try_to_wake_up() Factor ttwu_activate() and ttwu_woken_up() out of try_to_wake_up(). The factoring out doesn't affect try_to_wake_up() much code-generation-wise. Depending on configuration options, it ends up generating the same object code as before or slightly different one due to different register assignment. This is to help future implementation of try_to_wake_up_local(). Mike Galbraith suggested rename to ttwu_post_activation() from ttwu_woken_up() and comment update in try_to_wake_up(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Ingo Molnar <mingo@elte.hu>	2010-06-08 21:40:36 +02:00
Tejun Heo	3a101d0548	sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining Currently, when a cpu goes down, cpu_active is cleared before CPU_DOWN_PREPARE starts and cpuset configuration is updated from a default priority cpu notifier. When a cpu is coming up, it's set before CPU_ONLINE but cpuset configuration again is updated from the same cpu notifier. For cpu notifiers, this presents an inconsistent state. Threads which a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be migrated to other cpus because the cpu is no more inactive. Fix it by updating cpu_active in the highest priority cpu notifier and cpuset configuration in the second highest when a cpu is coming up. Down path is updated similarly. This guarantees that all other cpu notifiers see consistent cpu_active and cpuset configuration. cpuset_track_online_cpus() notifier is converted to cpuset_update_active_cpus() which just updates the configuration and now called from cpuset_cpu_[in]active() notifiers registered from sched_init_smp(). If cpuset is disabled, cpuset_update_active_cpus() degenerates into partition_sched_domains() making separate notifier for !CONFIG_CPUSETS unnecessary. This problem is triggered by cmwq. During CPU_DOWN_PREPARE, hotplug callback creates a kthread and kthread_bind()s it to the target cpu, and the thread is expected to run on that cpu. * Ingo's test discovered __cpuinit/exit markups were incorrect. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Menage <menage@google.com>	2010-06-08 21:40:36 +02:00
Tejun Heo	50a323b730	sched: define and use CPU_PRI_* enums for cpu notifier priorities Instead of hardcoding priority 10 and 20 in sched and perf, collect them into CPU_PRI_* enums. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-06-08 21:40:36 +02:00
Ingo Molnar	6113e45f83	Merge branch 'tip/perf/core-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-06-08 19:34:40 +02:00
Peter Zijlstra	dc61b1d65e	sched: Fix PROVE_RCU vs cpu_cgroup PROVE_RCU has a few issues with the cpu_cgroup because the scheduler typically holds rq->lock around the css rcu derefs but the generic cgroup code doesn't (and can't) know about that lock. Provide means to add extra checks to the css dereference and use that in the scheduler to annotate its users. The addition of rq->lock to these checks is correct because the cgroup_subsys::attach() method takes the rq->lock for each task it moves, therefore by holding that lock, we ensure the task is pinned to the current cgroup and the RCU derefence is valid. That leaves one genuine race in __sched_setscheduler() where we used task_group() without holding any of the required locks and thus raced with the cgroup code. Solve this by moving the check under the appropriate lock. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-08 18:44:04 +02:00
Peter Zijlstra	f6ab91add6	perf: Fix signed comparison in perf_adjust_period() Frederic reported that frequency driven swevents didn't work properly and even caused a division-by-zero error. It turns out there are two bugs, the division-by-zero comes from a failure to deal with that in perf_calculate_period(). The other was more interesting and turned out to be a wrong comparison in perf_adjust_period(). The comparison was between an s64 and u64 and got implicitly converted to an unsigned comparison. The problem is that period_left is typically < 0, so it ended up being always true. Cure this by making the local period variables s64. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-08 18:43:00 +02:00
Linus Torvalds	90ec781973	Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: fix bne2 "gave up waiting for init of module libcrc32c" module: verify_export_symbols under the lock module: move find_module check to end module: make locking more fine-grained. module: Make module sysfs functions private. module: move sysfs exposure to end of load_module module: fix kdb's illicit use of struct module_use. module: Make the 'usage' lists be two-way	2010-06-04 21:09:48 -07:00
Rusty Russell	9bea7f2395	module: fix bne2 "gave up waiting for init of module libcrc32c" Problem: it's hard to avoid an init routine stumbling over a request_module these days. And it's not clear it's always a bad idea: for example, a module like kvm with dynamic dependencies on kvm-intel or kvm-amd would be neater if it could simply request_module the right one. In this particular case, it's libcrc32c: libcrc32c_mod_init crypto_alloc_shash crypto_alloc_tfm crypto_find_alg crypto_alg_mod_lookup crypto_larval_lookup request_module If another module is waiting inside resolve_symbol() for libcrc32c to finish initializing (ie. bne2 depends on libcrc32c) then it does so holding the module lock, and our request_module() can't make progress until that is released. Waiting inside resolve_symbol() without the lock isn't all that hard: we just need to pass the -EBUSY up the call chain so we can sleep where we don't hold the lock. Error reporting is a bit trickier: we need to copy the name of the unfinished module before releasing the lock. Other notes: 1) This also fixes a theoretical issue where a weak dependency would allow symbol version mismatches to be ignored. 2) We rename use_module to ref_module to make life easier for the only external user (the out-of-tree ksplice patches). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Tim Abbot <tabbott@ksplice.com> Tested-by: Brandon Philips <bphilips@suse.de>	2010-06-05 11:17:37 +09:30
Rusty Russell	be593f4ce4	module: verify_export_symbols under the lock It disabled preempt so it was "safe", but nothing stops another module slipping in before this module is added to the global list now we don't hold the lock the whole time. So we check this just after we check for duplicate modules, and just before we put the module in the global list. (find_symbol finds symbols in coming and going modules, too). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-06-05 11:17:37 +09:30
Linus Torvalds	3bafeb6247	module: move find_module check to end I think Rusty may have made the lock a bit _too_ finegrained there, and didn't add it to some places that needed it. It looks, for example, like PATCH 1/2 actually drops the lock in places where it's needed ("find_module()" is documented to need it, but now load_module() didn't hold it at all when it did the find_module()). Rather than adding a new "module_loading" list, I think we should be able to just use the existing "modules" list, and just fix up the locking a bit. In fact, maybe we could just move the "look up existing module" a bit later - optimistically assuming that the module doesn't exist, and then just undoing the work if it turns out that we were wrong, just before adding ourselves to the list. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-06-05 11:17:37 +09:30
Rusty Russell	75676500f8	module: make locking more fine-grained. Kay Sievers <kay.sievers@vrfy.org> reports that we still have some contention over module loading which is slowing boot. Linus also disliked a previous "drop lock and regrab" patch to fix the bne2 "gave up waiting for init of module libcrc32c" message. This is more ambitious: we only grab the lock where we need it. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Brandon Philips <brandon@ifup.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-05 11:17:36 +09:30
Rusty Russell	6407ebb271	module: Make module sysfs functions private. These were placed in the header in `ef665c1a06` to get the various SYSFS/MODULE config combintations to compile. That may have been necessary then, but it's not now. These functions are all local to module.c. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Randy Dunlap <randy.dunlap@oracle.com>	2010-06-05 11:17:36 +09:30
Rusty Russell	80a3d1bb41	module: move sysfs exposure to end of load_module This means a little extra work, but is more logical: we don't put anything in sysfs until we're about to put the module into the global list an parse its parameters. This also gives us a logical place to put duplicate module detection in the next patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-06-05 11:17:36 +09:30
Rusty Russell	c8e21ced08	module: fix kdb's illicit use of struct module_use. Linus changed the structure, and luckily this didn't compile any more. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Martin Hicks <mort@sgi.com>	2010-06-05 11:17:36 +09:30
Linus Torvalds	2c02dfe7fe	module: Make the 'usage' lists be two-way When adding a module that depends on another one, we used to create a one-way list of "modules_which_use_me", so that module unloading could see who needs a module. It's actually quite simple to make that list go both ways: so that we not only can see "who uses me", but also see a list of modules that are "used by me". In fact, we always wanted that list in "module_unload_free()": when we unload a module, we want to also release all the other modules that are used by that module. But because we didn't have that list, we used to first iterate over all modules, and then iterate over each "used by me" list of that module. By making the list two-way, we simplify module_unload_free(), and it allows for some trivial fixes later too. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (cleaned & rebased)	2010-06-05 11:17:35 +09:30
Linus Torvalds	d2dd328b7f	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits) block: make blk_init_free_list and elevator_init idempotent block: avoid unconditionally freeing previously allocated request_queue pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface pipe: change the privilege required for growing a pipe beyond system max pipe: adjust minimum pipe size to 1 page block: disable preemption before using sched_clock() cciss: call BUG() earlier Preparing 8.3.8rc2 drbd: Reduce verbosity drbd: use drbd specific ratelimit instead of global printk_ratelimit drbd: fix hang on local read errors while disconnected drbd: Removed the now empty w_io_error() function drbd: removed duplicated #includes drbd: improve usage of MSG_MORE drbd: need to set socket bufsize early to take effect drbd: improve network latency, TCP_QUICKACK drbd: Revert "drbd: Create new current UUID as late as possible" brd: support discard Revert "writeback: fix WB_SYNC_NONE writeback from umount" Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync" ...	2010-06-04 15:37:44 -07:00
Akinobu Mita	9e506f7adc	kernel/: fix BUG_ON checks for cpu notifier callbacks direct call The commit `80b5184cc5` ("kernel/: convert cpu notifier to return encapsulate errno value") changed the return value of cpu notifier callbacks. Those callbacks don't return NOTIFY_BAD on failures anymore. But there are a few callbacks which are called directly at init time and checking the return value. I forgot to change BUG_ON checking by the direct callers in the commit. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-04 15:21:45 -07:00
Greg Thelen	94b3dd0f7b	cgroups: alloc_css_id() increments hierarchy depth Child groups should have a greater depth than their parents. Prior to this change, the parent would incorrectly report zero memory usage for child cgroups when use_hierarchy is enabled. test script: mount -t cgroup none /cgroups -o memory cd /cgroups mkdir cg1 echo 1 > cg1/memory.use_hierarchy mkdir cg1/cg11 echo $$ > cg1/cg11/tasks dd if=/dev/zero of=/tmp/foo bs=1M count=1 echo echo CHILD grep cache cg1/cg11/memory.stat echo echo PARENT grep cache cg1/memory.stat echo $$ > tasks rmdir cg1/cg11 cg1 cd / umount /cgroups Using `fae9c79`, a recent patch that changed alloc_css_id() depth computation, the parent incorrectly reports zero usage: root@ubuntu:~# ./test 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.0151844 s, 69.1 MB/s CHILD cache 1048576 total_cache 1048576 PARENT cache 0 total_cache 0 With this patch, the parent correctly includes child usage: root@ubuntu:~# ./test 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.0136827 s, 76.6 MB/s CHILD cache 1052672 total_cache 1052672 PARENT cache 0 total_cache 1052672 Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Paul Menage <menage@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: <stable@kernel.org> [2.6.34.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-04 15:21:45 -07:00
Oleg Nesterov	485d527686	sys_personality: change sys_personality() to accept "unsigned int" instead of u_long task_struct->pesonality is "unsigned int", but sys_personality() paths use "unsigned long pesonality". This means that every assignment or comparison is not right. In particular, if this argument does not fit into "unsigned int" __set_personality() changes the caller's personality and then sys_personality() returns -EINVAL. Turn this argument into "unsigned int" and avoid overflows. Obviously, this is the user-visible change, we just ignore the upper bits. But this can't break the sane application. There is another thing which can confuse the poorly written applications. User-space thinks that this syscall returns int, not long. This means that the returned value can be negative and look like the error code. But note that libc won't be confused and thus errno won't be set, and with this patch the user-space can never get -1 unless sys_personality() really fails. And, most importantly, the negative RET != -1 is only possible if that app previously called personality(RET). Pointed-out-by: Wenming Zhang <wezhang@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-04 15:21:45 -07:00
Steven Rostedt	5168ae50a6	tracing: Remove ftrace_preempt_disable/enable The ftrace_preempt_disable/enable functions were to address a recursive race caused by the function tracer. The function tracer traces all functions which makes it easily susceptible to recursion. One area was preempt_enable(). This would call the scheduler and the schedulre would call the function tracer and loop. (So was it thought). The ftrace_preempt_disable/enable was made to protect against recursion inside the scheduler by storing the NEED_RESCHED flag. If it was set before the ftrace_preempt_disable() it would not call schedule on ftrace_preempt_enable(), thinking that if it was set before then it would have already scheduled unless it was already in the scheduler. This worked fine except in the case of SMP, where another task would set the NEED_RESCHED flag for a task on another CPU, and then kick off an IPI to trigger it. This could cause the NEED_RESCHED to be saved at ftrace_preempt_disable() but the IPI to arrive in the the preempt disabled section. The ftrace_preempt_enable() would not call the scheduler because the flag was already set before entring the section. This bug would cause a missed preemption check and cause lower latencies. Investigating further, I found that the recusion caused by the function tracer was not due to schedule(), but due to preempt_schedule(). Now that preempt_schedule is completely annotated with notrace, the recusion no longer is an issue. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-03 19:32:38 -04:00
Steven Rostedt	d1f74e20b5	tracing/sched: Make preempt_schedule() notrace The function tracer code uses ftrace_preempt_disable() to disable preemption instead of normal preempt_disable(). But there's a slight race condition that may cause it to lose a preemption check. This was made to keep the function tracer from recursing on itself by disabling preemption then having the enable call the function tracer again, causing infinite recursion. The bug was assumed to happen if the call was just in schedule, but this is incorrect. The bug is caused by preempt_schedule() which is called by preempt_enable(). The calling of preempt_enable() when NEED_RESCHED was set would call preempt_schedule() which would call the function tracer again. By making the preempt_schedule() and add_preempt_count() notrace then this will prevent the inifinite recursion. This is because the add_preempt_count() would stop the preempt_enable() in the function tracer from calling preempt_schedule() again. The sub_preempt_count() is also made notrace just to keep it symmetric. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-06-03 19:09:41 -04:00
Linus Torvalds	39d112100e	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, trace: Fix sched_switch() prev_state argument sched: Fix wake_affine() vs RT tasks sched: Make sure timers have migrated before killing the migration_thread	2010-06-03 15:47:51 -07:00
Linus Torvalds	f150dba6d4	Merge branch 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix crash in swevents perf buildid-list: Fix --with-hits event processing perf scripts python: Give field dict to unhandled callback perf hist: fix objdump output parsing perf-record: Check correct pid when forking perf: Do the comm inheritance per thread in event__process_task perf: Use event__process_task from perf sched perf: Process comm events by tid blktrace: Fix new kernel-doc warnings perf_events: Fix unincremented buffer base on partial copy perf_events: Fix event scheduling issues introduced by transactional API perf_events, trace: Fix perf_trace_destroy(), mutex went missing perf_events, trace: Fix probe unregister race perf_events: Fix races in group composition perf_events: Fix races and clean up perf_event and perf_mmap_data interaction	2010-06-03 15:45:26 -07:00
Peter Zijlstra	c6df8d5ab8	perf: Fix crash in swevents Frederic reported that because swevents handling doesn't disable IRQs anymore, we can get a recursion of perf_adjust_period(), once from overflow handling and once from the tick. If both call ->disable, we get a double hlist_del_rcu() and trigger a LIST_POISON2 dereference. Since we don't actually need to stop/start a swevent to re-programm the hardware (lack of hardware to program), simply nop out these callbacks for the swevent pmu. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1275557609.27810.35218.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-03 17:03:08 +02:00
Jens Axboe	ff9da691c0	pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface This changes the interface to be based on bytes instead. The API matches that of F_SETPIPE_SZ in that it rounds up the passed in size so that the resulting page array is a power-of-2 in size. The proc file is renamed to /proc/sys/fs/pipe-max-size to reflect this change. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-03 14:54:39 +02:00
Dan Carpenter	749d811f10	padata: add parenthesis in MAX_SEQ_NR macro MAX_SEQ_NR is used in padata_alloc_pd() like this: pd->max_seq_nr = (MAX_SEQ_NR / num_cpus) * num_cpus - 1; It needs parenthesis or the divide by num_cpus takes precedence over the subtraction. Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-06-03 20:19:28 +10:00
Daniel J Blueman	5c113fbeed	fix cpu_chain section mismatch... In commit `e9fb7631eb` ("cpu-hotplug: introduce cpu_notify(), __cpu_notify(), cpu_notify_nofail()") the new helper functions access cpu_chain. As a result, it shouldn't be marked __cpuinitdata (via section mismatch warning). Alternatively, the helper functions should be forced inline, or marked __ref or __cpuinit. In the meantime, this patch silences the warning the trivial way. Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-01 09:22:50 -07:00
Linus Torvalds	1f73897861	Merge branch 'for-35' of git://repo.or.cz/linux-kbuild * 'for-35' of git://repo.or.cz/linux-kbuild: (81 commits) kbuild: Revert part of `e8d400a` to resolve a conflict kbuild: Fix checking of scm-identifier variable gconfig: add support to show hidden options that have prompts menuconfig: add support to show hidden options which have prompts gconfig: remove show_debug option gconfig: remove dbg_print_ptype() and dbg_print_stype() kconfig: fix zconfdump() kconfig: some small fixes add random binaries to .gitignore kbuild: Include gen_initramfs_list.sh and the file list in the .d file kconfig: recalc symbol value before showing search results .gitignore: ignore *.lzo files headerdep: perlcritic warning scripts/Makefile.lib: Align the output of LZO kbuild: Generate modules.builtin in make modules_install Revert "kbuild: specify absolute paths for cscope" kbuild: Do not unnecessarily regenerate modules.builtin headers_install: use local file handles headers_check: fix perl warnings export_report: fix perl warnings ...	2010-06-01 08:55:52 -07:00
Peter Zijlstra	e51fd5e22e	sched: Fix wake_affine() vs RT tasks Mike reports that since `e9e9250b` (sched: Scale down cpu_power due to RT tasks), wake_affine() goes funny on RT tasks due to them still having a !0 weight and wake_affine() still subtracts that from the rq weight. Since nobody should be using se->weight for RT tasks, set the value to zero. Also, since we now use ->cpu_power to normalize rq weights to account for RT cpu usage, add that factor into the imbalance computation. Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1275316109.27810.22969.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-01 09:27:16 +02:00
Rusty Russell	293a7cfeed	module: fix reference to mod->percpu after freeing module. Rafael sees a sometimes crash at precpu_modfree from kernel/module.c; it only occurred with another (since-reverted) patch, but that patch simply changed timing to uncover this bug, it was otherwise unrelated. The comment about the mod being freed is self-explanatory, but neither Tejun nor I read it. This bug was introduced in `259354deaa`, after it had previously been fixed in `6e2b75740b`. How embarrassing. Reported-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Embarrassingly-Acked-by: Tejun Heo <tj@kernel.org> Cc: Masami Hiramatsu <mhiramat@redhat.com> Tested-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-31 11:21:32 -07:00
Randy Dunlap	546cf44a1b	blktrace: Fix new kernel-doc warnings Fix blktrace.c kernel-doc warnings: Warning(kernel/trace/blktrace.c:858): No description found for parameter 'ignore' Warning(kernel/trace/blktrace.c:890): No description found for parameter 'ignore' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100529114507.c466fc1e.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 09:58:20 +02:00
Frederic Weisbecker	74048f895f	perf_events: Fix unincremented buffer base on partial copy If a sample size crosses to the next page boundary, the copy will be made in more than one step. However we forget to advance the source offset for the next copy, leading to unexpected double copies that completely mess up the traces. This fixes various kinds of bad traces that have irrelevant data inside, as an example: geany-4979 [001] 5758.077775: sched_switch: prev_comm=! prev_pid=121 prev_prio=0 prev_state=S\|D\|Z\|X\|x ==> next_comm= next_pid=7497072 next_prio=0 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1274988898-5639-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:10 +02:00
Stephane Eranian	90151c35b1	perf_events: Fix event scheduling issues introduced by transactional API The transactional API patch between the generic and model-specific code introduced several important bugs with event scheduling, at least on X86. If you had pinned events, e.g., watchdog, and were over-committing the PMU, you would get bogus counts. The bug was showing up on Intel CPU because events would move around more often that on AMD. But the problem also existed on AMD, though harder to expose. The issues were: - group_sched_in() was missing a cancel_txn() in the error path - cpuc->n_added was not properly maintained, leading to missing actions in hw_perf_enable(), i.e., n_running being 0. You cannot update n_added until you know the transaction has succeeded. In case of failed transaction n_added was not adjusted back. - in case of failed transactions, event_sched_out() was called and eventually invoked x86_disable_event() to touch the HW reg. But with transactions, on X86, event_sched_in() does not touch HW registers, it simply collects events into a list. Thus, you could end up calling x86_disable_event() on a counter which did not correspond to the current event when idx != -1. The patch modifies the generic and X86 code to avoid all those problems. First, we keep track of the number of events added last. In case the transaction fails, we substract them from n_added. This approach is necessary (as opposed to delaying updates to n_added) because not all event updates use the transaction API, e.g., single events. Second, we encapsulate the event_sched_in() and event_sched_out() in group_sched_in() inside the transaction. That makes the operations symmetrical and you can also detect that you are inside a transaction and skip the HW reg access by checking cpuc->group_flag. With this patch, you can now overcommit the PMU even with pinned system-wide events present and still get valid counts. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1274796225.5882.1389.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:10 +02:00
Peter Zijlstra	2e97942fe5	perf_events, trace: Fix perf_trace_destroy(), mutex went missing Steve spotted I forgot to do the destroy under event_mutex. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1274451913.1674.1707.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:09 +02:00
Peter Zijlstra	3771f07711	perf_events, trace: Fix probe unregister race tracepoint_probe_unregister() does not synchronize against the probe callbacks, so do that explicitly. This properly serializes the callbacks and the free of the data used therein. Also, use this_cpu_ptr() where possible. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1274438476.1674.1702.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:09 +02:00
Peter Zijlstra	8a49542c05	perf_events: Fix races in group composition Group siblings don't pin each-other or the parent, so when we destroy events we must make sure to clean up all cross referencing pointers. In particular, for destruction of a group leader we must be able to find all its siblings and remove their reference to it. This means that detaching an event from its context must not detach it from the group, otherwise we can end up failing to clear all pointers. Solve this by clearly separating the attachment to a context and attachment to a group, and keep the group composed until we destroy the events. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:09 +02:00
Peter Zijlstra	ac9721f3f5	perf_events: Fix races and clean up perf_event and perf_mmap_data interaction In order to move toward separate buffer objects, rework the whole perf_mmap_data construct to be a more self-sufficient entity, one with its own lifetime rules. This greatly sanitizes the whole output redirection code, which was riddled with bugs and races. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:46:08 +02:00
Amit K. Arora	54e88fad22	sched: Make sure timers have migrated before killing the migration_thread Problem: In a stress test where some heavy tests were running along with regular CPU offlining and onlining, a hang was observed. The system seems to be hung at a point where migration_call() tries to kill the migration_thread of the dying CPU, which just got moved to the current CPU. This migration thread does not get a chance to run (and die) since rt_throttled is set to 1 on current, and it doesn't get cleared as the hrtimer which is supposed to reset the rt bandwidth (sched_rt_period_timer) is tied to the CPU which we just marked dead! Solution: This patch pushes the killing of migration thread to "CPU_POST_DEAD" event. By then all the timers (including sched_rt_period_timer) should have got migrated (along with other callbacks). Signed-off-by: Amit Arora <aarora@in.ibm.com> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20100525132346.GA14986@amitarora.in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-31 08:37:44 +02:00
Linus Torvalds	fa7eadab4b	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: mutex: Fix optimistic spinning vs. BKL	2010-05-30 12:35:15 -07:00
Linus Torvalds	bc7d352c5e	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tui: Fix last use_browser problem related to .perfconfig perf symbols: Add the build id cache to the vmlinux path perf tui: Reset use_browser if stdout is not a tty ring-buffer: Move zeroing out excess in page to ring buffer code ring-buffer: Reset "real_end" when page is filled	2010-05-30 12:35:01 -07:00
Rafael J. Wysocki	e9a5f426b8	CPU: Avoid using unititialized error variable in disable_nonboot_cpus() If there's only one CPU online when disable_nonboot_cpus() is called, the error variable will not be initialized and that may lead to erroneous behavior. Fix this issue by initializing error in disable_nonboot_cpus() as appropriate. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-30 09:06:00 -07:00
Linus Torvalds	35926ff5fb	Revert "cpusets: randomize node rotor used in cpuset_mem_spread_node()" This reverts commit `0ac0c0d0f8`, which caused cross-architecture build problems for all the wrong reasons. IA64 already added its own version of __node_random(), but the fact is, there is nothing architectural about the function, and the original commit was just badly done. Revert it, since no fix is forthcoming. Requested-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-30 09:00:03 -07:00
Linus Torvalds	b612a05537	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: clean up on forwarded aborted mds request ceph: fix leak of osd authorizer ceph: close out mds, osd connections before stopping auth ceph: make lease code DN specific fs/ceph: Use ERR_CAST ceph: renew auth tickets before they expire ceph: do not resend mon requests on auth ticket renewal ceph: removed duplicated #includes ceph: avoid possible null dereference ceph: make mds requests killable, not interruptible sched: add wait_for_completion_killable_timeout	2010-05-30 08:56:39 -07:00
Sage Weil	0aa12fb439	sched: add wait_for_completion_killable_timeout Add missing _killable_timeout variant for wait_for_completion that will return when a timeout expires or the task is killed. CC: Ingo Molnar <mingo@elte.hu> CC: Andreas Herrmann <andreas.herrmann3@amd.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:30 -07:00
Linus Torvalds	29d03fa12b	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix_timer: Fix error path in timer_create hrtimer: Avoid double seqlock timers: Move local variable into else section timers: Fix slack calculation really	2010-05-28 10:16:27 -07:00
Al Viro	ea635c64e0	Fix racy use of anon_inode_getfd() in perf_event.c once anon_inode_getfd() is called, you can't expect anything about struct file that descriptor points to - another thread might be doing whatever it likes with descriptor table at that point. Cc: stable <stable@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:03:08 -04:00
Linus Torvalds	c5617b200a	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (61 commits) tracing: Add __used annotation to event variable perf, trace: Fix !x86 build bug perf report: Support multiple events on the TUI perf annotate: Fix up usage of the build id cache x86/mmiotrace: Remove redundant instruction prefix checks perf annotate: Add TUI interface perf tui: Remove annotate from popup menu after failure perf report: Don't start the TUI if -D is used perf: Fix getline undeclared perf: Optimize perf_tp_event_match() perf: Remove more code from the fastpath perf: Optimize the !vmalloc backed buffer perf: Optimize perf_output_copy() perf: Fix wakeup storm for RO mmap()s perf-record: Share per-cpu buffers perf-record: Remove -M perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction perf tui: Allow disabling the TUI on a per command basis in ~/.perfconfig ...	2010-05-27 15:23:47 -07:00
Andrey Vagin	45e0fffc8a	posix_timer: Fix error path in timer_create Move CLOCK_DISPATCH(which_clock, timer_create, (new_timer)) after all posible EFAULT erros. *_timer_create may allocate/get resources. (for example posix_cpu_timer_create does get_task_struct) [ tglx: fold the remove crappy comment patch into this ] Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: <stable@kernel.org> Reviewed-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-27 22:38:15 +02:00
Linus Torvalds	00b9b0af58	Avoid warning when CPU hotplug isn't enabled Commit `e9fb7631eb` ("cpu-hotplug: introduce cpu_notify(), __cpu_notify(), cpu_notify_nofail()") also introduced this annoying warning: kernel/cpu.c:157: warning: 'cpu_notify_nofail' defined but not used when CONFIG_HOTPLUG_CPU wasn't set. So move that helper inside the #ifdef CONFIG_HOTPLUG_CPU region, and simplify it while at it. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 10:32:08 -07:00
Lee Schermerhorn	3dd6b5fb43	numa: in-kernel profiling: use cpu_to_mem() for per cpu allocations In kernel profiling requires that we be able to allocate "local" memory for each cpu. Use "cpu_to_mem()" instead of "cpu_to_node()" to support memoryless nodes. Depends on the "numa_mem_id()" patch. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:57 -07:00
Anton Blanchard	5b530fc183	panic: call console_verbose() in panic Most distros turn the console verbosity down and that means a backtrace after a panic never makes it to the console. I assume we haven't seen this because a panic is often preceeded by an oops which will have called console_verbose. There are however a lot of places we call panic directly, and they are broken. Use console_verbose like we do in the oops path to ensure a directly called panic will print a backtrace. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:53 -07:00
Oleg Nesterov	f106eee100	pids: fix fork_idle() to setup ->pids correctly copy_process(pid => &init_struct_pid) doesn't do attach_pid/etc. It shouldn't, but this means that the idle threads run with the wrong pids copied from the caller's task_struct. In x86 case the caller is either kernel_init() thread or keventd. In particular, this means that after the series of cpu_up/cpu_down an idle thread (which never exits) can run with .pid pointing to nowhere. Change fork_idle() to initialize idle->pids[] correctly. We only set .pid = &init_struct_pid but do not add .node to list, INIT_TASK() does the same for the boot-cpu idle thread (swapper). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Mathias Krause <Mathias.Krause@secunet.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:52 -07:00
Hedi Berriche	72680a191b	pids: increase pid_max based on num_possible_cpus On a system with a substantial number of processors, the early default pid_max of 32k will not be enough. A system with 1664 CPU's, there are 25163 processes started before the login prompt. It's estimated that with 2048 CPU's we will pass the 32k limit. With 4096, we'll reach that limit very early during the boot cycle, and processes would stall waiting for an available pid. This patch increases the early maximum number of pids available, and increases the minimum number of pids that can be set during runtime. [akpm@linux-foundation.org: fix warnings] Signed-off-by: Hedi Berriche <hedi@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Robin Holt <holt@sgi.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Machek <pavel@ucw.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <gregkh@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: John Stoffel <john@stoffel.org> Cc: Jack Steiner <steiner@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:51 -07:00
Lai Jiangshan	79a6cdeb7e	cpuhotplug: do not need cpu_hotplug_begin() when CONFIG_HOTPLUG_CPU=n Since when CONFIG_HOTPLUG_CPU=n, get_online_cpus() do nothing, so we don't need cpu_hotplug_begin() either. This patch moves cpu_hotplug_begin()/cpu_hotplug_done() into the code block of CONFIG_HOTPLUG_CPU=y. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:48 -07:00
Akinobu Mita	80b5184cc5	kernel/: convert cpu notifier to return encapsulate errno value By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for kernel/*.c Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:48 -07:00
Akinobu Mita	e6bde73b07	cpu-hotplug: return better errno on cpu hotplug failure Currently, onlining or offlining a CPU failure by one of the cpu notifiers error always cause -EINVAL error. (i.e. writing 0 or 1 to /sys/devices/system/cpu/cpuX/online gets EINVAL) To get better error reporting rather than always getting -EINVAL, This changes cpu_notify() to return -errno value with notifier_to_errno() and fix the callers. Now that cpu notifiers can return encapsulate errno value. Currently, all cpu hotplug notifiers return NOTIFY_OK, NOTIFY_BAD, or NOTIFY_DONE. So cpu_notify() can returns 0 or -EPERM with this change for now. (notifier_to_errno(NOTIFY_OK) == 0, notifier_to_errno(NOTIFY_DONE) == 0, notifier_to_errno(NOTIFY_BAD) == -EPERM) Forthcoming patches convert several cpu notifiers to return encapsulate errno value with notifier_from_errno(). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Akinobu Mita	e9fb7631eb	cpu-hotplug: introduce cpu_notify(), __cpu_notify(), cpu_notify_nofail() No functional change. These are just wrappers of raw_cpu_notifier_call_chain. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	b3ac022cb9	proc: turn signal_struct->count into "int nr_threads" No functional changes, just s/atomic_t count/int nr_threads/. With the recent changes this counter has a single user, get_nr_threads() And, none of its callers need the really accurate number of threads, not to mention each caller obviously races with fork/exit. It is only used to report this value to the user-space, except first_tid() uses it to avoid the unnecessary while_each_thread() loop in the unlikely case. It is a bit sad we need a word in struct signal_struct for this, perhaps we can change get_nr_threads() to approximate the number of threads using signal->live and kill ->nr_threads later. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	5089a97680	proc_sched_show_task(): use get_nr_threads() Trivial, use get_nr_threads() helper to read signal->count which we are going to change. Like other callers, proc_sched_show_task() doesn't need the exactly precise nr_threads. David said: : Note that get_nr_threads() isn't completely equivalent (it can return 0 : where proc_sched_show_task() will display a 1). But I don't think this : should be a problem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	6e1be45aa6	check_unshare_flags: kill the bogus CLONE_SIGHAND/sig->count check check_unshare_flags(CLONE_SIGHAND) adds CLONE_THREAD to *flags_ptr if the task is multithreaded to ensure unshare_thread() will fail. Not only this is a bit strange way to return the error, this is absolutely meaningless. If signal->count > 1 then sighand->count must be also > 1, and unshare_sighand() will fail anyway. In fact, all CLONE_THREAD/SIGHAND/VM checks inside sys_unshare() do not look right. Fortunately this code doesn't really work anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	97101eb41d	exit: move taskstats_tgid_free() from __exit_signal() to free_signal_struct() Move taskstats_tgid_free() from __exit_signal() to free_signal_struct(). This way signal->stats never points to nowhere and we can read ->stats lockless. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	a705be6b5e	kill the obsolete thread_group_cputime_free() helper Kill the empty thread_group_cputime_free() helper. It was needed to free the per-cpu data which we no longer have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	d40e48e02f	exit: __exit_signal: use thread_group_leader() consistently Cleanup: - Add the boolean, group_dead = thread_group_leader(), for clarity. - Do not test/set sig == NULL to detect the all-dead case, use this boolean. - Pass this boolen to __unhash_process() and use it instead of another thread_group_leader() call which needs ->group_leader. This can be considered as microoptimization, but hopefully this also allows us do do other cleanups later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	b7b8ff6373	signals: kill the awful task_rq_unlock_wait() hack Now that task->signal can't go away we can revert the horrible hack added by `ad474caca3` ("fix for account_group_exec_runtime(), make sure ->signal can't be freed under rq->lock"). And we can do more cleanups sched_stats.h/posix-cpu-timers.c later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	4ada856fb0	signals: clear signal->tty when the last thread exits When the last thread exits signal->tty is freed, but the pointer is not cleared and points to nowhere. This is OK. Nobody should use signal->tty lockless, and it is no longer possible to take ->siglock. However this looks wrong even if correct, and the nice OOPS is better than subtle and hard to find bugs. Change __exit_signal() to clear signal->tty under ->siglock. Note: __exit_signal() needs more cleanups. It should not check "sig != NULL" to detect the all-dead case and we have the same issues with signal->stats. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	ea6d290ca3	signals: make task_struct->signal immutable/refcountable We have a lot of problems with accessing task_struct->signal, it can "disappear" at any moment. Even current can't use its ->signal safely after exit_notify(). ->siglock helps, but it is not convenient, not always possible, and sometimes it makes sense to use task->signal even after this task has already dead. This patch adds the reference counter, sigcnt, into signal_struct. This reference is owned by task_struct and it is dropped in __put_task_struct(). Perhaps it makes sense to export get/put_signal_struct() later, but currently I don't see the immediate reason. Rename __cleanup_signal() to free_signal_struct() and unexport it. With the previous changes it does nothing except kmem_cache_free(). Change __exit_signal() to not clear/free ->signal, it will be freed when the last reference to any thread in the thread group goes away. Note: - when the last thead exits signal->tty can point to nowhere, see the next patch. - with or without this patch signal_struct->count should go away, or at least it should be "int nr_threads" for fs/proc. This will be addressed later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	4dec2a91fd	fork/exit: move tty_kref_put() outside of __cleanup_signal() tty_kref_put() has two callsites in copy_process() paths, 1. if copy_process() suceeds it is called before we copy signal->tty from parent 2. otherwise it is called from __cleanup_signal() under bad_fork_cleanup_signal: label In both cases tty_kref_put() is not right and unneeded because we don't have the balancing tty_kref_get(). Fortunately, this is harmless because this can only happen without CLONE_THREAD, and in this case signal->tty must be NULL. Remove tty_kref_put() from copy_process() and __cleanup_signal(), and change another caller of __cleanup_signal(), __exit_signal(), to call tty_kref_put() by hand. I hope this change makes sense by itself, but it is also needed to make ->signal refcountable. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alan Cox <alan@linux.intel.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	d30fda3551	posix-cpu-timers: avoid "task->signal != NULL" checks Preparation to make task->signal immutable, no functional changes. posix-cpu-timers.c checks task->signal != NULL to ensure this task is alive and didn't pass __exit_signal(). This is correct but we are going to change the lifetime rules for ->signal and never reset this pointer. Change the code to check ->sighand instead, it doesn't matter which pointer we check under tasklist, they both are cleared simultaneously. As Roland pointed out, some of these changes are not strictly needed and probably it makes sense to revert them later, when ->signal will be pinned to task_struct. But this patch tries to ensure the subsequent changes in fork/exit can't make any visible impact on posix cpu timers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	4a59994297	exit: avoid sig->count in __exit_signal() to detect the group-dead case Change __exit_signal() to check thread_group_leader() instead of atomic_dec_and_test(&sig->count). This must be equivalent, the group leader must be released only after all other threads have exited and passed __exit_signal(). Henceforth sig->count is not actually used, except in fs/proc for get_nr_threads/etc. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	d344193a05	exit: avoid sig->count in de_thread/__exit_signal synchronization de_thread() and __exit_signal() use signal_struct->count/notify_count for synchronization. We can simplify the code and use ->notify_count only. Instead of comparing these two counters, we can change de_thread() to set ->notify_count = nr_of_sub_threads, then change __exit_signal() to dec-and-test this counter and notify group_exit_task. Note that __exit_signal() checks "notify_count > 0" just for symmetry with exit_notify(), we could just check it is != 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	09faef11df	exit: change zap_other_threads() to count sub-threads Change zap_other_threads() to return the number of other sub-threads found on ->thread_group list. Other changes are cosmetic: - change the code to use while_each_thread() helper - remove the obsolete comment about SIGKILL/SIGSTOP Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	9c33916844	exit: exit_notify() can trust signal->notify_count < 0 signal_struct->count in its current form must die. - it has no reasons to be atomic_t - it looks like a reference counter, but it is not - otoh, we really need to make task->signal refcountable, just look at the extremely ugly task_rq_unlock_wait() called from __exit_signals(). - we should change the lifetime rules for task->signal, it should be pinned to task_struct. We have a lot of code which can be simplified after that. - it is not needed! while the code is correct, any usage of this counter is artificial, except fs/proc uses it correctly to show the number of threads. This series removes the usage of sig->count from exit pathes. This patch: Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify() can just check notify_count < 0 to ensure the execing sub-threads needs the notification from us. No need to do other checks, notify_count != 0 must always mean ->group_exit_task != NULL is waiting for us. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	04b1c384fb	call_usermodehelper: UMH_WAIT_EXEC ignores kernel_thread() failure UMH_WAIT_EXEC should report the error if kernel_thread() fails, like UMH_WAIT_PROC does. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	d47419cd96	call_usermodehelper: simplify/fix UMH_NO_WAIT case __call_usermodehelper(UMH_NO_WAIT) has 2 problems: - if kernel_thread() fails, call_usermodehelper_freeinfo() is not called. - for unknown reason UMH_NO_WAIT has UMH_WAIT_PROC logic, we spawn yet another thread which waits until the user mode application exits. Change the UMH_NO_WAIT code to use ____call_usermodehelper() instead of wait_for_helper(), and do call_usermodehelper_freeinfo() unconditionally. We can rely on CLONE_VFORK, do_fork(CLONE_VFORK) until the child exits or execs. With or without this patch UMH_NO_WAIT does not report the error if kernel_thread() fails, this is correct since the caller doesn't wait for result. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	7d64224217	wait_for_helper: SIGCHLD from user-space can lead to use-after-free 1. wait_for_helper() calls allow_signal(SIGCHLD) to ensure the child can't autoreap itself. However, this means that a spurious SIGCHILD from user-space can set TIF_SIGPENDING and: - kernel_thread() or sys_wait4() can fail due to signal_pending() - worse, wait4() can fail before ____call_usermodehelper() execs or exits. In this case the caller may kfree(subprocess_info) while the child still uses this memory. Change the code to use SIG_DFL instead of magic "(void __user *)2" set by allow_signal(). This means that SIGCHLD won't be delivered, yet the child won't autoreap itsefl. The problem is minor, only root can send a signal to this kthread. 2. If sys_wait4(&ret) fails it doesn't populate "ret", in this case wait_for_helper() reports a random value from uninitialized var. With this patch sys_wait4() should never fail, but still it makes sense to initialize ret = -ECHILD so that the caller can notice the problem. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	363da4022c	call_usermodehelper: no need to unblock signals ____call_usermodehelper() correctly calls flush_signal_handlers() to set SIG_DFL, but sigemptyset(->blocked) and recalc_sigpending() are not needed. This kthread was forked by workqueue thread, all signals must be unblocked and ignored, no pending signal is possible. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	c70a626d3e	umh: creds: kill subprocess_info->cred logic Now that nobody ever changes subprocess_info->cred we can kill this member and related code. ____call_usermodehelper() always runs in the context of freshly forked kernel thread, it has the proper ->cred copied from its parent kthread, keventd. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	685bfd2c48	umh: creds: convert call_usermodehelper_keys() to use subprocess_info->init() call_usermodehelper_keys() uses call_usermodehelper_setkeys() to change subprocess_info->cred in advance. Now that we have info->init() we can change this code to set tgcred->session_keyring in context of execing kernel thread. Note: since currently call_usermodehelper_keys() is never called with UMH_NO_WAIT, call_usermodehelper_keys()->key_get() and umh_keys_cleanup() are not really needed, we could rely on install_session_keyring_to_cred() which does key_get() on success. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Neil Horman	898b374af6	exec: replace call_usermodehelper_pipe with use of umh init function and resolve limit The first patch in this series introduced an init function to the call_usermodehelper api so that processes could be customized by caller. This patch takes advantage of that fact, by customizing the helper in do_coredump to create the pipe and set its core limit to one (for our recusrsion check). This lets us clean up the previous uglyness in the usermodehelper internals and factor call_usermodehelper out entirely. While I'm at it, we can also modify the helper setup to look for a core limit value of 1 rather than zero for our recursion check Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Neil Horman	a06a4dc3a0	kmod: add init function to usermodehelper About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe feature in the kernel works. We had reports of several races, including some reports of apps bypassing our recursion check so that a process that was forked as part of a core_pattern setup could infinitely crash and refork until the system crashed. We fixed those by improving our recursion checks. The new check basically refuses to fork a process if its core limit is zero, which works well. Unfortunately, I've been getting grief from maintainer of user space programs that are inserted as the forked process of core_pattern. They contend that in order for their programs (such as abrt and apport) to work, all the running processes in a system must have their core limits set to a non-zero value, to which I say 'yes'. I did this by design, and think thats the right way to do things. But I've been asked to ease this burden on user space enough times that I thought I would take a look at it. The first suggestion was to make the recursion check fail on a non-zero 'special' number, like one. That way the core collector process could set its core size ulimit to 1, and enable the kernel's recursion detection. This isn't a bad idea on the surface, but I don't like it since its opt-in, in that if a program like abrt or apport has a bug and fails to set such a core limit, we're left with a recursively crashing system again. So I've come up with this. What I've done is modify the call_usermodehelper api such that an extra parameter is added, a function pointer which will be called by the user helper task, after it forks, but before it exec's the required process. This will give the caller the opportunity to get a call back in the processes context, allowing it to do whatever it needs to to the process in the kernel prior to exec-ing the user space code. In the case of do_coredump, this callback is ues to set the core ulimit of the helper process to 1. This elimnates the opt-in problem that I had above, as it allows the ulimit for core sizes to be set to the value of 1, which is what the recursion check looks for in do_coredump. This patch: Create new function call_usermodehelper_fns() and allow it to assign both an init and cleanup function, as we'll as arbitrary data. The init function is called from the context of the forked process and allows for customization of the helper process prior to calling exec. Its return code gates the continuation of the process, or causes its exit. Also add an arbitrary data pointer to the subprocess_info struct allowing for data to be passed from the caller to the new process, and the subsequent cleanup process Also, use this patch to cleanup the cleanup function. It currently takes an argp and envp pointer for freeing, which is ugly. Lets instead just make the subprocess_info structure public, and pass that to the cleanup and init routines Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Oleg Nesterov	065add3941	signals: check_kill_permission(): don't check creds if same_thread_group() Andrew Tridgell reports that aio_read(SIGEV_SIGNAL) can fail if the notification from the helper thread races with setresuid(), see http://samba.org/~tridge/junkcode/aio_uid.c This happens because check_kill_permission() doesn't permit sending a signal to the task with the different cred->xids. But there is not any security reason to check ->cred's when the task sends a signal (private or group-wide) to its sub-thread. Whatever we do, any thread can bypass all security checks and send SIGKILL to all threads, or it can block a signal SIG and do kill(gettid(), SIG) to deliver this signal to another sub-thread. Not to mention that CLONE_THREAD implies CLONE_VM. Change check_kill_permission() to avoid the credentials check when the sender and the target are from the same thread group. Also, move "cred = current_cred()" down to avoid calling get_current() twice. Note: David Howells pointed out we could relax this even more, the CLONE_SIGHAND (without CLONE_THREAD) case probably does not need these checks too. Roland said: : The glibc (libpthread) that does set*id across threads has : been in use for a while (2.3.4?), probably in distro's using kernels as old : or older than any active -stable streams. In the race in question, this : kernel bug is breaking valid POSIX application expectations. Reported-by: Andrew Tridgell <tridge@samba.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Eric Paris <eparis@parisplace.org> Cc: Jakub Jelinek <jakub@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Roland McGrath <roland@redhat.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: <stable@kernel.org> [all kernel versions] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Oleg Nesterov	e0129ef91e	ptrace: PTRACE_GETFDPIC: fix the unsafe usage of child->mm Now that Mike Frysinger unified the FDPIC ptrace code, we can fix the unsafe usage of child->mm in ptrace_request(PTRACE_GETFDPIC). We have the reference to task_struct, and ptrace_check_attach() verified the tracee is stopped. But nothing can protect from SIGKILL after that, we must not assume child->mm != NULL. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Mike Frysinger <vapier.adi@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Greg Ungerer <gerg@snapgear.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Mike Frysinger	9c1a125921	ptrace: unify FDPIC implementations The Blackfin/FRV/SuperH guys all have the same exact FDPIC ptrace code in their arch handlers (since they were probably copied & pasted). Since these ptrace interfaces are an arch independent aspect of the FDPIC code, unify them in the common ptrace code so new FDPIC ports don't need to copy and paste this fundamental stuff yet again. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Jack Steiner	0ac0c0d0f8	cpusets: randomize node rotor used in cpuset_mem_spread_node() Some workloads that create a large number of small files tend to assign too many pages to node 0 (multi-node systems). Part of the reason is that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at node 0 for newly created tasks. This patch changes the rotor to be initialized to a random node number of the cpuset. [akpm@linux-foundation.org: fix layout] [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration] Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Jack Steiner	6adef3ebe5	cpusets: new round-robin rotor for SLAB allocations We have observed several workloads running on multi-node systems where memory is assigned unevenly across the nodes in the system. There are numerous reasons for this but one is the round-robin rotor in cpuset_mem_spread_node(). For example, a simple test that writes a multi-page file will allocate pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it allocates on odd nodes & skips even nodes). An example is shown below. The program "lfile" writes a file consisting of 10 pages. The program then mmaps the file & uses get_mempolicy(..., MPOL_F_NODE) to determine the nodes where the file pages were allocated. The output is shown below: # ./lfile allocated on nodes: 2 4 6 0 1 2 6 0 2 There is a single rotor that is used for allocating both file pages & slab pages. Writing the file allocates both a data page & a slab page (buffer_head). This advances the RR rotor 2 nodes for each page allocated. A quick confirmation seems to confirm this is the cause of the uneven allocation: # echo 0 >/dev/cpuset/memory_spread_slab # ./lfile allocated on nodes: 6 7 8 9 0 1 2 3 4 5 This patch introduces a second rotor that is used for slab allocations. Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Kirill A. Shutemov	907860ed38	cgroups: make cftype.unregister_event() void-returning Since we are unable to handle an error returned by cftype.unregister_event() properly, let's make the callback void-returning. mem_cgroup_unregister_event() has been rewritten to be a "never fail" function. On mem_cgroup_usage_register_event() we save old buffer for thresholds array and reuse it in mem_cgroup_usage_unregister_event() to avoid allocation. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Stanislaw Gruszka	174bd1994e	hrtimer: Avoid double seqlock hrtimer_get_softirq_time() has it's own xtime lock protection, so it's safe to use plain __current_kernel_time() and avoid the double seqlock loop. Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> LKML-Reference: <20100525214912.GA1934@r2bh72.net.upc.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-26 16:15:37 +02:00
Thomas Gleixner	2abfb9e1d4	timers: Move local variable into else section Fix nit-picking coding style detail. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-26 16:07:13 +02:00
Linus Torvalds	b1cdc4670b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits) drivers/net/usb/asix.c: Fix pointer cast. be2net: Bug fix to avoid disabling bottom half during firmware upgrade. proc_dointvec: write a single value hso: add support for new products Phonet: fix potential use-after-free in pep_sock_close() ath9k: remove VEOL support for ad-hoc ath9k: change beacon allocation to prefer the first beacon slot sock.h: fix kernel-doc warning cls_cgroup: Fix build error when built-in macvlan: do proper cleanup in macvlan_common_newlink() V2 be2net: Bug fix in init code in probe net/dccp: expansion of error code size ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep wireless: fix sta_info.h kernel-doc warnings wireless: fix mac80211.h kernel-doc warnings iwlwifi: testing the wrong variable in iwl_add_bssid_station() ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs() ath9k_htc: dereferencing before check in hif_usb_tx_cb() rt2x00: Fix rt2800usb TX descriptor writing. rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions. ...	2010-05-25 16:59:51 -07:00
Linus Torvalds	218ce73514	Revert "module: drop the lock while waiting for module to complete initialization." This reverts commit `480b02df3a`, since Rafael reports that it causes occasional kernel paging request faults in load_module(). Dropping the module lock and re-taking it deep in the call-chain is definitely not the right thing to do. That just turns the mutex from a lock into a "random non-locking data structure" that doesn't actually protect what it's supposed to protect. Requested-and-tested-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Brandon Philips <brandon@ifup.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 16:48:30 -07:00
J. R. Okajima	563b046710	proc_dointvec: write a single value The commit `00b7c3395a` "sysctl: refactor integer handling proc code" modified the behaviour of writing to /proc. Before the commit, write("1\n") to /proc/sys/kernel/printk succeeded. But now it returns EINVAL. This commit supports writing a single value to a multi-valued entry. Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp> Reviewed-and-tested-by: WANG Cong <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-25 16:10:14 -07:00
Thomas Gleixner	8e63d7795e	timers: Fix slack calculation really commit `f00e047ef` (timers: Fix slack calculation for expired timers) fixed the issue of slack on expired timers only partially. Linus noticed that jiffies is volatile so it is reloaded twice, which generates bad code. But its worse. This can defeat the time_after() check if jiffies are incremented between time_after() and the slack calculation. Fix it by reading jiffies into a local variable, which prevents the compiler from loading it twice. While at it make the > -1 check into >= 0 which is easier to read. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 21:07:48 +02:00
Steven Rostedt	2711ca237a	ring-buffer: Move zeroing out excess in page to ring buffer code Currently the trace splice code zeros out the excess bytes in the page before sending it off to userspace. This is to make sure userspace is not getting anything it should not be when reading the pages, because the excess data was never initialized to zero before writing (for perfomance reasons). But the splice code has no business in doing this work, it should be done by the ring buffer. With the latest changes for recording lost events, the splice code gets it wrong anyway. Move the zeroing out of excess bytes into the ring buffer code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-25 11:57:26 -04:00
Steven Rostedt	b3230c8b44	ring-buffer: Reset "real_end" when page is filled The code to store the "lost events" requires knowing the real end of the page. Since the 'commit' includes the padding at the end of a page a "real_end" variable was used to keep track of the end not including the padding. If events were lost, the reader can place the count of events in the padded area if there is enough room. The bug this patch fixes is that when we fill the page we do not reset the real_end variable, and if the writer had wrapped a few times, the real_end would be incorrect. This patch simply resets the real_end if the page was filled. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-25 11:57:24 -04:00
Andy Shevchenko	69e4469a39	sysctl: don't use own implementation of hex_to_bin() Remove own implementation of hex_to_bin(). Signed-off-by: Andy Shevchenko <ext-andriy.shevchenko@nokia.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:05 -07:00
Wenji Huang	7d52669b14	module: remove duplicate declaration of __ksymtab_gpl_future Minor cleanup on duplicate __{start/stop}__ksymtab_gpl_future. Signed-off-by: Wenji Huang <wenji.huang@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:04 -07:00
Haicheng Li	4eaf3f6439	mem-hotplug: fix potential race while building zonelist for new populated zone Add global mutex zonelists_mutex to fix the possible race: CPU0 CPU1 CPU2 (1) zone->present_pages += online_pages; (2) build_all_zonelists(); (3) alloc_page(); (4) free_page(); (5) build_all_zonelists(); (6) __build_all_zonelists(); (7) zone->pageset = alloc_percpu(); In step (3,4), zone->pageset still points to boot_pageset, so bad things may happen if 2+ nodes are in this state. Even if only 1 node is accessing the boot_pageset, (3) may still consume too much memory to fail the memory allocations in step (7). Besides, atomic operation ensures alloc_percpu() in step (7) will never fail since there is a new fresh memory block added in step(6). [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists] Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Andi Kleen <andi.kleen@intel.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:02 -07:00
Haicheng Li	1f522509c7	mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset For each new populated zone of hotadded node, need to update its pagesets with dynamically allocated per_cpu_pageset struct for all possible CPUs: 1) Detach zone->pageset from the shared boot_pageset at end of __build_all_zonelists(). 2) Use mutex to protect zone->pageset when it's still shared in onlined_pages() Otherwises, multiple zones of different nodes would share same boot strapping boot_pageset for same CPU, which will finally cause below kernel panic: ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:1239! invalid opcode: 0000 [#1] SMP ... Call Trace: [<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0 [<ffffffff81162e67>] alloc_pages_current+0x87/0xd0 [<ffffffff81128407>] __page_cache_alloc+0x67/0x70 [<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260 [<ffffffff81132751>] ra_submit+0x21/0x30 [<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0 [<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0 [<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670 [<ffffffff81266cfa>] nfs_file_read+0xca/0x130 [<ffffffff8117b20a>] do_sync_read+0xfa/0x140 [<ffffffff8117bf75>] vfs_read+0xb5/0x1a0 [<ffffffff8117c151>] sys_read+0x51/0x80 [<ffffffff8103c032>] system_call_fastpath+0x16/0x1b RIP [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900 RSP <ffff88000d1e78a8> ---[ end trace 4bda28328b9990db ] [akpm@linux-foundation.org: merge fix] Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Andi Kleen <andi.kleen@intel.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:01 -07:00
minskey guo	cf23422b9d	cpu/mem hotplug: enable CPUs online before local memory online Enable users to online CPUs even if the CPUs belongs to a numa node which doesn't have onlined local memory. The zonlists(pg_data_t.node_zonelists[]) of a numa node are created either in system boot/init period, or at the time of local memory online. For a numa node without onlined local memory, its zonelists are not initialized at present. As a result, any memory allocation operations executed by CPUs within this node will fail. In fact, an out-of-memory error is triggered when attempt to online CPUs before memory comes to online. This patch tries to create zonelists for such numa nodes, so that the memory allocation for this node can be fallback'ed to other nodes. [akpm@linux-foundation.org: remove unneeded export] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: minskey guo<chaohong.guo@intel.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:00 -07:00
Mel Gorman	5e77190580	mm: compaction: add a tunable that decides when memory should be compacted and when it should be reclaimed The kernel applies some heuristics when deciding if memory should be compacted or reclaimed to satisfy a high-order allocation. One of these is based on the fragmentation. If the index is below 500, memory will not be compacted. This choice is arbitrary and not based on data. To help optimise the system and set a sensible default for this value, this patch adds a sysctl extfrag_threshold. The kernel will only compact memory if the fragmentation index is above the extfrag_threshold. [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:59 -07:00
Mel Gorman	76ab0f530e	mm: compaction: add /proc trigger for memory compaction Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is written to the file, all zones are compacted. The expected user of such a trigger is a job scheduler that prepares the system before the target application runs. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:59 -07:00
Miao Xie	c0ff7453bb	cpuset,mm: fix no node to alloc memory when changing cpuset's mems Before applying this patch, cpuset updates task->mems_allowed and mempolicy by setting all new bits in the nodemask first, and clearing all old unallowed bits later. But in the way, the allocator may find that there is no node to alloc memory. The reason is that cpuset rebinds the task's mempolicy, it cleans the nodes which the allocater can alloc pages on, for example: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom This patch fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. [akpm@linux-foundation.org: fix spello] Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:57 -07:00
Miao Xie	708c1bbc9d	mempolicy: restructure rebinding-mempolicy functions Nick Piggin reported that the allocator may see an empty nodemask when changing cpuset's mems[1]. It happens only on the kernel that do not do atomic nodemask_t stores. (MAX_NUMNODES > BITS_PER_LONG) But I found that there is also a problem on the kernel that can do atomic nodemask_t stores. The problem is that the allocator can't find a node to alloc page when changing cpuset's mems though there is a lot of free memory. The reason is like this: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom I can use the attached program reproduce it by the following step: # mkdir /dev/cpuset # mount -t cpuset cpuset /dev/cpuset # mkdir /dev/cpuset/1 # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems # echo $$ > /dev/cpuset/1/tasks # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> & <nr_tasks> = max(nr_cpus - 1, 1) # killall -s SIGUSR1 cpuset_mem_hog # ./change_mems.sh several hours later, oom will happen though there is a lot of free memory. This patchset fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. This patch: In order to fix no node to alloc memory, when we want to update mempolicy and mems_allowed, we expand the set of nodes first (set all the newly nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the mempolicy's rebind functions may breaks the expanding. So we restructure the mempolicy's rebind functions and split the rebind work to two steps, just like the update of cpuset's mems: The 1st step: expand the set of the mempolicy's nodes. The 2nd step: shrink the set of the mempolicy's nodes. It is used when there is no real lock to protect the mempolicy in the read-side. Otherwise we can do rebind work at once. In order to implement it, we define enum mpol_rebind_step { MPOL_REBIND_ONCE, MPOL_REBIND_STEP1, MPOL_REBIND_STEP2, MPOL_REBIND_NSTEP, }; If the mempolicy needn't be updated by two steps, we can pass MPOL_REBIND_ONCE to the rebind functions. Or we can pass MPOL_REBIND_STEP1 to do the first step of the rebind work and pass MPOL_REBIND_STEP2 to do the second step work. Besides that, it maybe long time between these two step and we have to release the lock that protects mempolicy and mems_allowed. If we hold the lock once again, we must check whether the current mempolicy is under the rebinding (the first step has been done) or not, because the task may alloc a new mempolicy when we don't hold the lock. So we defined the following flag to identify it: #define MPOL_F_REBINDING (1 << 2) The new functions will be used in the next patch. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:57 -07:00
Peter Zijlstra	87f44bbc24	perf, trace: Fix !x86 build bug Patch `b7e2ecef92` (perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction) made the unfortunate mistake of assuming the world is x86 only, correct this. The problem was that perf_fetch_caller_regs() did local_save_flags() into regs->flags, and I re-used that to remove another local_save_flags(), forgetting !x86 doesn't have regs->flags. Do the reverse, remove the local_save_flags() from perf_fetch_caller_regs() and let the ftrace site do the local_save_flags() instead. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Mackerras <paulus@samba.org> Cc: acme@redhat.com Cc: efault@gmx.de Cc: fweisbec@gmail.com Cc: rostedt@goodmis.org LKML-Reference: <1274778175.5882.623.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-25 11:28:49 +02:00
Jeff Chua	f00e047efd	timers: Fix slack calculation for expired timers commit `3bbb9ec946` (timers: Introduce the concept of timer slack for legacy timers) does not take the case into account when the timer is already expired. This broke wireless drivers. The solution is not to apply slack to already expired timers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com>	2010-05-24 12:10:23 +02:00
Thomas Gleixner	bd45b7a385	timekeeping: Fix timezone update commit `64ce4c2f` (time: Clean up warp_clock()) breaks the timezone update in a very subtle way. To avoid the direct access to timekeeping internals it adds the timezone delta to the current time with timespec_add_safe(). This works nicely when the timezone delta is > 0. If timezone delta is < 0 then the wrap check in timespec_add_safe() triggers and timespec_add_safe() returns TIME_MAX and screws up timekeeping completely. The comment above timespec_add_safe() says: It's assumed that both values are valid (>= 0) Add the timezone seconds adjustment directly. Reported-by: Rafael J. Wysocki <rjw@sisk.pl> Tested-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-24 11:50:38 +02:00
Linus Torvalds	6109e2ce26	Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (36 commits) PCI: hotplug: pciehp: Removed check for hotplug of display devices PCI: read memory ranges out of Broadcom CNB20LE host bridge PCI: Allow manual resource allocation for PCI hotplug bridges x86/PCI: make ACPI MCFG reserved error messages ACPI specific PCI hotplug: Use kmemdup PM/PCI: Update PCI power management documentation PCI: output FW warning in pci_read/write_vpd PCI: fix typos pci_device_dis/enable to pci_dis/enable_device in comments PCI quirks: disable msi on AMD rs4xx internal gfx bridges PCI: Disable MSI for MCP55 on P5N32-E SLI x86/PCI: irq and pci_ids patch for additional Intel Cougar Point DeviceIDs PCI: aerdrv: trivial cleanup for aerdrv_core.c PCI: aerdrv: trivial cleanup for aerdrv.c PCI: aerdrv: introduce default_downstream_reset_link PCI: aerdrv: rework find_aer_service PCI: aerdrv: remove is_downstream PCI: aerdrv: remove magical ROOT_ERR_STATUS_MASKS PCI: aerdrv: redefine PCI_ERR_ROOT_*_SRC PCI: aerdrv: rework do_recovery PCI: aerdrv: rework get_e_source() ...	2010-05-21 18:58:52 -07:00
Linus Torvalds	0961d6581c	Merge git://git.infradead.org/iommu-2.6 * git://git.infradead.org/iommu-2.6: intel-iommu: Set a more specific taint flag for invalid BIOS DMAR tables intel-iommu: Combine the BIOS DMAR table warning messages panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I') panic: Allow warnings to set different taint flags intel-iommu: intel_iommu_map_range failed at very end of address space intel-iommu: errors with smaller iommu widths intel-iommu: Fix boot inside 64bit virtualbox with io-apic disabled intel-iommu: use physfn to search drhd for VF intel-iommu: Print out iommu seq_id intel-iommu: Don't complain that ACPI_DMAR_SCOPE_TYPE_IOAPIC is not supported intel-iommu: Avoid global flushes with caching mode. intel-iommu: Use correct domain ID when caching mode is enabled intel-iommu mistakenly uses offset_pfn when caching mode is enabled intel-iommu: use for_each_set_bit() intel-iommu: Fix section mismatch dmar_ir_support() uses dmar_tbl.	2010-05-21 17:25:01 -07:00
Linus Torvalds	a8251096b4	Merge branch 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: drop the lock while waiting for module to complete initialization. MODULE_DEVICE_TABLE(isapnp, ...) does nothing hisax_fcpcipnp: fix broken isapnp device table. isapnp: move definitions to mod_devicetable.h so file2alias can reach them.	2010-05-21 17:15:44 -07:00
Linus Torvalds	6e80e8ed5e	Merge branch 'for-2.6.35' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.35' of git://git.kernel.dk/linux-2.6-block: (86 commits) pipe: set lower and upper limit on max pages in the pipe page array pipe: add support for shrinking and growing pipes drbd: This is now equivalent to drbd release 8.3.8rc1 drbd: Do not free p_uuid early, this is done in the exit code of the receiver drbd: Null pointer deref fix to the large "multi bio rewrite" drbd: Fix: Do not detach, if a bio with a barrier fails drbd: Ensure to not trigger late-new-UUID creation multiple times drbd: Do not Oops when C_STANDALONE when uuid gets generated writeback: fix mixed up arguments to bdi_start_writeback() writeback: fix problem with !CONFIG_BLOCK compilation block: improve automatic native capacity unlocking block: use struct parsed_partitions *state universally in partition check code block,ide: simplify bdops->set_capacity() to ->unlock_native_capacity() block: restart partition scan after resizing a device buffer: make invalidate_bdev() drain all percpu LRU add caches block: remove all rcu head initializations writeback: fixups for !dirty_writeback_centisecs writeback: bdi_writeback_task() must set task state before calling schedule() writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync drivers/block/drbd: Use kzalloc ...	2010-05-21 15:25:33 -07:00
Randy Dunlap	0fc377bd64	sysctl: fix kernel-doc notation and typos Fix kernel-doc warnings, kernel-doc special characters, and typos in recent kernel/sysctl.c additions. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Amerigo Wang <amwang@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-21 15:23:12 -07:00
Linus Torvalds	2a8ba8f032	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (46 commits) random: simplify fips mode crypto: authenc - Fix cryptlen calculation crypto: talitos - add support for sha224 crypto: talitos - add hash algorithms crypto: talitos - second prepare step for adding ahash algorithms crypto: talitos - prepare for adding ahash algorithms crypto: n2 - Add Niagara2 crypto driver crypto: skcipher - Add ablkcipher_walk interfaces crypto: testmgr - Add testing for async hashing and update/final crypto: tcrypt - Add speed tests for async hashing crypto: scatterwalk - Fix scatterwalk_done() test crypto: hifn_795x - Rename ablkcipher_walk to hifn_cipher_walk padata: Use get_online_cpus/put_online_cpus in padata_free padata: Add some code comments padata: Flush the padata queues actively padata: Use a timer to handle remaining objects in the reorder queues crypto: shash - Remove usage of CRYPTO_MINALIGN crypto: mv_cesa - Use resource_size crypto: omap - OMAP macros corrected padata: Use get_online_cpus/put_online_cpus ... Fix up conflicts in arch/arm/mach-omap2/devices.c	2010-05-21 14:46:51 -07:00
Jens Axboe	ee9a3607fb	Merge branch 'master' into for-2.6.35 Conflicts: fs/ext3/fsync.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:27:26 +02:00
Jens Axboe	b492e95be0	pipe: set lower and upper limit on max pages in the pipe page array We need at least two to guarantee proper POSIX behaviour, so never allow a smaller limit than that. Also expose a /proc/sys/fs/pipe-max-pages sysctl file that allows root to define a sane upper limit. Make it default to 16 times the default size, which is 16 pages. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:12:52 +02:00
Jens Axboe	35f3d14dbb	pipe: add support for shrinking and growing pipes This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for growing and shrinking the size of a pipe and adjusts pipe.c and splice.c (and relay and network splice) usage to work with these larger (or smaller) pipes. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-21 21:12:40 +02:00
Linus Torvalds	ac3ee84c60	Merge branch 'dbg-early-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'dbg-early-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: echi-dbgp: Add kernel debugger support for the usb debug port earlyprintk,vga,kdb: Fix \b and \r for earlyprintk=vga with kdb kgdboc: Add ekgdboc for early use of the kernel debugger x86,early dr regs,kgdb: Allow kernel debugger early dr register access x86,kgdb: Implement early hardware breakpoint debugging x86, kgdb, init: Add early and late debug states x86, kgdb: early trap init for early debug	2010-05-21 11:10:41 -07:00
Linus Torvalds	90b9a32d8f	Merge branch 'kdb-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'kdb-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: (25 commits) kdb,debug_core: Allow the debug core to receive a panic notification MAINTAINERS: update kgdb, kdb, and debug_core info debug_core,kdb: Allow the debug core to process a recursive debug entry printk,kdb: capture printk() when in kdb shell kgdboc,kdb: Allow kdb to work on a non open console port kgdb: Add the ability to schedule a breakpoint via a tasklet mips,kgdb: kdb low level trap catch and stack trace powerpc,kgdb: Introduce low level trap catching x86,kgdb: Add low level debug hook kgdb: remove post_primary_code references kgdb,docs: Update the kgdb docs to include kdb kgdboc,keyboard: Keyboard driver for kdb with kgdb kgdb: gdb "monitor" -> kdb passthrough sparc,sunzilog: Add console polling support for sunzilog serial driver sh,sh-sci: Use NO_POLL_CHAR in the SCIF polled console code kgdb,8250,pl011: Return immediately from console poll kgdb: core changes to support kdb kdb: core for kgdb back end (2 of 2) kdb: core for kgdb back end (1 of 2) kgdb,blackfin: Add in kgdb_arch_set_pc for blackfin ...	2010-05-21 11:08:05 -07:00
Chris Wright	2c3c8bea60	sysfs: add struct file* to bin_attr callbacks This allows bin_attr->read,write,mmap callbacks to check file specific data (such as inode owner) as part of any privilege validation. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-21 09:37:31 -07:00
Peter Zijlstra	1704f47b50	lockdep: Add novalidate class for dev->mutex conversion The conversion of device->sem to device->mutex resulted in lockdep warnings. Create a novalidate class for now until the driver folks come up with separate classes. That way we have at least the basic mutex debugging coverage. Add a checkpatch error so the usage is reserved for device->mutex. [ tglx: checkpatch and compile fix for LOCKDEP=n ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-21 09:37:30 -07:00
NeilBrown	db1afffab0	kref: remove kref_set Of the three uses of kref_set in the kernel: One really should be kref_put as the code is letting go of a reference, Two really should be kref_init because the kref is being initialised. This suggests that making kref_set available encourages bad code. So fix the three uses and remove kref_set completely. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Mimi Zohar <zohar@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-21 09:37:29 -07:00
Steven Rostedt	ff5f149b6a	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-7 Conflicts: include/linux/ftrace_event.h include/trace/ftrace.h kernel/trace/trace_event_perf.c kernel/trace/trace_kprobe.c kernel/trace/trace_syscalls.c Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-21 11:49:57 -04:00
Linus Torvalds	33cf23b0a5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (182 commits) [SCSI] aacraid: add an ifdef'd device delete case instead of taking the device offline [SCSI] aacraid: prohibit access to array container space [SCSI] aacraid: add support for handling ATA pass-through commands. [SCSI] aacraid: expose physical devices for models with newer firmware [SCSI] aacraid: respond automatically to volumes added by config tool [SCSI] fcoe: fix fcoe module ref counting [SCSI] libfcoe: FIP Keep-Alive messages for VPorts are sent with incorrect port_id and wwn [SCSI] libfcoe: Fix incorrect MAC address clearing [SCSI] fcoe: fix a circular locking issue with rtnl and sysfs mutex [SCSI] libfc: Move the port_id into lport [SCSI] fcoe: move link speed checking into its own routine [SCSI] libfc: Remove extra pointer check [SCSI] libfc: Remove unused fc_get_host_port_type [SCSI] fcoe: fixes wrong error exit in fcoe_create [SCSI] libfc: set seq_id for incoming sequence [SCSI] qla2xxx: Updates to ISP82xx support. [SCSI] qla2xxx: Optionally disable target reset. [SCSI] qla2xxx: ensure flash operation and host reset via sg_reset are mutually exclusive [SCSI] qla2xxx: Silence bogus warning by gcc for wrap and did. [SCSI] qla2xxx: T10 DIF support added. ...	2010-05-21 07:19:18 -07:00
Peter Zijlstra	580d607cd6	perf: Optimize perf_tp_event_match() Since we know tracepoints come from kernel context, avoid conditionals that try and establish that very fact. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.904944001@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:38:00 +02:00
Peter Zijlstra	a94ffaaf55	perf: Remove more code from the fastpath Sanity checks cost instructions. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.852926930@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:59 +02:00
Peter Zijlstra	3cafa9fbb5	perf: Optimize the !vmalloc backed buffer Reduce code and data by using the knowledge that for !PERF_USE_VMALLOC data_order is always 0. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.795019386@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:59 +02:00
Peter Zijlstra	5d967a8be6	perf: Optimize perf_output_copy() Reduce the clutter in perf_output_copy() by keeping an interator in perf_output_handle. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.742809176@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:59 +02:00
Peter Zijlstra	adb8e118f2	perf: Fix wakeup storm for RO mmap()s RO mmap()s don't update the tail pointer, so comparing against it for determining the written data size doesn't really do any good. Keep track of when we last did a wakeup, and compare against that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.684479310@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:58 +02:00
Peter Zijlstra	0f139300c9	perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers Since we want to ensure buffers only have a single writer, we must avoid creating one with multiple. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.528215873@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:57 +02:00
Peter Zijlstra	1c024eca51	perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events Avoid the swevent hash-table by using per-tracepoint hlists. Also, avoid conditionals on the fast path by ordering with probe unregister so that we should never get on the callback path without the data being there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100521090710.473188012@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:56 +02:00
Peter Zijlstra	b7e2ecef92	perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction Improves performance. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1274259525.5605.10352.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-21 11:37:56 +02:00
Linus Torvalds	7a9b149212	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6: (229 commits) USB: remove unused usb_buffer_alloc and usb_buffer_free macros usb: musb: update gfp/slab.h includes USB: ftdi_sio: fix legacy SIO-device header USB: kl5usb105: reimplement using generic framework USB: kl5usb105: minor clean ups USB: kl5usb105: fix memory leak USB: io_ti: use kfifo to implement write buffering USB: io_ti: remove unsused private counter USB: ti_usb: use kfifo to implement write buffering USB: ir-usb: fix incorrect write-buffer length USB: aircable: fix incorrect write-buffer length USB: safe_serial: straighten out read processing USB: safe_serial: reimplement read using generic framework USB: safe_serial: reimplement write using generic framework usb-storage: always print quirks USB: usb-storage: trivial debug improvements USB: oti6858: use port write fifo USB: oti6858: use kfifo to implement write buffering USB: cypress_m8: use kfifo to implement write buffering USB: cypress_m8: remove unused drain define ... Fix up conflicts (due to usb_buffer_alloc/free renaming) in drivers/input/tablet/acecad.c drivers/input/tablet/kbtab.c drivers/input/tablet/wacom_sys.c drivers/media/video/gspca/gspca.c sound/usb/usbaudio.c	2010-05-20 21:26:12 -07:00
Linus Torvalds	f8965467f3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1674 commits) qlcnic: adding co maintainer ixgbe: add support for active DA cables ixgbe: dcb, do not tag tc_prio_control frames ixgbe: fix ixgbe_tx_is_paused logic ixgbe: always enable vlan strip/insert when DCB is enabled ixgbe: remove some redundant code in setting FCoE FIP filter ixgbe: fix wrong offset to fc_frame_header in ixgbe_fcoe_ddp ixgbe: fix header len when unsplit packet overflows to data buffer ipv6: Never schedule DAD timer on dead address ipv6: Use POSTDAD state ipv6: Use state_lock to protect ifa state ipv6: Replace inet6_ifaddr->dead with state cxgb4: notify upper drivers if the device is already up when they load cxgb4: keep interrupts available when the ports are brought down cxgb4: fix initial addition of MAC address cnic: Return SPQ credit to bnx2x after ring setup and shutdown. cnic: Convert cnic_local_flags to atomic ops. can: Fix SJA1000 command register writes on SMP systems bridge: fix build for CONFIG_SYSFS disabled ARCNET: Limit com20020 PCI ID matches for SOHARD cards ... Fix up various conflicts with pcmcia tree drivers/net/ {pcmcia/3c589_cs.c, wireless/orinoco/orinoco_cs.c and wireless/orinoco/spectrum_cs.c} and feature removal (Documentation/feature-removal-schedule.txt). Also fix a non-content conflict due to pm_qos_requirement getting renamed in the PM tree (now pm_qos_request) in net/mac80211/scan.c	2010-05-20 21:04:44 -07:00
Jason Wessel	0b4b3827db	x86, kgdb, init: Add early and late debug states The kernel debugger can operate well before mm_init(), but the x86 hardware breakpoint code which uses the perf api requires that the kernel allocators are initialized. This means the kernel debug core needs to provide an optional arch specific call back to allow the initialization functions to run after the kernel has been further initialized. The kdb shell already had a similar restriction with an early initialization and late initialization. The kdb_init() was moved into the debug core's version of the late init which is called dbg_late_init(); CC: kgdb-bugreport@lists.sourceforge.net Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:29 -05:00
Jason Wessel	4402c153cb	kdb,debug_core: Allow the debug core to receive a panic notification It is highly desirable to trap into kdb on panic. The debug core will attempt to register as the first in line for the panic notifier. CC: Ingo Molnar <mingo@elte.hu> CC: Andrew Morton <akpm@linux-foundation.org> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:28 -05:00
Jason Wessel	6d90634076	debug_core,kdb: Allow the debug core to process a recursive debug entry This allows kdb to debug a crash with in the kms code with a single level recursive re-entry. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:27 -05:00
Jason Wessel	d37d39ae3b	printk,kdb: capture printk() when in kdb shell Certain calls from the kdb shell will call out to printk(), and any of these calls should get vectored back to the kdb_printf() so that the kdb pager and processing can be used, as well as to properly channel I/O to the polled I/O devices. CC: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Andrew Morton <akpm@linux-foundation.org>	2010-05-20 21:04:27 -05:00
Jason Wessel	efe2f29e32	kgdboc,kdb: Allow kdb to work on a non open console port If kdb is open on a serial port that is not actually a console make sure to call the poll routines to emit and receive characters. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Martin Hicks <mort@sgi.com>	2010-05-20 21:04:26 -05:00
Jason Wessel	1cee5e35f1	kgdb: Add the ability to schedule a breakpoint via a tasklet Some kgdb I/O modules require the ability to create a breakpoint tasklet, such as kgdboc and external modules such as kgdboe. The breakpoint tasklet is used as an asynchronous entry point into the debugger which will have a different function scope than the current execution path where it might not be safe to have an inline breakpoint. This is true of some of the kgdb I/O drivers which share code with kgdb and rest of the kernel users. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:26 -05:00
Jason Wessel	f503b5ae53	x86,kgdb: Add low level debug hook The only way the debugger can handle a trap in inside rcu_lock, notify_die, or atomic_notifier_call_chain without a triple fault is to have a low level "first opportunity handler" in the int3 exception handler. Generally this will be something the vast majority of folks will not need, but for those who need it, it is added as a kernel .config option called KGDB_LOW_LEVEL_TRAP. CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: H. Peter Anvin <hpa@zytor.com> CC: x86@kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:25 -05:00
Jason Wessel	98ec1878ca	kgdb: remove post_primary_code references Remove all the references to the kgdb_post_primary_code. This function serves no useful purpose because you can obtain the same information from the "struct kgdb_state *ks" from with in the debugger, if for some reason you want the data. Also remove the unintentional duplicate assignment for ks->ex_vector. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:25 -05:00
Jason Wessel	ada64e4c98	kgdboc,keyboard: Keyboard driver for kdb with kgdb This patch adds in the kdb PS/2 keyboard driver. This was mostly a direct port from the original kdb where I cleaned up the code against checkpatch.pl and added the glue to stitch it into kgdb. This patch also enables early kdb debug via kgdbwait and the keyboard. All the access to configure kdb using either a serial console or the keyboard is done via kgdboc. If you want to use only the keyboard and want to break in early you would add to your kernel command arguments: kgdboc=kbd kgdbwait If you wanted serial and or the keyboard access you could use: kgdboc=kbd,ttyS0 You can also configure kgdboc as a kernel module or at run time with the sysfs where you can activate and deactivate kgdb. Turn it on: echo kbd,ttyS0 > /sys/module/kgdboc/parameters/kgdboc Turn it off: echo "" > /sys/module/kgdboc/parameters/kgdboc Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>	2010-05-20 21:04:24 -05:00
Jason Wessel	a0de055cf6	kgdb: gdb "monitor" -> kdb passthrough One of the driving forces behind integrating another front end (kdb) to the debug core is to allow front end commands to be accessible via gdb's monitor command. It is true that you could write gdb macros to get certain data, but you may want to just use gdb to access the commands that are available in the kdb front end. This patch implements the Rcmd gdb stub packet. In gdb you access this with the "monitor" command. For instance you could type "monitor help", "monitor lsmod" or "monitor ps A" etc... There is no error checking or command restrictions on what you can and cannot access at this point. Doing something like trying to set breakpoints with the monitor command is going to cause nothing but problems. Perhaps in the future only the commands that are actually known to work with the gdb monitor command will be available. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:24 -05:00
Jason Wessel	f5316b4aea	kgdb,8250,pl011: Return immediately from console poll The design of the kdb shell requires that every device that can provide input to kdb have a polling routine that exits immediately if there is no character available. This is required in order to get the page scrolling mechanism working. Changing the kernel debugger I/O API to require all polling character routines to exit immediately if there is no data allows the kernel debugger to process multiple input channels. NO_POLL_CHAR will be the return code to the polling routine when ever there is no character available. CC: linux-serial@vger.kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:22 -05:00
Jason Wessel	dcc7871128	kgdb: core changes to support kdb These are the minimum changes to the kgdb core in order to enable an API to connect a new front end (kdb) to the debug core. This patch introduces the dbg_kdb_mode variable controls where the user level I/O is routed. It will be routed to the gdbstub (kgdb) or to the kdb front end which is a simple shell available over the kgdboc connection. You can switch back and forth between kdb or the gdb stub mode of operation dynamically. From gdb stub mode you can blindly type "$3#33", or from the kdb mode you can enter "kgdb" to switch to the gdb stub. The logic in the debug core depends on kdb to look for the typical gdb connection sequences and return immediately with KGDB_PASS_EVENT if a gdb serial command sequence is detected. That should allow a reasonably seamless transition between kdb -> gdb without leaving the kernel exception state. The two gdb serial queries that kdb is responsible for detecting are the "?" and "qSupported" packets. CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Martin Hicks <mort@sgi.com>	2010-05-20 21:04:21 -05:00
Jason Wessel	67fc4e0cb9	kdb: core for kgdb back end (2 of 2) This patch contains the hooks and instrumentation into kernel which live outside the kernel/debug directory, which the kdb core will call to run commands like lsmod, dmesg, bt etc... CC: linux-arch@vger.kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Martin Hicks <mort@sgi.com>	2010-05-20 21:04:21 -05:00
Jason Wessel	5d5314d679	kdb: core for kgdb back end (1 of 2) This patch contains only the kdb core. Because the change set was large, it was split. The next patch in the series includes the instrumentation into the core kernel which are mainly helper functions for kdb. This work is directly derived from kdb v4.4 found at: ftp://oss.sgi.com/projects/kdb/download/v4.4/ The kdb internals have been re-organized to make them mostly platform independent and to connect everything to the debug core which is used by gdbstub (which has long been known as kgdb). The original version of kdb was 58,000 lines worth of changes to support x86. From that implementation only the kdb shell, and basic commands for memory access, runcontrol, lsmod, and dmesg where carried forward. This is a generic implementation which aims to cover all the current architectures using the kgdb core: ppc, arm, x86, mips, sparc, sh and blackfin. More archictectures can be added by implementing the architecture specific kgdb functions. [mort@sgi.com: Compile fix with hugepages enabled] [mort@sgi.com: Clean breakpoint code renaming kdba_ -> kdb_] [mort@sgi.com: fix new line after printing registers] [mort@sgi.com: Remove the concept of global vs. local breakpoints] [mort@sgi.com: Rework kdb_si_swapinfo to use more generic name] [mort@sgi.com: fix the information dump macros, remove 'arch' from the names] [sfr@canb.auug.org.au: include fixup to include linux/slab.h] CC: linux-arch@vger.kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Martin Hicks <mort@sgi.com>	2010-05-20 21:04:20 -05:00
Jason Wessel	53197fc495	Separate the gdbstub from the debug core Split the former kernel/kgdb.c into debug_core.c which contains the kernel debugger exception logic and to the gdbstub.c which contains the logic for allowing gdb to talk to the debug core. This also created a private include file called debug_core.h which contains all the definitions to glue the debug_core to any other debugger connections. CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:19 -05:00
Jason Wessel	c433820971	Move kernel/kgdb.c to kernel/debug/debug_core.c Move kgdb.c in preparation to separate the gdbstub from the debug core and exception handling. CC: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-05-20 21:04:18 -05:00
Michal Nazarewicz	22c43c81a5	wait_event_interruptible_locked() interface New wait_event_interruptible{,_exclusive}_locked{,_irq} macros added. They work just like versions without _locked* suffix but require the wait queue's lock to be held. Also __wake_up_locked() is now exported as to pair it with the above macros. The use case of this new facility is when one uses wait queue's lock to protect a data structure. This may be advantageous if the structure needs to be protected by a spinlock anyway. In particular, with additional spinlock the following code has to be used to wait for a condition: spin_lock(&data.lock); ... for (ret = 0; !ret && !(condition); ) { spin_unlock(&data.lock); ret = wait_event_interruptible(data.wqh, (condition)); spin_lock(&data.lock); } ... spin_unlock(&data.lock); This looks bizarre plus wait_event_interruptible() locks the wait queue's lock anyway so there is a unlock+lock sequence where it could be avoided. To avoid those problems and benefit from wait queue's lock, a code similar to the following should be used: /* Waiting / spin_lock(&data.wqh.lock); ... ret = wait_event_interruptible_locked(data.wqh, (condition)); ... spin_unlock(&data.wqh.lock); / Waiting exclusively / spin_lock(&data.whq.lock); ... ret = wait_event_interruptible_exclusive_locked(data.whq, (condition)); ... spin_unlock(&data.whq.lock); / Waking up / spin_lock(&data.wqh.lock); ... wake_up_locked(&data.wqh); ... spin_unlock(&data.wqh.lock); When spin_lock_irq() is used matching versions of macros need to be used (_locked_irq()). Signed-off-by: Michal Nazarewicz <m.nazarewicz@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Takashi Iwai <tiwai@suse.de> Cc: David Howells <dhowells@redhat.com> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-20 13:21:42 -07:00
Frederic Weisbecker	acd35a463c	perf: Fix forgotten preempt_enable by nested writers A writer that gets a reference to the buffer handle disables preemption. When we put that reference, we check if we are the outer most writer and if not, we simply return and defer the head update to the outer most writer. The problem here is that preemption is only reenabled by the outer most, that produces preemption count imbalance for every nested writer that exit. So just don't forget to always re-enable preemption when we put the buffer reference, whoever we are. Fixes lots of sleeping in atomic warnings, visible with lock events recording. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Robert Richter <robert.richter@amd.com>	2010-05-20 21:28:34 +02:00
Linus Torvalds	a0fe3cc5d3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits) Input: psmouse - small formatting changes to better follow coding style Input: synaptics - set dimensions as reported by firmware Input: elantech - relax signature checks Input: elantech - enforce common prefix on messages Input: wistron_btns - switch to using kmemdup() Input: usbtouchscreen - switch to using kmemdup() Input: do not force selecting i8042 on Moorestown Input: Documentation/sysrq.txt - update KEY_SYSRQ info Input: 88pm860x_onkey - remove invalid irq number assignment Input: i8042 - add a PNP entry to the aux device list Input: i8042 - add some extra PNP keyboard types Input: wm9712 - fix wm97xx_set_gpio() logic Input: add keypad driver for keys interfaced to TCA6416 Input: remove obsolete {corgi,spitz,tosa}kbd.c Input: kbtab - do not advertise unsupported events Input: kbtab - simplify kbtab_disconnect() Input: kbtab - fix incorrect size parameter in usb_buffer_free Input: acecad - don't advertise mouse events Input: acecad - fix some formatting issues Input: acecad - simplify usb_acecad_disconnect() ... Trivial conflict in Documentation/feature-removal-schedule.txt	2010-05-20 10:33:06 -07:00
Linus Torvalds	f39d01be4c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits) vlynq: make whole Kconfig-menu dependant on architecture add descriptive comment for TIF_MEMDIE task flag declaration. EEPROM: max6875: Header file cleanup EEPROM: 93cx6: Header file cleanup EEPROM: Header file cleanup agp: use NULL instead of 0 when pointer is needed rtc-v3020: make bitfield unsigned PCI: make bitfield unsigned jbd2: use NULL instead of 0 when pointer is needed cciss: fix shadows sparse warning doc: inode uses a mutex instead of a semaphore. uml: i386: Avoid redefinition of NR_syscalls fix "seperate" typos in comments cocbalt_lcdfb: correct sections doc: Change urls for sparse Powerpc: wii: Fix typo in comment i2o: cleanup some exit paths Documentation/: it's -> its where appropriate UML: Fix compiler warning due to missing task_struct declaration UML: add kernel.h include to signal.c ...	2010-05-20 09:20:59 -07:00
Linus Torvalds	46ee964509	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: PM QOS update fix Freezer / cgroup freezer: Update stale locking comments PM / platform_bus: Allow runtime PM by default i2c: Fix bus-level power management callbacks PM QOS update PM / Hibernate: Fix block_io.c printk warning PM / Hibernate: Group swap ops PM / Hibernate: Move the first_sector out of swsusp_write PM / Hibernate: Separate block_io PM / Hibernate: Snapshot cleanup FS / libfs: Implement simple_write_to_buffer PM / Hibernate: document open(/dev/snapshot) side effects PM / Runtime: Add sysfs debug files PM: Improve device power management document PM: Update device power management document PM: Allow runtime_suspend methods to call pm_schedule_suspend() PM: pm_wakeup - switch to using bool	2010-05-20 09:03:55 -07:00
Linus Torvalds	fa5312d9e8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: change cancel_work_sync() to clear work->data workqueue: warn about flush_scheduled_work()	2010-05-20 09:03:02 -07:00
Linus Torvalds	96b5b7f4f2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (61 commits) KEYS: Return more accurate error codes LSM: Add __init to fixup function. TOMOYO: Add pathname grouping support. ima: remove ACPI dependency TPM: ACPI/PNP dependency removal security/selinux/ss: Use kstrdup TOMOYO: Use stack memory for pending entry. Revert "ima: remove ACPI dependency" Revert "TPM: ACPI/PNP dependency removal" KEYS: Do preallocation for __key_link() TOMOYO: Use mutex_lock_interruptible. KEYS: Better handling of errors from construct_alloc_key() KEYS: keyring_serialise_link_sem is only needed for keyring->keyring links TOMOYO: Use GFP_NOFS rather than GFP_KERNEL. ima: remove ACPI dependency TPM: ACPI/PNP dependency removal selinux: generalize disabling of execmem for plt-in-heap archs LSM Audit: rename LSM_AUDIT_NO_AUDIT to LSM_AUDIT_DATA_NONE CRED: Holding a spinlock does not imply the holding of RCU read lock SMACK: Don't #include Ext2 headers ...	2010-05-20 08:55:50 -07:00
Ingo Molnar	dfacc4d6c9	Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-05-20 14:38:55 +02:00
Frederic Weisbecker	49f135ed02	perf: Comply with new rcu checks API The software events hlist doesn't fully comply with the new rcu checks api. We need to consider three different sides that access the hlist: - the hlist allocation/release side. This side happens when an events is created or released, accesses to the hlist are serialized under the cpuctx mutex. - the events insertion/removal in the hlist. This side is always serialized against the above one. The hlist is always present during such operations. This side happens when a software event is scheduled in/out. The serialization that ensures the software event is really attached to the context is made under the ctx->lock. - events triggering. This is the read side, it can happen concurrently with any update side. This patch deals with them one by one and anticipates with the separate rcu mem space patches in preparation. This patch fixes various annoying rcu warnings. Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org>	2010-05-20 10:40:37 +02:00
Linus Torvalds	164d44fd92	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Add clocksource_register_hz/khz interface posix-cpu-timers: Optimize run_posix_cpu_timers() time: Remove xtime_cache mqueue: Convert message queue timeout to use hrtimers hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME timers: Introduce the concept of timer slack for legacy timers ntp: Remove tickadj ntp: Make time_adjust static time: Add xtime, wall_to_monotonic to feature-removal-schedule timer: Try to survive timer callback preempt_count leak timer: Split out timer function call timer: Print function name for timer callbacks modifying preemption count time: Clean up warp_clock() cpu-timers: Avoid iterating over all threads in fastpath_timer_check() cpu-timers: Change SIGEV_NONE timer implementation cpu-timers: Return correct previous timer reload value cpu-timers: Cleanup arm_timer() cpu-timers: Simplify RLIMIT_CPU handling	2010-05-19 17:11:10 -07:00
Linus Torvalds	6e0b7b2c39	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Clear CPU mask in affinity_hint when none is provided genirq: Add CPU mask affinity hint genirq: Remove IRQF_DISABLED from core code genirq: Run irq handlers with interrupts disabled genirq: Introduce request_any_context_irq() genirq: Expose irq_desc->node in proc/irq Fixed up trivial conflicts in Documentation/feature-removal-schedule.txt	2010-05-19 17:09:40 -07:00
KOSAKI Motohiro	fa9dc265ac	cpumask: fix compat getaffinity Commit `a45185d2d` "cpumask: convert kernel/compat.c" broke libnuma, which abuses sched_getaffinity to find out NR_CPUS in order to parse /sys/devices/system/node/node*/cpumap. On NUMA systems with less than 32 possibly CPUs, the current compat_sys_sched_getaffinity now returns '4' instead of the actual NR_CPUS/8, which makes libnuma bail out when parsing the cpumap. The libnuma call sched_getaffinity(0, bitmap, 4096) at first. It mean the libnuma expect the return value of sched_getaffinity() is either len argument or NR_CPUS. But it doesn't expect to return nr_cpu_ids. Strictly speaking, userland requirement are 1) Glibc assume the return value mean the lengh of initialized of mask argument. E.g. if sched_getaffinity(1024) return 128, glibc make zero fill rest 896 byte. 2) Libnuma assume the return value can be used to guess NR_CPUS in kernel. It assume len-arg<NR_CPUS makes -EINVAL. But it try len=4096 at first and 4096 is always bigger than NR_CPUS. Then, if we remove strange min_length normalization, we never hit -EINVAL case. sched_getaffinity() already solved this issue. This patch adapts compat_sys_sched_getaffinity() to match the non-compat case. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Ken Werner <ken.werner@web.de> Cc: stable@kernel.org Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-19 11:48:18 -07:00
Linus Torvalds	ba0234ec35	Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6 * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (24 commits) [S390] drivers/s390/char: Use kmemdup [S390] drivers/s390/char: Use kstrdup [S390] debug: enable exception-trace debug facility [S390] s390_hypfs: Add new attributes [S390] qdio: remove API wrappers [S390] qdio: set correct bit in dsci [S390] qdio: dont convert timestamps to microseconds [S390] qdio: remove memset hack [S390] qdio: prevent starvation on PCI devices [S390] qdio: count number of qdio interrupts [S390] user space fault: report fault before calling do_exit [S390] topology: expose core identifier [S390] dasd: remove uid from devmap [S390] dasd: add dynamic pav toleration [S390] vdso: add missing vdso_install target [S390] vdso: remove redundant check for CONFIG_64BIT [S390] avoid default_llseek in s390 drivers [S390] vmcp: disallow modular build [S390] add breaking event address for user space [S390] virtualization aware cpu measurement ...	2010-05-19 11:35:30 -07:00
Dmitry Torokhov	8d0bc2b456	Merge commit 'v2.6.34' into next	2010-05-19 10:12:41 -07:00
Don Zickus	26e09c6eee	lockup_detector: Convert per_cpu to __get_cpu_var for readability Just a bunch of conversions as suggested by Frederic W. __get_cpu_var() provides preemption disabled checks. Plus it gives more readability as it makes it obvious we are dealing locally now with these vars. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <1274133966-18415-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-19 11:32:14 +02:00
Rusty Russell	480b02df3a	module: drop the lock while waiting for module to complete initialization. This fixes "gave up waiting for init of module libcrc32c." which happened at boot time due to multiple parallel module loads. The problem was a deadlock: we wait for a module to finish initializing, but we keep the module_lock mutex so it can't complete. In particular, this could reasonably happen if a module does a request_module() in its initialization routine. So we change use_module() to return an errno rather than a bool, and if it's -EBUSY we drop the lock and wait in the caller, then reaquire the lock. Reported-by: Brandon Philips <brandon@ifup.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Tested-by: Brandon Philips <brandon@ifup.org>	2010-05-19 17:33:39 +09:30
Ben Hutchings	92946bc72f	panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I') This taint flag will initially be used when warning about invalid ACPI DMAR tables. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2010-05-19 08:37:43 +01:00
Ben Hutchings	b2be05273a	panic: Allow warnings to set different taint flags WARN() is used in some places to report firmware or hardware bugs that are then worked-around. These bugs do not affect the stability of the kernel and should not set the flag for TAINT_WARN. To allow for this, add WARN_TAINT() and WARN_TAINT_ONCE() macros that take a taint number as argument. Architectures that implement warnings using trap instructions instead of calls to warn_slowpath_*() now implement __WARN_TAINT(taint) instead of __WARN(). Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Acked-by: Helge Deller <deller@gmx.de> Tested-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2010-05-19 08:36:48 +01:00
Tony Breeds	fd6be105b8	mutex: Fix optimistic spinning vs. BKL Currently, we can hit a nasty case with optimistic spinning on mutexes: CPU A tries to take a mutex, while holding the BKL CPU B tried to take the BLK while holding the mutex This looks like a AB-BA scenario but in practice, is allowed and happens due to the auto-release on schedule() nature of the BKL. In that case, the optimistic spinning code can get us into a situation where instead of going to sleep, A will spin waiting for B who is spinning waiting for A, and the only way out of that loop is the need_resched() test in mutex_spin_on_owner(). This patch fixes it by completely disabling spinning if we own the BKL. This adds one more detail to the extensive list of reasons why it's a bad idea for kernel code to be holding the BKL. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: <stable@kernel.org> LKML-Reference: <20100519054636.GC12389@ozlabs.org> [ added an unlikely() attribute to the branch ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-19 08:18:44 +02:00
David S. Miller	2ec8c6bb5d	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: include/linux/mod_devicetable.h scripts/mod/file2alias.c	2010-05-18 23:01:55 -07:00
Steffen Klassert	3789ae7dcd	padata: Use get_online_cpus/put_online_cpus in padata_free Add get_online_cpus/put_online_cpus to ensure that no cpu goes offline during the flushing of the padata percpu queues. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-19 13:45:35 +10:00
Steffen Klassert	0198ffd135	padata: Add some code comments Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-19 13:44:27 +10:00
Steffen Klassert	2b73b07ab8	padata: Flush the padata queues actively yield was used to wait until all references of the internal control structure in use are dropped before it is freed. This patch implements padata_flush_queues which actively flushes the padata percpu queues in this case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-19 13:43:46 +10:00
Steffen Klassert	d46a5ac7a7	padata: Use a timer to handle remaining objects in the reorder queues padata_get_next needs to check whether the next object that need serialization must be parallel processed by the local cpu. This check was wrong implemented and returned always true, so the try_again loop in padata_reorder was never taken. This can lead to object leaks in some rare cases due to a race that appears with the trylock in padata_reorder. The try_again loop was not a good idea after all, because a cpu could take that loop frequently, so we handle this with a timer instead. This patch adds a timer to handle the race that appears with the trylock. If cpu1 queues an object to the reorder queue while cpu2 holds the pd->lock but left the while loop in padata_reorder already, cpu2 can't care for this object and cpu1 exits because it can't get the lock. Usually the next cpu that takes the lock cares for this object too. We need the timer just if this object was the last one that arrives to the reorder queues. The timer function sends it out in this case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-19 13:43:14 +10:00
Peter Zijlstra	6d1acfd5c6	perf: Optimize perf_output_*() by avoiding local_xchg() Since the x86 XCHG ins implies LOCK, avoid the use by using a sequence count instead. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:49 +02:00
Peter Zijlstra	fa5881514e	perf: Optimize the hotpath by converting the perf output buffer to local_t Since there is now only a single writer, we can use local_t instead and avoid all these pesky LOCK insn. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:49 +02:00
Peter Zijlstra	ef60777c9a	perf: Optimize the perf_output() path by removing IRQ-disables Since we can now assume there is only a single writer to each buffer, we can remove per-cpu lock thingy and use a simply nest-count to the same effect. This removes the need to disable IRQs. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:48 +02:00
Peter Zijlstra	c7920614ce	perf: Disallow mmap() on per-task inherited events Since we now have working per-task-per-cpu events for a while, disallow mmap() on per-task inherited events. Those things were a performance problem anyway, and doing away with it allows us to optimize the buffer somewhat by assuming there is only a single writer. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:48 +02:00
Peter Zijlstra	a19d35c11f	perf: Optimize buffer placement by allocating buffers NUMA aware Ensure cpu bound buffers live on the right NUMA node. Suggested-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1274114880.5605.5236.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:47 +02:00
Stephane Eranian	00d1d0b095	perf: Fix errors path in perf_output_begin() In case the sampling buffer has no "payload" pages, nr_pages is 0. The problem is that the error path in perf_output_begin() skips to a label which assumes perf_output_lock() has been issued which is not the case. That triggers a WARN_ON() in perf_output_unlock(). This patch fixes the problem by skipping perf_output_unlock() in case data->nr_pages is 0. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4bf13674.014fd80a.6c82.ffffb20c@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:47 +02:00
Peter Zijlstra	4f41c013f5	perf/ftrace: Optimize perf/tracepoint interaction for single events When we've got but a single event per tracepoint there is no reason to try and multiplex it so don't. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 18:35:46 +02:00
Linus Torvalds	752f114fb8	Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix "integer as NULL pointer" warning. tracing: Fix tracepoint.h DECLARE_TRACE() to allow more than one header tracing: Make the documentation clear on trace_event boot option ring-buffer: Wrap open-coded WARN_ONCE tracing: Convert nop macros to static inlines tracing: Fix sleep time function profiling tracing: Show sample std dev in function profiling tracing: Add documentation for trace commands mod, traceon/traceoff ring-buffer: Make benchmark handle missed events ring-buffer: Make non-consuming read less expensive with lots of cpus. tracing: Add graph output support for irqsoff tracer tracing: Have graph flags passed in to ouput functions tracing: Add ftrace events for graph tracer tracing: Dump either the oops's cpu source or all cpus buffers tracing: Fix uninitialized variable of tracing/trace output	2010-05-18 08:35:04 -07:00
Linus Torvalds	b8ae30ee26	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits) stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback() sched, wait: Use wrapper functions sched: Remove a stale comment ondemand: Make the iowait-is-busy time a sysfs tunable ondemand: Solve a big performance issue by counting IOWAIT time as busy sched: Intoduce get_cpu_iowait_time_us() sched: Eliminate the ts->idle_lastupdate field sched: Fold updating of the last_update_time_info into update_ts_time_stats() sched: Update the idle statistics in get_cpu_idle_time_us() sched: Introduce a function to update the idle statistics sched: Add a comment to get_cpu_idle_time_us() cpu_stop: add dummy implementation for UP sched: Remove rq argument to the tracepoints rcu: need barrier() in UP synchronize_sched_expedited() sched: correctly place paranioa memory barriers in synchronize_sched_expedited() sched: kill paranoia check in synchronize_sched_expedited() sched: replace migration_thread with cpu_stop stop_machine: reimplement using cpu_stop cpu_stop: implement stop_cpu[s]() sched: Fix select_idle_sibling() logic in select_task_rq_fair() ...	2010-05-18 08:27:54 -07:00
Linus Torvalds	4d7b4ac22f	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits) perf tools: Add mode to build without newt support perf symbols: symbol inconsistency message should be done only at verbose=1 perf tui: Add explicit -lslang option perf options: Type check all the remaining OPT_ variants perf options: Type check OPT_BOOLEAN and fix the offenders perf options: Check v type in OPT_U?INTEGER perf options: Introduce OPT_UINTEGER perf tui: Add workaround for slang < 2.1.4 perf record: Fix bug mismatch with -c option definition perf options: Introduce OPT_U64 perf tui: Add help window to show key associations perf tui: Make <- exit menus too perf newt: Add single key shortcuts for zoom into DSO and threads perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed perf newt: Fix the 'A'/'a' shortcut for annotate perf newt: Make <- exit the ui_browser x86, perf: P4 PMU - fix counters management logic perf newt: Make <- zoom out filters perf report: Report number of events, not samples perf hist: Clarify events_stats fields usage ... Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c	2010-05-18 08:19:03 -07:00
Linus Torvalds	3aaf51ace5	Merge branch 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits) oprofile/x86: make AMD IBS hotplug capable oprofile/x86: notify cpus only when daemon is running oprofile/x86: reordering some functions oprofile/x86: stop disabled counters in nmi handler oprofile/x86: protect cpu hotplug sections oprofile/x86: remove CONFIG_SMP macros oprofile/x86: fix uninitialized counter usage during cpu hotplug oprofile/x86: remove duplicate IBS capability check oprofile/x86: move IBS code oprofile/x86: return -EBUSY if counters are already reserved oprofile/x86: moving shutdown functions oprofile/x86: reserve counter msrs pairwise oprofile/x86: rework error handler in nmi_setup() oprofile: update file list in MAINTAINERS file oprofile: protect from not being in an IRQ context oprofile: remove double ring buffering ring-buffer: Add lost event count to end of sub buffer tracing: Show the lost events in the trace_pipe output ring-buffer: Add place holder recording of dropped events tracing: Fix compile error in module tracepoints when MODULE_UNLOAD not set ...	2010-05-18 08:18:07 -07:00
Linus Torvalds	f262af3d08	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits) rcu: remove all rcu head initializations, except on_stack initializations rcu head introduce rcu head init on stack Debugobjects transition check rcu: fix build bug in RCU_FAST_NO_HZ builds rcu: RCU_FAST_NO_HZ must check RCU dyntick state rcu: make SRCU usable in modules rcu: improve the RCU CPU-stall warning documentation rcu: reduce the number of spurious RCU_SOFTIRQ invocations rcu: permit discontiguous cpu_possible_mask CPU numbering rcu: improve RCU CPU stall-warning messages rcu: print boot-time console messages if RCU configs out of ordinary rcu: disable CPU stall warnings upon panic rcu: enable CPU_STALL_VERBOSE by default rcu: slim down rcutiny by removing rcu_scheduler_active and friends rcu: refactor RCU's context-switch handling rcu: rename rcutiny rcu_ctrlblk to rcu_sched_ctrlblk rcu: shrink rcutiny by making synchronize_rcu_bh() be inline rcu: fix now-bogus rcu_scheduler_active comments. rcu: Fix bogus CONFIG_PROVE_LOCKING in comments to reflect reality. rcu: ignore offline CPUs in last non-dyntick-idle CPU check ...	2010-05-18 08:17:58 -07:00
Linus Torvalds	1014cfe2fb	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Reduce stack_trace usage lockdep: No need to disable preemption in debug atomic ops lockdep: Actually _dec_ in debug_atomic_dec lockdep: Provide off case for redundant_hardirqs_on increment lockdep: Simplify debug atomic ops lockdep: Fix redundant_hardirqs_on incremented with irqs enabled lockstat: Make lockstat counting per cpu i8253: Convert i8253_lock to raw_spinlock	2010-05-18 08:17:35 -07:00
James Bottomley	95bb335c0e	[SCSI] Merge scsi-misc-2.6 into scsi-rc-fixes-2.6 Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-05-18 10:37:41 -04:00
Steven Rostedt	f0218b3e99	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-6 Conflicts: include/trace/ftrace.h kernel/trace/trace_kprobe.c Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-18 00:35:23 -04:00
James Morris	539c99fd7f	Merge branch 'next' into for-linus	2010-05-18 08:57:00 +10:00
Ingo Molnar	9c6f7e43b4	stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback() This addresses the following compiler warning: kernel/stop_machine.c: In function 'cpu_stop_cpu_callback': kernel/stop_machine.c:297: warning: unused variable 'work' Cc: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <tip-3fc1f1e27a5b807791d72e5d992aa33b668a6626@git.kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-18 00:17:44 +02:00
Linus Torvalds	3d2c978e0c	Merge branch 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: ptrace: Cleanup useless header ptrace: kill BKL in ptrace syscall	2010-05-17 13:48:10 -07:00
Heiko Carstens	ab3c68ee5f	[S390] debug: enable exception-trace debug facility The exception-trace facility on x86 and other architectures prints traces to dmesg whenever a user space application crashes. s390 has such a feature since ages however it is called userprocess_debug and is enabled differently. This patch makes sure that whenever one of the two procfs files /proc/sys/kernel/userprocess_debug /proc/sys/debug/exception-trace is modified the contents of the second one changes as well. That way we keep backwards compatibilty but also support the same interface like other architectures do. Besides that the output of the traces is improved since it will now also contain the corresponding filename of the vma (when available) where the process caused a fault or trap. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2010-05-17 10:00:17 +02:00
Mark Gross	25f3a5a285	PM: PM QOS update fix This update handles a use case where pm_qos update requests need to silently fail if the update is being sent to a handle that is NULL. The problem was that the original pm_qos silently fails when a request update is passed to a parameter that has not been added to the list yet. This update restores that behavior. Signed-off-by: markgross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-17 00:21:03 +02:00
Octavian Purdila	9f977fb7ae	sysctl: add proc_do_large_bitmap The new function can be used to read/write large bitmaps via /proc. A comma separated range format is used for compact output and input (e.g. 1,3-4,10-10). Writing into the file will first reset the bitmap then update it based on the given input. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-15 23:28:39 -07:00
Amerigo Wang	00b7c3395a	sysctl: refactor integer handling proc code (Based on Octavian's work, and I modified a lot.) As we are about to add another integer handling proc function a little bit of cleanup is in order: add a few helper functions to improve code readability and decrease code duplication. In the process a bug is also fixed: if the user specifies a number with more then 20 digits it will be interpreted as two integers (e.g. 10000...13 will be interpreted as 100.... and 13). Behavior for EFAULT handling was changed as well. Previous to this patch, when an EFAULT error occurred in the middle of a write operation, although some of the elements were set, that was not acknowledged to the user (by shorting the write and returning the number of bytes accepted). EFAULT is now treated just like any other errors by acknowledging the amount of bytes accepted. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-15 23:28:38 -07:00
Don Zickus	cafcd80d21	lockup_detector: Cross arch compile fixes Combining the softlockup and hardlockup code causes watchdog.c to build even without the hardlockup detection support. So if an arch, that has the previous and the new nmi watchdog implementations cohabiting, wants to know if the generic one is in use, CONFIG_LOCKUP_DETECTOR is not a reliable check. We need to use CONFIG_HARDLOCKUP_DETECTOR instead. Fixes: kernel/built-in.o: In function `touch_nmi_watchdog': (.text+0x449bc): multiple definition of `touch_nmi_watchdog' arch/sparc/kernel/built-in.o:(.text+0x11b28): first defined here Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <20100514151121.GR15159@redhat.com> [ use CONFIG_HARDLOCKUP_DETECTOR instead of CONFIG_PERF_EVENTS_NMI] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-16 04:25:14 +02:00
Frederic Weisbecker	23637d477c	lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR This new config is deemed to simplify even more the lockup detector dependencies and can make it easier to bring a smooth sorting between archs that support the new generic lockup detector and those that still have their own, especially for those that are in the middle of this migration. Instead of checking whether we have CONFIG_LOCKUP_DETECTOR + CONFIG_PERF_EVENTS_NMI each time an arch wants to know if it needs to build its own lockup detector, take a shortcut with this new config. It is enabled only if the hardlockup detection part of the whole lockup detector is on. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com>	2010-05-16 01:57:42 +02:00
Hugh Dickins	16a2164bb0	profile: fix stats and data leakage If the kernel is large or the profiling step small, /proc/profile leaks data and readprofile shows silly stats, until readprofile -r has reset the buffer: clear the prof_buffer when it is vmalloc()ed. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-14 19:45:06 -07:00
Li Zefan	e1f7992e01	tracing: Fix function declarations if !CONFIG_STACKTRACE ftrace_trace_stack() and frace_trace_userstacke() take a struct ring_buffer argument, not struct trace_array. Commit e77405ad("tracing: pass around ring buffer instead of tracer") made this change. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BE77C14.5010806@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:33:24 -04:00
Steven Rostedt	553552ce17	tracing: Combine event filter_active and enable into single flags field The filter_active and enable both use an int (4 bytes each) to set a single flag. We can save 4 bytes per event by combining the two into a single integer. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4894944 1018052 861512 6774508 675eec vmlinux.id 4894871 1012292 861512 6768675 674823 vmlinux.flags This gives us another 5K in savings. The modification of both the enable and filter fields are done under the event_mutex, so it is still safe to combine the two. Note: Although Mathieu gave his Acked-by, he would like it documented that the reads of flags are not protected by the mutex. The way the code works, these reads will not break anything, but will have a residual effect. Since this behavior is the same even before this patch, describing this situation is left to another patch, as this patch does not change the behavior, but just brought it to Mathieu's attention. v2: Updated the event trace self test to for this change. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:33:22 -04:00
Steven Rostedt	32c0edaeaa	tracing: Remove duplicate id information in event structure Now that the trace_event structure is embedded in the ftrace_event_call structure, there is no need for the ftrace_event_call id field. The id field is the same as the trace_event type field. Removing the id and re-arranging the structure brings down the tracepoint footprint by another 5K. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig `4895024` `1023812` 861512 6780348 6775bc vmlinux.print 4894944 1018052 861512 6774508 675eec vmlinux.id Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:33:15 -04:00
Steven Rostedt	80decc70af	tracing: Move print functions into event class Currently, every event has its own trace_event structure. This is fine since the structure is needed anyway. But the print function structure (trace_event_functions) is now separate. Since the output of the trace event is done by the class (with the exception of events defined by DEFINE_EVENT_PRINT), it makes sense to have the class define the print functions that all events in the class can use. This makes a bigger deal with the syscall events since all syscall events use the same class. The savings here is another 30K. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900382 1048964 861512 6810858 67ecea vmlinux.init 4900446 1049028 861512 6810986 67ed6a vmlinux.preprint `4895024` `1023812` 861512 6780348 6775bc vmlinux.print To accomplish this, and to let the class know what event is being printed, the event structure is embedded in the ftrace_event_call structure. This should not be an issues since the event structure was created for each event anyway. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:20:34 -04:00
Steven Rostedt	a9a5776380	tracing: Allow events to share their print functions Multiple events may use the same method to print their data. Instead of having all events have a pointer to their print funtions, the trace_event structure now points to a trace_event_functions structure that will hold the way to print ouf the event. The event itself is now passed to the print function to let the print function know what kind of event it should print. This opens the door to consolidating the way several events print their output. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900382 1048964 861512 6810858 67ecea vmlinux.init 4900446 1049028 861512 6810986 67ed6a vmlinux.preprint This change slightly increases the size but is needed for the next change. v3: Fix the branch tracer events to handle this change. v2: Fix the new function graph tracer event calls to handle this change. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:20:32 -04:00
Steven Rostedt	0405ab80aa	tracing: Move raw_init from events to class The raw_init function pointer in the event is used to initialize various kinds of events. The type of initialization needed is usually classed to the kind of event it is. Two events with the same class will always have the same initialization function, so it makes sense to move this to the class structure. Perhaps even making a special system structure would work since the initialization is the same for all events within a system. But since there's no system structure (yet), this will just move it to the class. text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900375 1053380 861512 6815267 67fe23 vmlinux.fields 4900382 1048964 861512 6810858 67ecea vmlinux.init The text grew very slightly, but this is a constant growth that happened with the changing of the C files that call the init code. The bigger savings is the data which will be saved the more events share a class. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:20:30 -04:00
Steven Rostedt	2e33af0295	tracing: Move fields from event to class structure Move the defined fields from the event to the class structure. Since the fields of the event are defined by the class they belong to, it makes sense to have the class hold the information instead of the individual events. The events of the same class would just hold duplicate information. After this change the size of the kernel dropped another 3K: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4900252 1057412 861512 6819176 680d68 vmlinux.regs 4900375 1053380 861512 6815267 67fe23 vmlinux.fields Although the text increased, this was mainly due to the C files having to adapt to the change. This is a constant increase, where new tracepoints will not increase the Text. But the big drop is in the data size (as well as needed allocations to hold the fields). This will give even more savings as more tracepoints are created. Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS() with several DEFINE_EVENT()s, then the savings will be lost. But we are pushing developers to consolidate events with DEFINE_EVENT() so this should not be an issue. The kprobes define a unique class to every new event, but are dynamic so it should not be a issue. The syscalls however have a single class but the fields for the individual events are different. The syscalls use a metadata to define the fields. I moved the fields list from the event to the metadata and added a "get_fields()" function to the class. This function is used to find the fields. For normal events and kprobes, get_fields() just returns a pointer to the fields list_head in the class. For syscall events, it returns the fields list_head in the metadata for the event. v2: Fixed the syscall fields. The syscall metadata needs a list of fields for both enter and exit. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:20:23 -04:00
Steven Rostedt	2239291aeb	tracing: Remove per event trace registering This patch removes the register functions of TRACE_EVENT() to enable and disable tracepoints. The registering of a event is now down directly in the trace_events.c file. The tracepoint_probe_register() is now called directly. The prototypes are no longer type checked, but this should not be an issue since the tracepoints are created automatically by the macros. If a prototype is incorrect in the TRACE_EVENT() macro, then other macros will catch it. The trace_event_class structure now holds the probes to be called by the callbacks. This removes needing to have each event have a separate pointer for the probe. To handle kprobes and syscalls, since they register probes in a different manner, a "reg" field is added to the ftrace_event_class structure. If the "reg" field is assigned, then it will be called for enabling and disabling of the probe for either ftrace or perf. To let the reg function know what is happening, a new enum (trace_reg) is created that has the type of control that is needed. With this new rework, the 82 kernel events and 618 syscall events has their footprint dramatically lowered: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint 4900252 1057412 861512 6819176 680d68 vmlinux.regs The size went from 6863829 to 6819176, that's a total of 44K in savings. With tracepoints being continuously added, this is critical that the footprint becomes minimal. v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf specific structure in trace_events.c. v4: Fixed trace self tests to check probe because regfunc no longer exists. v3: Updated to handle void data in beginning of probe parameters. Also added the tracepoint: check_trace_callback_type_##call(). v2: Changed the callback probes to pass void and typecast the value within the function. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 14:19:14 -04:00
Steven Rostedt	38516ab59f	tracing: Let tracepoints have data passed to tracepoint callbacks This patch adds data to be passed to tracepoint callbacks. The created functions from DECLARE_TRACE() now need a mandatory data parameter. For example: DECLARE_TRACE(mytracepoint, int value, value) Will create the register function: int register_trace_mytracepoint((void()(void data, int value))probe, void data); As the first argument, all callbacks (probes) must take a (void data) parameter. So a callback for the above tracepoint will look like: void myprobe(void data, int value) { } The callback may choose to ignore the data parameter. This change allows callbacks to register a private data pointer along with the function probe. void mycallback(void data, int value); register_trace_mytracepoint(mycallback, mydata); Then the mycallback() will receive the "mydata" as the first parameter before the args. A more detailed example: DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); /* In the C file / DEFINE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status)); [...] trace_mytracepoint(status); / In a file registering this tracepoint / int my_callback(void data, int status) { struct my_struct my_data = data; [...] } [...] my_data = kmalloc(sizeof(my_data), GFP_KERNEL); init_my_data(my_data); register_trace_mytracepoint(my_callback, my_data); The same callback can also be registered to the same tracepoint as long as the data registered is different. Note, the data must also be used to unregister the callback: unregister_trace_mytracepoint(my_callback, my_data); Because of the data parameter, tracepoints declared this way can not have no args. That is: DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS()); will cause an error. If no arguments are needed, a new macro can be used instead: DECLARE_TRACE_NOARGS(mytracepoint); Since there are no arguments, the proto and args fields are left out. This is part of a series to make the tracepoint footprint smaller: text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class 4918492 1084612 861512 6864616 68bee8 vmlinux.tracepoint Again, this patch also increases the size of the kernel, but lays the ground work for decreasing it. v5: Fixed net/core/drop_monitor.c to handle these updates. v4: Moved the DECLARE_TRACE() DECLARE_TRACE_NOARGS out of the #ifdef CONFIG_TRACE_POINTS, since the two are the same in both cases. The __DECLARE_TRACE() is what changes. Thanks to Frederic Weisbecker for pointing this out. v3: Made all register_ functions require data to be passed and all callbacks to take a void * parameter as its first argument. This makes the calling functions comply with C standards. Also added more comments to the modifications of DECLARE_TRACE(). v2: Made the DECLARE_TRACE() have the ability to pass arguments and added a new DECLARE_TRACE_NOARGS() for tracepoints that do not need any arguments. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 09:50:34 -04:00
Steven Rostedt	8f08201830	tracing: Create class struct for events This patch creates a ftrace_event_class struct that event structs point to. This class struct will be made to hold information to modify the events. Currently the class struct only holds the events system name. This patch slightly increases the size, but this change lays the ground work of other changes to make the footprint of tracepoints smaller. With 82 standard tracepoints, and 618 system call tracepoints (two tracepoints per syscall: enter and exit): text data bss dec hex filename 4913961 1088356 861512 6863829 68bbd5 vmlinux.orig 4914025 1088868 861512 6864405 68be15 vmlinux.class This patch also cleans up some stale comments in ftrace.h. v2: Fixed missing semi-colon in macro. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-14 09:33:49 -04:00
Steven Rostedt	23e117fa44	Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip into trace/tip/tracing/core-4	2010-05-14 09:29:52 -04:00
Ingo Molnar	0167c78190	watchdog: Export touch_softlockup_watchdog There are modules that rely on it: ERROR: "touch_softlockup_watchdog" [drivers/video/nvidia/nvidiafb.ko] undefined! Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <1273713674-8434-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-13 08:53:33 +02:00
Ingo Molnar	ad56b0797e	Merge commit 'v2.6.34-rc7' into tracing/core Merge reason: Update from -rc5 to -rc7. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-13 08:11:29 +02:00
Don Zickus	d7c547335f	lockup_detector: Separate touch_nmi_watchdog code path from touch_watchdog When I combined the nmi_watchdog (hardlockup) and softlockup code, I also combined the paths the touch_watchdog and touch_nmi_watchdog took. This may not be the best idea as pointed out by Frederic W., that the touch_watchdog case probably should not reset the hardlockup count. Therefore the patch below falls back to the previous idea of keeping the touch_nmi_watchdog a superset of the touch_watchdog case. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-9-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:55 +02:00
Don Zickus	f69bcf60c3	lockup_detector: Remove nmi_watchdog.c file This file migrated to kernel/watchdog.c and then combined with kernel/softlockup.c. As a result kernel/nmi_watchdog.c is no longer needed. Just remove it. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-5-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:46 +02:00
Don Zickus	2508ce1845	lockup_detector: Remove old softlockup code Now that is no longer compiled or used, just remove it. Also move some of the code wrapped with DETECT_SOFTLOCKUP to the LOCKUP_DETECTOR wrappers because that is the code that uses it now. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-4-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:45 +02:00
Don Zickus	332fbdbca3	lockup_detector: Touch_softlockup cleanups and softlockup_tick removal Just some code cleanup to make touch_softlockup clearer and remove the softlockup_tick function as it is no longer needed. Also remove the /proc softlockup_thres call as it has been changed to watchdog_thres. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-3-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:43 +02:00
Don Zickus	58687acba5	lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-05-12 23:55:33 +02:00
Frederic Weisbecker	a9aa1d02de	Merge commit 'v2.6.34-rc7' into perf/nmi Merge reason: catch up with latest softlockup detector changes.	2010-05-12 23:20:33 +02:00
Peter P Waskiewicz Jr	4308ad8011	genirq: Clear CPU mask in affinity_hint when none is provided When an interrupt is disabled and torn down, the CPU mask returned through affinity_hint right now is all CPUs. Also, for drivers that don't provide an affinity_hint mask, this can be misleading. There should be no hint at all, meaning an empty CPU mask. [ tglx: use zalloc_cpumask_var instead of clearing it under the lock ] Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Cc: davem@davemloft.net Cc: arjan@linux.jf.intel.com Cc: bhutchings@solarflare.com LKML-Reference: <20100505205638.5426.87189.stgit@ppwaskie-hc2.jf.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-12 11:23:34 +02:00
David S. Miller	278554bd65	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/net/wireless/ath/ar9170/usb.c drivers/scsi/iscsi_tcp.c net/ipv4/ipmr.c	2010-05-12 00:05:35 -07:00
KAMEZAWA Hiroyuki	747388d78a	memcg: fix css_is_ancestor() RCU locking Some callers (in memcontrol.c) calls css_is_ancestor() without rcu_read_lock. Because css_is_ancestor() has to access RCU protected data, it should be under rcu_read_lock(). This makes css_is_ancestor() itself does safe access to RCU protected area. (At least, "root" can have refcnt==0 if it's not an ancestor of "child". So, we need rcu_read_lock().) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:42 -07:00
KAMEZAWA Hiroyuki	7f0f154641	memcg: fix css_id() RCU locking for real Commit `ad4ba37537` ("memcg: css_id() must be called under rcu_read_lock()") modifies memcontol.c for fixing RCU check message. But Andrew Morton pointed out that the fix doesn't seems sane and it was just for hidining lockdep messages. This is a patch for do proper things. Checking again, all places, accessing without rcu_read_lock, that commit fixies was intentional.... all callers of css_id() has reference count on it. So, it's not necessary to be under rcu_read_lock(). Considering again, we can use rcu_dereference_check for css_id(). We know css->id is valid if css->refcnt > 0. (css->id never changes and freed after css->refcnt going to be 0.) This patch makes use of rcu_dereference_check() in css_id/depth and remove unnecessary rcu-read-lock added by the commit. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:42 -07:00
Vitaliy Gusev	11cad320a4	bsdacct: use del_timer_sync() in acct_exit_ns() acct_exit_ns --> acct_file_reopen deletes timer without check timer execution on other CPUs. So acct_timeout() can change an unmapped memory. Signed-off-by: Vitaliy Gusev <vgusev@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:42 -07:00
Vitaly Mayatskikh	475f9aa6aa	kexec: fix OOPS in crash_kernel_shrink Two "echo 0 > /sys/kernel/kexec_crash_size" OOPSes kernel. Also content of this file is invalid after first shrink to zero: it shows 1 instead of 0. This scenario is unlikely to happen often (root privs, valid crashkernel= in cmdline, dump-capture kernel not loaded), I hit it only by chance. This patch fixes it. Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Cc: Cong Wang <amwang@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:42 -07:00
Robin Holt	34441427aa	revert "procfs: provide stack information for threads" and its fixup commits Originally, commit `d899bf7b` ("procfs: provide stack information for threads") attempted to introduce a new feature for showing where the threadstack was located and how many pages are being utilized by the stack. Commit `c44972f1` ("procfs: disable per-task stack usage on NOMMU") was applied to fix the NO_MMU case. Commit `89240ba0` ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") was applied to fix a bug in ia32 executables being loaded. Commit `9ebd4eba7` ("procfs: fix /proc/<pid>/stat stack pointer for kernel threads") was applied to fix a bug which had kernel threads printing a userland stack address. Commit `1306d603f` ('proc: partially revert "procfs: provide stack information for threads"') was then applied to revert the stack pages being used to solve a significant performance regression. This patch nearly undoes the effect of all these patches. The reason for reverting these is it provides an unusable value in field 28. For x86_64, a fork will result in the task->stack_start value being updated to the current user top of stack and not the stack start address. This unpredictability of the stack_start value makes it worthless. That includes the intended use of showing how much stack space a thread has. Other architectures will get different values. As an example, ia64 gets 0. The do_fork() and copy_process() functions appear to treat the stack_start and stack_size parameters as architecture specific. I only partially reverted `c44972f1` ("procfs: disable per-task stack usage on NOMMU") . If I had completely reverted it, I would have had to change mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is configured. Since I could not test the builds without significant effort, I decided to not change mm/Makefile. I only partially reverted `89240ba0` ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") . I left the KSTK_ESP() change in place as that seemed worthwhile. Signed-off-by: Robin Holt <holt@sgi.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-11 17:33:41 -07:00
Paul E. McKenney	72d5a9f7a9	rcu: remove all rcu head initializations, except on_stack initializations Remove all rcu head inits. We don't care about the RCU head state before passing it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can keep track of objects on stack. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-11 16:10:47 -07:00
Alan Cox	8b6d043b7e	resource: shared I/O region support SuperIO devices share regions and use lock/unlock operations to chip select. We therefore need to be able to request a resource and wait for it to be freed by whichever other SuperIO device currently hogs it. Right now you have to poll which is horrible. Add a MUXED field to IO port resources. If the MUXED field is set on the resource and on the request (via request_muxed_region) then we block until the previous owner of the muxed resource releases their region. This allows us to implement proper resource sharing and locking for superio chips using code of the form enable_my_superio_dev() { request_muxed_region(0x44, 0x02, "superio:watchdog"); outb() ..sequence to enable chip } disable_my_superio_dev() { outb() .. sequence of disable chip release_region(0x44, 0x02); } Signed-off-by: Giel van Schijndel <me@mortis.eu> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-05-11 12:01:10 -07:00
Changli Gao	a93d2f1744	sched, wait: Use wrapper functions epoll should not touch flags in wait_queue_t. This patch introduces a new function __add_wait_queue_exclusive(), for the users, who use wait queue as a LIFO queue. __add_wait_queue_tail_exclusive() is introduced too instead of add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as it is a duplicate of __remove_wait_queue(), disliked by users, and with less users. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <containers@lists.linux-foundation.org> LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-11 17:43:58 +02:00
Peter Zijlstra	96c21a460a	perf: Fix exit() vs event-groups Corey reported that the value scale times of group siblings are not updated when the monitored task dies. The problem appears to be that we only update the group leader's time values, fix it by updating the whole group. Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> # .34.x LKML-Reference: <1273588935.1810.6.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-11 17:08:24 +02:00
Peter Zijlstra	050735b08c	perf: Fix exit() vs PERF_FORMAT_GROUP Both Stephane and Corey reported that PERF_FORMAT_GROUP didn't work as expected if the task the counters were attached to quit before the read() call. The cause is that we unconditionally destroy the grouping when we remove counters from their context. Fix this by splitting off the group destroy from the list removal such that perf_event_remove_from_context() does not do this and change perf_event_release() to do so. Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: <stable@kernel.org> # .34.x LKML-Reference: <1273571513.5605.3527.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-11 15:46:43 +02:00
Ingo Molnar	e3174cfd2a	Revert "perf: Fix exit() vs PERF_FORMAT_GROUP" This reverts commit `4fd38e4595`. It causes various crashes and hangs when events are activated. The cause is not fully understood yet but we need to revert it because the effects are severe. Reported-by: Stephane Eranian <eranian@google.com> Reported-by: Lin Ming <ming.m.lin@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-11 08:31:49 +02:00
Matt Helsley	8f77578cc2	Freezer / cgroup freezer: Update stale locking comments Update stale comments regarding locking order and add a little more detail so it's easier to follow the locking between the cgroup freezer and the power management freezer code. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:18:47 +02:00
Mark Gross	ed77134bfc	PM QOS update This patch changes the string based list management to a handle base implementation to help with the hot path use of pm-qos, it also renames much of the API to use "request" as opposed to "requirement" that was used in the initial implementation. I did this because request more accurately represents what it actually does. Also, I added a string based ABI for users wanting to use a string interface. So if the user writes 0xDDDDDDDD formatted hex it will be accepted by the interface. (someone asked me for it and I don't think it hurts anything.) This patch updates some documentation input I got from Randy. Signed-off-by: markgross <mgross@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:19 +02:00
Randy Dunlap	0fef8b1e83	PM / Hibernate: Fix block_io.c printk warning Fix printk format warning in block_io.c: kernel/power/block_io.c:41: warning: format '%ld' expects type 'long int', but argument 2 has type 'sector_t' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:18 +02:00
Jiri Slaby	6f612af578	PM / Hibernate: Group swap ops Move all the swap processing into one function. It will make swap calls from a non-swap code easier. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:18 +02:00
Jiri Slaby	51fb352b2c	PM / Hibernate: Move the first_sector out of swsusp_write The first sector knowledge is swap-only specific. Move it into the swap handle. This will be needed for later non-swap specific code moving into snapshot.c. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2010-05-10 23:08:18 +02:00
Jiri Slaby	8a0d613fa1	PM / Hibernate: Separate block_io Move block I/O operations to a separate file. It is because it will be used later not only by the swap writer. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:18 +02:00
Jiri Slaby	d3c1b24c50	PM / Hibernate: Snapshot cleanup Remove support of reads with offset. This means snapshot_read/write_next now does not accept count parameter. It allows to clean up the functions and snapshot handle which no longer needs to care about offsets. /dev/snapshot handler is converted to simple_{read_from,write_to}_buffer which take care of offsets. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-05-10 23:08:17 +02:00
Paul E. McKenney	d822ed1094	rcu: fix build bug in RCU_FAST_NO_HZ builds Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:35 -07:00
Paul E. McKenney	77e38ed347	rcu: RCU_FAST_NO_HZ must check RCU dyntick state The current version of RCU_FAST_NO_HZ reproduces the old CLASSIC_RCU dyntick-idle bug, as it fails to detect CPUs that have interrupted or NMIed out of dyntick-idle mode. Fix this by making rcu_needs_cpu() check the state in the per-CPU rcu_dynticks variables, thus correctly detecting the dyntick-idle state from an RCU perspective. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:35 -07:00
Paul E. McKenney	d21670acab	rcu: reduce the number of spurious RCU_SOFTIRQ invocations Lai Jiangshan noted that up to 10% of the RCU_SOFTIRQ are spurious, and traced this down to the fact that the current grace-period machinery will uselessly raise RCU_SOFTIRQ when a given CPU needs to go through a quiescent state, but has not yet done so. In this situation, there might well be nothing that RCU_SOFTIRQ can do, and the overhead can be worth worrying about in the ksoftirqd case. This patch therefore avoids raising RCU_SOFTIRQ in this situation. Changes since v1 (http://lkml.org/lkml/2010/3/30/122 from Lai Jiangshan): o Omit the rcu_qs_pending() prechecks, as they aren't that much less expensive than the quiescent-state checks. o Merge with the set_need_resched() patch that reduces IPIs. o Add the new n_rp_report_qs field to the rcu_pending tracing output. o Update the tracing documentation accordingly. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:35 -07:00
Paul E. McKenney	4a90a0681c	rcu: permit discontiguous cpu_possible_mask CPU numbering TREE_RCU assumes that CPU numbering is contiguous, but some users need large holes in the numbering to better map to hardware layout. This patch makes TREE_RCU (and TREE_PREEMPT_RCU) tolerate large holes in the CPU numbering. However, NR_CPUS must still be greater than the largest CPU number. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:35 -07:00
Paul E. McKenney	4300aa642c	rcu: improve RCU CPU stall-warning messages The existing RCU CPU stall-warning messages can be confusing, especially in the case where one CPU detects a single other stalled CPU. In addition, the console messages did not say which flavor of RCU detected the stall, which can make it difficult to work out exactly what is causing the stall. This commit improves these messages. Requested-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:34 -07:00
Paul E. McKenney	26845c2860	rcu: print boot-time console messages if RCU configs out of ordinary Print boot-time messages if tracing is enabled, if fanout is set to non-default values, if exact fanout is specified, if accelerated dyntick-idle grace periods have been enabled, if RCU-lockdep is enabled, if rcutorture has been boot-time enabled, if the CPU stall detector has been disabled, or if four-level hierarchy has been enabled. This is all for TREE_RCU and TREE_PREEMPT_RCU. TINY_RCU will be handled separately, if at all. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:34 -07:00
Paul E. McKenney	c68de2097a	rcu: disable CPU stall warnings upon panic The current RCU CPU stall warnings remain enabled even after a panic occurs, which some people have found to be a bit counterproductive. This patch therefore uses a notifier to disable stall warnings once a panic occurs. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:34 -07:00
Paul E. McKenney	bbad937983	rcu: slim down rcutiny by removing rcu_scheduler_active and friends TINY_RCU does not need rcu_scheduler_active unless CONFIG_DEBUG_LOCK_ALLOC. So conditionally compile rcu_scheduler_active in order to slim down rcutiny a bit more. Also gets rid of an EXPORT_SYMBOL_GPL, which is responsible for most of the slimming. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:34 -07:00
Paul E. McKenney	25502a6c13	rcu: refactor RCU's context-switch handling The addition of preemptible RCU to treercu resulted in a bit of confusion and inefficiency surrounding the handling of context switches for RCU-sched and for RCU-preempt. For RCU-sched, a context switch is a quiescent state, pure and simple, just like it always has been. For RCU-preempt, a context switch is in no way a quiescent state, but special handling is required when a task blocks in an RCU read-side critical section. However, the callout from the scheduler and the outer loop in ksoftirqd still calls something named rcu_sched_qs(), whose name is no longer accurate. Furthermore, when rcu_check_callbacks() notes an RCU-sched quiescent state, it ends up unnecessarily (though harmlessly, aside from the performance hit) enqueuing the current task if it happens to be running in an RCU-preempt read-side critical section. This not only increases the maximum latency of scheduler_tick(), it also needlessly increases the overhead of the next outermost rcu_read_unlock() invocation. This patch addresses this situation by separating the notion of RCU's context-switch handling from that of RCU-sched's quiescent states. The context-switch handling is covered by rcu_note_context_switch() in general and by rcu_preempt_note_context_switch() for preemptible RCU. This permits rcu_sched_qs() to handle quiescent states and only quiescent states. It also reduces the maximum latency of scheduler_tick(), though probably by much less than a microsecond. Finally, it means that tasks within preemptible-RCU read-side critical sections avoid incurring the overhead of queuing unless there really is a context switch. Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>	2010-05-10 11:08:33 -07:00
Paul E. McKenney	99652b54de	rcu: rename rcutiny rcu_ctrlblk to rcu_sched_ctrlblk Make naming line up in preparation for CONFIG_TINY_PREEMPT_RCU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:32 -07:00
Paul E. McKenney	da848c47bc	rcu: shrink rcutiny by making synchronize_rcu_bh() be inline Because synchronize_rcu_bh() is identical to synchronize_sched(), make the former a static inline invoking the latter, saving the overhead of an EXPORT_SYMBOL_GPL() and the duplicate code. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:32 -07:00
Lai Jiangshan	5db356736a	rcu: ignore offline CPUs in last non-dyntick-idle CPU check Offline CPUs are not in nohz_cpu_mask, but can be ignored when checking for the last non-dyntick-idle CPU. This patch therefore only checks online CPUs for not being dyntick idle, allowing fast entry into full-system dyntick-idle state even when there are some offline CPUs. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
Lai Jiangshan	0c34029abd	rcu: move some code from macro to function Shrink the RCU_INIT_FLAVOR() macro by moving all but the initialization of the ->rda[] array to rcu_init_one(). The call to rcu_init_one() can then be moved to the end of the RCU_INIT_FLAVOR() macro, which is required because rcu_boot_init_percpu_data(), which is now called from rcu_init_one(), depends on the initialization of the ->rda[] array. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
Lai Jiangshan	f261414f0d	rcu: make dead code really dead cleanup: make dead code really dead Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
Paul E. McKenney	d25eb9442b	rcu: substitute set_need_resched for sending resched IPIs This patch adds a check to __rcu_pending() that does a local set_need_resched() if the current CPU is holding up the current grace period and if force_quiescent_state() will be called soon. The goal is to reduce the probability that force_quiescent_state() will need to do smp_send_reschedule(), which sends an IPI and is therefore more expensive on most architectures. Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
Lai Jiangshan	2b3fc35f69	rcu: optionally leave lockdep enabled after RCU lockdep splat There is no need to disable lockdep after an RCU lockdep splat, so remove the debug_lockdeps_off() from lockdep_rcu_dereference(). To avoid repeated lockdep splats, use a static variable in the inlined rcu_dereference_check() and rcu_dereference_protected() macros so that a given instance splats only once, but so that multiple instances can be detected per boot. This is controlled by a new config variable CONFIG_PROVE_RCU_REPEATEDLY, which is disabled by default. This provides the normal lockdep behavior by default, but permits people who want to find multiple RCU-lockdep splats per boot to easily do so. Requested-by: Eric Paris <eparis@redhat.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Tested-by: Eric Paris <eparis@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-10 11:08:31 -07:00
John Stultz	d7e81c269d	clocksource: Add clocksource_register_hz/khz interface How to pick good mult/shift pairs has always been difficult to describe to folks writing clocksource drivers, since it requires careful tradeoffs in adjustment accuracy vs overflow limits. Now, with the clocks_calc_mult_shift function, its much easier. However, not many clocksources have converted to using that function, and there is still the issue of the max interval length assumption being made by each clocksource driver independently. So this patch simplifies the registration process by having clocksources be registered with a hz/khz value and the registration function taking care of setting mult/shift. This should take most of the confusion out of writing a clocksource driver. Additionally it also keeps the shift size tradeoff (more accuracy vs longer possible nohz times) centralized so the timekeeping core can keep track of the assumptions being made. [ tglx: Coding style and comments fixed ] Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1273280858-30143-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-10 14:24:26 +02:00
Stanislaw Gruszka	29f87b793d	posix-cpu-timers: Optimize run_posix_cpu_timers() We can optimize and simplify things taking into account signal->cputimer is always running when we have configured any process wide cpu timer. In check_process_timers(), we don't have to check if new updated value of signal->cputime_expires is smaller, since we maintain new first expiration time ({prof,virt,sched}_expires) in code flow and all other writes to expiration cache are protected by sighand->siglock . Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-10 14:24:26 +02:00
Thomas Gleixner	dbb6be6d5e	Merge branch 'linus' into timers/core Reason: Further posix_cpu_timer patches depend on mainline changes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-10 14:20:42 +02:00
Ingo Molnar	7c224a03a7	Merge commit 'v2.6.34-rc7' into oprofile Merge reason: Update to Linus's latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-10 13:12:29 +02:00
Li Zefan	af507ae8a0	sched: Remove a stale comment This comment should have been removed together with uids_mutex when removing user sched. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dhaval Giani <dhaval.giani@gmail.com> LKML-Reference: <4BE77C6B.5010402@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-10 08:48:39 +02:00
Arjan van de Ven	0224cf4c5e	sched: Intoduce get_cpu_iowait_time_us() For the ondemand cpufreq governor, it is desired that the iowait time is microaccounted in a similar way as idle time is. This patch introduces the infrastructure to account and expose this information via the get_cpu_iowait_time_us() function. [akpm@linux-foundation.org: fix CONFIG_NO_HZ=n build] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082523.284feab6@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:27 +02:00
Arjan van de Ven	e0e37c200f	sched: Eliminate the ts->idle_lastupdate field Now that the only user of ts->idle_lastupdate is update_ts_time_stats(), the entire field can be eliminated. In update_ts_time_stats(), idle_lastupdate is first set to "now", and a few lines later, the only user is an if() statement that assigns a variable either to "now" or to ts->idle_lastupdate, which has the value of "now" at that point. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082439.2fab0b4f@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:26 +02:00
Arjan van de Ven	8d63bf949e	sched: Fold updating of the last_update_time_info into update_ts_time_stats() This patch folds the updating of the last_update_time into the update_ts_time_stats() function, and updates the callers. This allows for further cleanups that are done in the next patch. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082403.60072967@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:26 +02:00
Arjan van de Ven	8c7b09f43f	sched: Update the idle statistics in get_cpu_idle_time_us() Right now, get_cpu_idle_time_us() only reports the idle statistics upto the point the CPU entered last idle; not what is valid right now. This patch adds an update of the idle statistics to get_cpu_idle_time_us(), so that calling this function always returns statistics that are accurate at the point of the call. This includes resetting the start of the idle time for accounting purposes to avoid double accounting. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082323.2d2f1945@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:26 +02:00
Arjan van de Ven	595aac488b	sched: Introduce a function to update the idle statistics Currently, two places update the idle statistics (and more to come later in this series). This patch creates a helper function for updating these statistics. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082245.163e67ed@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:25 +02:00
Arjan van de Ven	b1f724c305	sched: Add a comment to get_cpu_idle_time_us() The exported function get_cpu_idle_time_us() has no comment describing it; add a kerneldoc comment Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: davej@redhat.com LKML-Reference: <20100509082208.7cb721f0@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-09 19:35:25 +02:00
Frederic Weisbecker	9313543945	tracing: Drop the nested field from lock_release event Drop the nested field as we don't use it. Every nested state can be computed from a state machine on post processing already. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-05-09 13:45:34 +02:00
Frederic Weisbecker	883a2a3189	tracing: Drop lock_acquired waittime field Drop the waittime field from the lock_acquired event, we can calculate it by substracting the lock_acquired event timestamp with the matching lock_acquire one. It is not needed and takes useless space in the traces. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-05-09 13:45:32 +02:00
Ingo Molnar	e7858f52a5	Merge branch 'cpu_stop' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into sched/core	2010-05-08 18:11:19 +02:00
Masami Hiramatsu	c0614829c1	kprobes: Move enable/disable_kprobe() out from debugfs code Move enable/disable_kprobe() API out from debugfs related code, because these interfaces are not related to debugfs interface. This fixes a compiler warning. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Tony Luck <tony.luck@intel.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> LKML-Reference: <20100427223312.2322.60512.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-08 18:08:30 +02:00
Tejun Heo	bbf1bb3eee	cpu_stop: add dummy implementation for UP When !CONFIG_SMP, cpu_stop functions weren't defined at all which could lead to build failures if UP code uses cpu_stop facility. Add dummy cpu_stop implementation for UP. The waiting variants execute the work function directly with preempt disabled and stop_one_cpu_nowait() schedules a workqueue work. Makefile and ifdefs around stop_machine implementation are updated to accomodate CONFIG_SMP && !CONFIG_STOP_MACHINE case. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ingo Molnar <mingo@elte.hu>	2010-05-08 17:12:33 +02:00
Paul Mackerras	6e85158cf5	perf_event: Make software events work again Commit `6bde9b6ce0` ("perf: Add group scheduling transactional APIs") added code to allow a group to be scheduled in a single transaction. However, it introduced a bug in handling events whose pmu does not implement transactions -- at the end of scheduling in the events in the group, in the non-transactional case the code now falls through to the group_error label, and proceeds to unschedule all the events in the group and return failure. This fixes it by returning 0 (success) in the non-transactional case. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: eranian@gmail.com LKML-Reference: <20100508105800.GB10650@brick.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-08 13:16:55 +02:00
Neil Horman	3ee943728f	ipv4: remove ip_rt_secret timer (v4) A while back there was a discussion regarding the rt_secret_interval timer. Given that we've had the ability to do emergency route cache rebuilds for awhile now, based on a statistical analysis of the various hash chain lengths in the cache, the use of the flush timer is somewhat redundant. This patch removes the rt_secret_interval sysctl, allowing us to rely solely on the statistical analysis mechanism to determine the need for route cache flushes. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-08 01:57:52 -07:00
Linus Torvalds	91bc482ec5	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: create rcu_my_thread_group_empty() wrapper memcg: css_id() must be called under rcu_read_lock() cgroup: Check task_lock in task_subsys_state() sched: Fix an RCU warning in print_task() cgroup: Fix an RCU warning in alloc_css_id() cgroup: Fix an RCU warning in cgroup_path() KEYS: Fix an RCU warning in the reading of user keys KEYS: Fix an RCU warning	2010-05-07 13:58:21 -07:00
Lin Ming	6bde9b6ce0	perf: Add group scheduling transactional APIs Add group scheduling transactional APIs to struct pmu. These APIs will be implemented in arch code, based on Peter's idea as below. > the idea behind hw_perf_group_sched_in() is to not perform > schedulability tests on each event in the group, but to add the group > as a whole and then perform one test. > > Of course, when that test fails, you'll have to roll-back the whole > group again. > > So start_txn (or a better name) would simply toggle a flag in the pmu > implementation that will make pmu::enable() not perform the > schedulablilty test. > > Then commit_txn() will perform the schedulability test (so note the > method has to have a !void return value. > > This will allow us to use the regular > kernel/perf_event.c::group_sched_in() and all the rollback code. > Currently each hw_perf_group_sched_in() implementation duplicates all > the rolllback code (with various bugs). ->start_txn: Start group events scheduling transaction, set a flag to make pmu::enable() not perform the schedulability test, it will be performed at commit time. ->commit_txn: Commit group events scheduling transaction, perform the group schedulability as a whole ->cancel_txn: Stop group events scheduling transaction, clear the flag so pmu::enable() will perform the schedulability test. Reviewed-by: Stephane Eranian <eranian@google.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1272002160.5707.60.camel@minggr.sh.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:31:02 +02:00
Peter Zijlstra	a0507c84bf	perf: Annotate perf_event_read_group() vs perf_event_release_kernel() Stephane reported a lockdep warning while using PERF_FORMAT_GROUP. The issue is that perf_event_read_group() takes faults while holding the ctx->mutex, while perf_event_release_kernel() can be called from munmap(). Which makes for an AB-BA deadlock. Except we can never establish the deadlock because we'll only ever call perf_event_release_kernel() after all file descriptors are dead so there is no concurrency possible. Reported-by: Stephane Eranian <eranian@google.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:30:59 +02:00
Ingo Molnar	cce9131781	Merge branch 'perf/urgent' into perf/core Merge reason: Resolve patch dependency Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:30:30 +02:00
Peter Zijlstra	4fd38e4595	perf: Fix exit() vs PERF_FORMAT_GROUP Both Stephane and Corey reported that PERF_FORMAT_GROUP didn't work as expected if the task the counters were attached to quit before the read() call. The cause is that we unconditionally destroy the grouping when we remove counters from their context. Fix this by only doing this when we free the counter itself. Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1273160566.5605.404.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:30:17 +02:00
Peter Zijlstra	27a9da6538	sched: Remove rq argument to the tracepoints struct rq isn't visible outside of sched.o so its near useless to expose the pointer, also there are no users of it, so remove it. Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1272997616.1642.207.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:28:17 +02:00
Ingo Molnar	48652ced15	Merge commit 'v2.6.34-rc6' into sched/core	2010-05-07 11:27:54 +02:00
Yong Zhang	4726f2a617	lockdep: Reduce stack_trace usage When calling check_prevs_add(), if all validations passed add_lock_to_list() will add new lock to dependency tree and alloc stack_trace for each list_entry. But at this time, we are always on the same stack, so stack_trace for each list_entry has the same value. This is redundant and eats up lots of memory which could lead to warning on low MAX_STACK_TRACE_ENTRIES. Use one copy of stack_trace instead. V2: As suggested by Peter Zijlstra, move save_trace() from check_prevs_add() to check_prev_add(). Add tracking for trylock dependence which is also redundant. Signed-off-by: Yong Zhang <yong.zhang0@windriver.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100504065711.GC10784@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-07 11:27:26 +02:00
Paul E. McKenney	fc390cde36	rcu: need barrier() in UP synchronize_sched_expedited() If synchronize_sched_expedited() is ever to be called from within kernel/sched.c in a !SMP PREEMPT kernel, the !SMP implementation needs a barrier(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-05-07 07:23:21 +02:00
Dan Carpenter	d9f599e1e6	perf: Fix check at end of event search The original code doesn't work because "call" is never NULL there. Signed-off-by: Dan Carpenter <error27@gmail.com> LKML-Reference: <20100320143911.GF5331@bicker> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-06 19:49:52 -04:00
Paul E. McKenney	cc631fb732	sched: correctly place paranioa memory barriers in synchronize_sched_expedited() The memory barriers must be in the SMP case, not in the !SMP case. Also add a barrier after the atomic_inc() in order to ensure that other CPUs see post-synchronize_sched_expedited() actions as following the expedited grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-05-06 18:49:21 +02:00
Tejun Heo	94458d5ecb	sched: kill paranoia check in synchronize_sched_expedited() The paranoid check which verifies that the cpu_stop callback is actually called on all online cpus is completely superflous. It's guaranteed by cpu_stop facility and if it didn't work as advertised other things would go horribly wrong and trying to recover using synchronize_sched() wouldn't be very meaningful. Kill the paranoid check. Removal of this feature is done as a separate step so that it can serve as a bisection point if something actually goes wrong. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Josh Triplett <josh@freedesktop.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>	2010-05-06 18:49:21 +02:00
Tejun Heo	969c79215a	sched: replace migration_thread with cpu_stop Currently migration_thread is serving three purposes - migration pusher, context to execute active_load_balance() and forced context switcher for expedited RCU synchronize_sched. All three roles are hardcoded into migration_thread() and determining which job is scheduled is slightly messy. This patch kills migration_thread and replaces all three uses with cpu_stop. The three different roles of migration_thread() are splitted into three separate cpu_stop callbacks - migration_cpu_stop(), active_load_balance_cpu_stop() and synchronize_sched_expedited_cpu_stop() - and each use case now simply asks cpu_stop to execute the callback as necessary. synchronize_sched_expedited() was implemented with private preallocated resources and custom multi-cpu queueing and waiting logic, both of which are provided by cpu_stop. synchronize_sched_expedited_count is made atomic and all other shared resources along with the mutex are dropped. synchronize_sched_expedited() also implemented a check to detect cases where not all the callback got executed on their assigned cpus and fall back to synchronize_sched(). If called with cpu hotplug blocked, cpu_stop already guarantees that and the condition cannot happen; otherwise, stop_machine() would break. However, this patch preserves the paranoid check using a cpumask to record on which cpus the stopper ran so that it can serve as a bisection point if something actually goes wrong theree. Because the internal execution state is no longer visible, rcu_expedited_torture_stats() is removed. This patch also renames cpu_stop threads to from "stopper/%d" to "migration/%d". The names of these threads ultimately don't matter and there's no reason to make unnecessary userland visible changes. With this patch applied, stop_machine() and sched now share the same resources. stop_machine() is faster without wasting any resources and sched migration users are much cleaner. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Josh Triplett <josh@freedesktop.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>	2010-05-06 18:49:21 +02:00
Tejun Heo	3fc1f1e27a	stop_machine: reimplement using cpu_stop Reimplement stop_machine using cpu_stop. As cpu stoppers are guaranteed to be available for all online cpus, stop_machine_create/destroy() are no longer necessary and removed. With resource management and synchronization handled by cpu_stop, the new implementation is much simpler. Asking the cpu_stop to execute the stop_cpu() state machine on all online cpus with cpu hotplug disabled is enough. stop_machine itself doesn't need to manage any global resources anymore, so all per-instance information is rolled into struct stop_machine_data and the mutex and all static data variables are removed. The previous implementation created and destroyed RT workqueues as necessary which made stop_machine() calls highly expensive on very large machines. According to Dimitri Sivanich, preventing the dynamic creation/destruction makes booting faster more than twice on very large machines. cpu_stop resources are preallocated for all online cpus and should have the same effect. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>	2010-05-06 18:49:20 +02:00
Tejun Heo	1142d81029	cpu_stop: implement stop_cpu[s]() Implement a simplistic per-cpu maximum priority cpu monopolization mechanism. A non-sleeping callback can be scheduled to run on one or multiple cpus with maximum priority monopolozing those cpus. This is primarily to replace and unify RT workqueue usage in stop_machine and scheduler migration_thread which currently is serving multiple purposes. Four functions are provided - stop_one_cpu(), stop_one_cpu_nowait(), stop_cpus() and try_stop_cpus(). This is to allow clean sharing of resources among stop_cpu and all the migration thread users. One stopper thread per cpu is created which is currently named "stopper/CPU". This will eventually replace the migration thread and take on its name. * This facility was originally named cpuhog and lived in separate files but Peter Zijlstra nacked the name and thus got renamed to cpu_stop and moved into stop_machine.c. * Better reporting of preemption leak as per Peter's suggestion. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Dimitri Sivanich <sivanich@sgi.com>	2010-05-06 18:49:20 +02:00
Paul E. McKenney	ee84b8243b	rcu: create rcu_my_thread_group_empty() wrapper Some RCU-lockdep splat repairs need to know whether they are running in a single-threaded process. Unfortunately, the thread_group_empty() primitive is defined in sched.h, and can induce #include hell. This commit therefore introduces a rcu_my_thread_group_empty() wrapper that is defined in rcupdate.c, thus avoiding the need to include sched.h everywhere. Signed-off-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>	2010-05-06 09:28:41 -07:00
James Morris	043b4d40f5	Merge branch 'master' into next Conflicts: security/keys/keyring.c Resolved conflict with whitespace fix in find_keyring_by_name() Signed-off-by: James Morris <jmorris@namei.org>	2010-05-06 22:21:04 +10:00
James Morris	0ffbe2699c	Merge branch 'master' into next	2010-05-06 10:56:07 +10:00
Thiago Farina	668eb65f09	tracing: Fix "integer as NULL pointer" warning. kernel/trace/trace_output.c:256:24: warning: Using plain integer as NULL pointer Signed-off-by: Thiago Farina <tfransosi@gmail.com> LKML-Reference: <1264349038-1766-3-git-send-email-tfransosi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-05 12:01:26 -04:00
Linus Torvalds	8777c793d6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: flush_delayed_work: keep the original workqueue for re-queueing	2010-05-05 07:56:36 -07:00
Linus Torvalds	f5fa05d972	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix resource leak in failure path of perf_event_open()	2010-05-04 15:16:15 -07:00
Linus Torvalds	f2809d61d6	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Fix RCU lockdep splat on freezer_fork path rcu: Fix RCU lockdep splat in set_task_cpu on fork path mutex: Don't spin when the owner CPU is offline or other weird cases	2010-05-04 15:15:43 -07:00
Li Zefan	b629317e66	sched: Fix an RCU warning in print_task() With CONFIG_PROVE_RCU=y, a warning can be triggered: $ cat /proc/sched_debug ... kernel/cgroup.c:1649 invoked rcu_dereference_check() without protection! ... Both cgroup_path() and task_group() should be called with either rcu_read_lock or cgroup_mutex held. The rcu_dereference_check() does include cgroup_lock_is_held(), so we know that this lock is not held. Therefore, in a CONFIG_PREEMPT kernel, to say nothing of a CONFIG_PREEMPT_RT kernel, the original code could have ended up copying a string out of the freelist. This patch inserts RCU read-side primitives needed to prevent this scenario. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-04 09:25:01 -07:00
Li Zefan	fae9c79170	cgroup: Fix an RCU warning in alloc_css_id() With CONFIG_PROVE_RCU=y, a warning can be triggered: # mount -t cgroup -o memory xxx /mnt # mkdir /mnt/0 ... kernel/cgroup.c:4442 invoked rcu_dereference_check() without protection! ... This is a false-positive. It's safe to directly access parent_css->id. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-04 09:25:00 -07:00
Li Zefan	9a9686b634	cgroup: Fix an RCU warning in cgroup_path() with CONFIG_PROVE_RCU=y, a warning can be triggered: # mount -t cgroup -o debug xxx /mnt # cat /proc/$$/cgroup ... kernel/cgroup.c:1649 invoked rcu_dereference_check() without protection! ... This is a false-positive, because cgroup_path() can be called with either rcu_read_lock() held or cgroup_mutex held. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2010-05-04 09:24:59 -07:00
Borislav Petkov	956097912c	ring-buffer: Wrap open-coded WARN_ONCE Wrap open-coded WARN_ONCE functionality into the equivalent macro. Signed-off-by: Borislav Petkov <bp@alien8.de> LKML-Reference: <20100502060354.GA5281@liondog.tnic> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-05-04 12:23:47 -04:00
Frederic Weisbecker	777d0411cd	hw_breakpoints: Fix percpu build failure Fix this build error: kernel/hw_breakpoint.c:58:1: error: pasting "__pcpu_scope_" and "" does not give a valid preprocessing token It happens if CONFIG_DEBUG_FORCE_WEAK_PER_CPU, because we concatenate someting with the name and we have the "" in the name. Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Wessel <jason.wessel@windriver.com> LKML-Reference: <20100503133942.GA5497@nowhere> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-04 08:39:36 +02:00
Frederic Weisbecker	54d47a2be5	lockdep: No need to disable preemption in debug atomic ops No need to disable preemption in the debug_atomic_* ops, as we ensure interrupts are disabled already. So let's use the __this_cpu_ops() rather than this_cpu_ops() that enclose the ops in a preempt disabled section. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2010-05-04 05:38:16 +02:00
Frederic Weisbecker	fa9a97dec6	lockdep: Actually _dec_ in debug_atomic_dec Fix a silly copy-paste mistake that was making debug_atomic_dec use this_cpu_inc instead of this_cpu_dec. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2010-05-04 05:38:12 +02:00
Frederic Weisbecker	ba697f40db	lockdep: Provide off case for redundant_hardirqs_on increment We forgot to provide a !CONFIG_DEBUG_LOCKDEP case for the redundant_hardirqs_on stat handling. Manage that in the headers with a new __debug_atomic_inc() helper. Fixes: kernel/lockdep.c:2306: error: 'lockdep_stats' undeclared (first use in this function) kernel/lockdep.c:2306: error: (Each undeclared identifier is reported only once kernel/lockdep.c:2306: error: for each function it appears in.) Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2010-05-04 05:37:28 +02:00
Peter P Waskiewicz Jr	e7a297b0d7	genirq: Add CPU mask affinity hint This patch adds a cpumask affinity hint to the irq_desc structure, along with a registration function and a read-only proc entry for each interrupt. This affinity_hint handle for each interrupt can be used by underlying drivers that need a better mechanism to control interrupt affinity. The underlying driver can register a cpumask for the interrupt, which will allow the driver to provide the CPU mask for the interrupt to anything that requests it. The intent is to extend the userspace daemon, irqbalance, to help hint to it a preferred CPU mask to balance the interrupt into. [ tglx: Fixed compile warnings, added WARN_ON, made SMP only ] Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Cc: davem@davemloft.net Cc: arjan@linux.jf.intel.com Cc: bhutchings@solarflare.com LKML-Reference: <20100430214445.3992.41647.stgit@ppwaskie-hc2.jf.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-05-03 11:50:57 +02:00
Steffen Klassert	6751fb3c0e	padata: Use get_online_cpus/put_online_cpus This patch puts get_online_cpus/put_online_cpus around the places we modify the padata cpumask to ensure that no cpu goes offline during this operation. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:32:12 +08:00
Steffen Klassert	7b389b2cc5	padata: Initialize the padata queues only for the used cpus padata_alloc_pd set up queues for all possible cpus. This patch changes this to set up the queues just for the used cpus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:32:11 +08:00
Steffen Klassert	7d0d2d385c	padata: Remove superfluous might_sleep might_sleep() was placed before mutex_lock() in some places. We remove them because mutex_lock() does might_sleep() too. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:32:11 +08:00
Steffen Klassert	e2cb2f1c2c	padata: cpu hotplug code should depend on CONFIG_HOTPLUG_CPU This patch makes the padata cpu hotplug code dependend on CONFIG_HOTPLUG_CPU. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:32:11 +08:00
Herbert Xu	df2071bd08	Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6	2010-05-03 11:28:58 +08:00
Steffen Klassert	97e3d94aac	padata: Dont scale the parallel objects with the cpus Scaling the maximum number of objects in the parallel codepath can lead to out of memory problems on bigsmp machines. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-05-03 11:16:13 +08:00
Tejun Heo	048c852051	perf: Fix resource leak in failure path of perf_event_open() perf_event_open() kfrees event after init failure which doesn't release all resources allocated by perf_event_alloc(). Use free_event() instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@au1.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: <stable@kernel.org> LKML-Reference: <4BDBE237.1040809@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-05-01 13:11:25 +02:00
Frederic Weisbecker	feef47d0cb	hw-breakpoints: Get the number of available registers on boot dynamically The breakpoint generic layer assumes that archs always know in advance the static number of address registers available to host breakpoints through the HBP_NUM macro. However this is not true for every archs. For example Arm needs to get this information dynamically to handle the compatiblity between different versions. To solve this, this patch proposes to drop the static HBP_NUM macro and let the arch provide the number of available slots through a new hw_breakpoint_slots() function. For archs that have CONFIG_HAVE_MIXED_BREAKPOINTS_REGS selected, it will be called once as the number of registers fits for instruction and data breakpoints together. For the others it will be called first to get the number of instruction breakpoint registers and another time to get the data breakpoint registers, the targeted type is given as a parameter of hw_breakpoint_slots(). Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu>	2010-05-01 04:32:14 +02:00
Frederic Weisbecker	f93a205411	hw-breakpoints: Handle breakpoint weight in allocation constraints Depending on their nature and on what an arch supports, breakpoints may consume more than one address register. For example a simple absolute address match usually only requires one address register. But an address range match may consume two registers. Currently our slot allocation constraints, that tend to reflect the limited arch's resources, always consider that a breakpoint consumes one slot. Then provide a way for archs to tell us the weight of a breakpoint through a new hw_breakpoint_weight() helper. This weight will be computed against the generic allocation constraints instead of a constant value. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu>	2010-05-01 04:32:12 +02:00
Frederic Weisbecker	0102752e4c	hw-breakpoints: Separate constraint space for data and instruction breakpoints There are two outstanding fashions for archs to implement hardware breakpoints. The first is to separate breakpoint address pattern definition space between data and instruction breakpoints. We then have typically distinct instruction address breakpoint registers and data address breakpoint registers, delivered with separate control registers for data and instruction breakpoints as well. This is the case of PowerPc and ARM for example. The second consists in having merged breakpoint address space definition between data and instruction breakpoint. Address registers can host either instruction or data address and the access mode for the breakpoint is defined in a control register. This is the case of x86 and Super H. This patch adds a new CONFIG_HAVE_MIXED_BREAKPOINTS_REGS config that archs can select if they belong to the second case. Those will have their slot allocation merged for instructions and data breakpoints. The others will have a separate slot tracking between data and instruction breakpoints. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu>	2010-05-01 04:32:11 +02:00
Frederic Weisbecker	b2812d031d	hw-breakpoints: Change/Enforce some breakpoints policies The current policies of breakpoints in x86 and SH are the following: - task bound breakpoints can only break on userspace addresses - cpu wide breakpoints can only break on kernel addresses The former rule prevents ptrace breakpoints to be set to trigger on kernel addresses, which is good. But as a side effect, we can't breakpoint on kernel addresses for task bound breakpoints. The latter rule simply makes no sense, there is no reason why we can't set breakpoints on userspace while performing cpu bound profiles. We want the following new policies: - task bound breakpoint can set userspace address breakpoints, with no particular privilege required. - task bound breakpoints can set kernelspace address breakpoints but must be privileged to do that. - cpu bound breakpoints can do what they want as they are privileged already. To implement these new policies, this patch checks if we are dealing with a kernel address breakpoint, if so and if the exclude_kernel parameter is set, we tell the user that the breakpoint is invalid, which makes a good generic ptrace protection. If we don't have exclude_kernel, ensure the user has the right privileges as kernel breakpoints are quite sensitive (risk of trap recursion attacks and global performance impacts). [ Paul Mundt: keep addr space check for sh signal delivery and fix double function declaration] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2010-05-01 04:32:10 +02:00
Frederic Weisbecker	87e9b20246	hw-breakpoints: Check disabled breakpoints again We stopped checking disabled breakpoints because we weren't allowing breakpoints on NULL addresses. And gdb tends to set NULL addresses on inactive breakpoints. But refusing NULL addresses was actually a regression that has been fixed now. There is no reason anymore to not validate inactive breakpoint settings. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu>	2010-05-01 04:32:09 +02:00
Kei Tokunaga	bf81623542	[SCSI] add scsi trace core functions and put trace points Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com> Signed-off-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-04-30 12:51:10 -05:00
Kei Tokunaga	5a2e399595	[SCSI] ftrace: add __print_hex() __print_hex() prints values in an array in hex (w/o '0x') (space separated) EX) 92 33 32 f3 ee 4d Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com> Signed-off-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: James Bottomley <James.Bottomley@suse.de>	2010-04-30 12:50:22 -05:00
Frederic Weisbecker	913769f24e	lockdep: Simplify debug atomic ops Simplify debug_atomic_inc/dec by using this_cpu_inc/dec() instead of doing it through an indirect get_cpu_var() and a manual incrementation. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2010-04-30 19:15:56 +02:00
Frederic Weisbecker	8795d7717c	lockdep: Fix redundant_hardirqs_on incremented with irqs enabled When a path restore the flags while irqs are already enabled, we update the per cpu var redundant_hardirqs_on in a racy fashion and debug_atomic_inc() warns about this situation. In this particular case, loosing a few hits in a stat is not a big deal, so increment it without protection. v2: Don't bother with disabling irq, we can miss one count in rare situations Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2010-04-30 19:15:49 +02:00
Frederic Weisbecker	868c522b1b	Merge commit 'v2.6.34-rc6' into core/locking Merge reason: Further lockdep patches depend on per cpu updates made in -rc1.	2010-04-30 19:12:47 +02:00
Paul E. McKenney	8b46f88084	rcu: Fix RCU lockdep splat on freezer_fork path Add an RCU read-side critical section to suppress this false positive. Located-by: Eric Paris <eparis@parisplace.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: eric.dumazet@gmail.com LKML-Reference: <1271880131-3951-2-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-30 12:03:17 +02:00
Peter Zijlstra	8b08ca52f5	rcu: Fix RCU lockdep splat in set_task_cpu on fork path Add an RCU read-side critical section to suppress this false positive. Located-by: Eric Paris <eparis@parisplace.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: eric.dumazet@gmail.com LKML-Reference: <1271880131-3951-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-30 12:03:17 +02:00
Ingo Molnar	3ca50496c2	Merge commit 'v2.6.34-rc6' into perf/core Merge reason: update to the latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-30 09:56:44 +02:00
Oleg Nesterov	4d707b9f48	workqueue: change cancel_work_sync() to clear work->data In short: change cancel_work_sync(work) to mark this work as "never queued" upon return. When cancel_work_sync(work) succeeds, we know that this work can't be queued or running, and since we own WORK_STRUCT_PENDING nobody can change the bits in work->data under us. This means we can also clear the "cwq" part along with _PENDING bit lockless before return, unless the work is queued nobody can assume get_wq_data() is stable even under cwq->lock. This change can speedup the subsequent cancel/flush requests, and as Dmitry pointed out this simplifies the usage of work_struct's which can be queued on different workqueues. Consider this pseudo code from the input subsystem: struct workqueue_struct WQ; struct work_struct WORK; for (;;) { WQ = create_workqueue(); ... if (condition()) queue_work(WQ, WORK); ... cancel_work_sync(WORK); destroy_workqueue(WQ); } If condition() returns T and then F, cancel_work_sync() will crash the kernel because WORK->data still points to the already destroyed workqueue. With this patch the code like above becomes correct. Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-04-30 08:57:25 +02:00
Alan Stern	eef6a7d5c2	workqueue: warn about flush_scheduled_work() This patch (as1319) adds kerneldoc and a pointed warning to flush_scheduled_work(). Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-04-30 08:57:25 +02:00
Oleg Nesterov	47dd5be2d6	workqueue: flush_delayed_work: keep the original workqueue for re-queueing flush_delayed_work() always uses keventd_wq for re-queueing, but it should use the workqueue this dwork was queued on. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-04-30 07:24:51 +02:00
Jens Axboe	7407cf355f	Merge branch 'master' into for-2.6.35 Conflicts: fs/block_dev.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-04-29 09:36:24 +02:00
Steven Rostedt	37e44bc50d	tracing: Fix sleep time function profiling When sleep_time is off the function profiler ignores the time that a task is scheduled out. When the task is scheduled out a timestamp is taken. When the task is scheduled back in, the timestamp is compared to the current time and the saved calltimes are adjusted accordingly. But when stopping the function profiler, the sched switch hook that does this adjustment was stopped before shutting down the tracer. This allowed some tasks to not get their timestamps set when they scheduled out. When the function profiler started again, this would skew the times of the scheduler functions. This patch moves the stopping of the sched switch to after the function profiler is stopped. It also ignores zero set calltimes, which may happen on start up. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 21:04:24 -04:00
Chase Douglas	e330b3bcd8	tracing: Show sample std dev in function profiling When combined with function graph tracing the ftrace function profiler also prints the average run time of functions. While this gives us some good information, it doesn't tell us anything about the variance of the run times of the function. This change prints out the s^2 sample standard deviation alongside the average. This change adds one entry to the profile record structure. This increases the memory footprint of the function profiler by 1/3 on a 32-bit system, and by 1/5 on a 64-bit system when function graphing is enabled, though the memory is only allocated when the profiler is turned on. During the profiling, one extra line of code adds the squared calltime to the new record entry, so this should not adversly affect performance. Note that the square of the sample standard deviation is printed because there is no sqrt implementation for unsigned long long in the kernel. Signed-off-by: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1272304925-2436-1-git-send-email-chase.douglas@canonical.com> [ fixed comment about ns^2 -> us^2 conversion ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 18:23:15 -04:00
Steven Rostedt	a838b2e634	ring-buffer: Make benchmark handle missed events With the addition of the "missed events" flags that is stored in the commit field of the ring buffer page, the ring_buffer_benchmark was not updated to handle this. If events are missed, then the missed events flag is set in the ring buffer page, the benchmark will count that flag as part of the size of the page and will hit the BUG() when it tries to read beyond the page. The solution is simply to have the ring buffer benchmark mask off the extra bits. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 13:26:58 -04:00
David Miller	72c9ddfd4c	ring-buffer: Make non-consuming read less expensive with lots of cpus. When performing a non-consuming read, a synchronize_sched() is performed once for every cpu which is actively tracing. This is very expensive, and can make it take several seconds to open up the 'trace' file with lots of cpus. Only one synchronize_sched() call is actually necessary. What is desired is for all cpus to see the disabling state change. So we transform the existing sequence: for_each_cpu() { ring_buffer_read_start(); } where each ring_buffer_start() call performs a synchronize_sched(), into the following: for_each_cpu() { ring_buffer_read_prepare(); } ring_buffer_read_prepare_sync(); for_each_cpu() { ring_buffer_read_start(); } wherein only the single ring_buffer_read_prepare_sync() call needs to do the synchronize_sched(). The first phase, via ring_buffer_read_prepare(), allocates the 'iter' memory and increments ->record_disabled. In the second phase, ring_buffer_read_prepare_sync() makes sure this ->record_disabled state is visible fully to all cpus. And in the final third phase, the ring_buffer_read_start() calls reset the 'iter' objects allocated in the first phase since we now know that none of the cpus are adding trace entries any more. This makes openning the 'trace' file nearly instantaneous on a sparc64 Niagara2 box with 128 cpus tracing. Signed-off-by: David S. Miller <davem@davemloft.net> LKML-Reference: <20100420.154711.11246950.davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 13:06:35 -04:00
Jiri Olsa	62b915f106	tracing: Add graph output support for irqsoff tracer Add function graph output to irqsoff tracer. The graph output is enabled by setting new 'display-graph' trace option. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1270227683-14631-4-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-27 12:36:53 -04:00
Alessio Igor Bogani	b8bc1389b7	ptrace: Cleanup useless header BKL isn't present anymore into this file thus we can safely remove smp_lock.h inclusion. Signed-off-by: Alessio Igor Bogani <abogani@texware.it> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: James Morris <jmorris@namei.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-04-26 23:42:51 +02:00
Jiri Olsa	d7a8d9e907	tracing: Have graph flags passed in to ouput functions Let the function graph tracer have custom flags passed to its output functions. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1270227683-14631-3-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-26 17:30:18 -04:00
Jiri Olsa	9106b69382	tracing: Add ftrace events for graph tracer Add ftrace events for graph tracer, so the graph output could be shared with other tracers. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1270227683-14631-2-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-26 16:55:08 -04:00
Andreas Schwab	46da276648	kernel/sys.c: fix compat uname machine On ppc64 you get this error: $ setarch ppc -R true setarch: ppc: Unrecognized architecture because uname still reports ppc64 as the machine. So mask off the personality flags when checking for PER_LINUX32. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-04-24 11:31:24 -07:00
Robert Richter	b971f06187	Merge commit 'tip/tracing/core' into oprofile/core Conflicts: drivers/oprofile/cpu_buffer.c Signed-off-by: Robert Richter <robert.richter@amd.com>	2010-04-23 16:47:51 +02:00
Ingo Molnar	77a7f2e94e	Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core	2010-04-23 11:25:31 +02:00
Ingo Molnar	70bce3ba77	Merge branch 'linus' into perf/core Merge reason: merge the latest fixes, update to latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:10:30 +02:00
Suresh Siddha	99bd5e2f24	sched: Fix select_idle_sibling() logic in select_task_rq_fair() Issues in the current select_idle_sibling() logic in select_task_rq_fair() in the context of a task wake-up: a) Once we select the idle sibling, we use that domain (spanning the cpu that the task is currently woken-up and the idle sibling that we found) in our wake_affine() decisions. This domain is completely different from the domain(we are supposed to use) that spans the cpu that the task currently woken-up and the cpu where the task previously ran. b) We do select_idle_sibling() check only for the cpu that the task is currently woken-up on. If select_task_rq_fair() selects the previously run cpu for waking the task, doing a select_idle_sibling() check for that cpu also helps and we don't do this currently. c) In the scenarios where the cpu that the task is woken-up is busy but with its HT siblings are idle, we are selecting the task be woken-up on the idle HT sibling instead of a core that it previously ran and currently completely idle. i.e., we are not taking decisions based on wake_affine() but directly selecting an idle sibling that can cause an imbalance at the SMT/MC level which will be later corrected by the periodic load balancer. Fix this by first going through the load imbalance calculations using wake_affine() and once we make a decision of woken-up cpu vs previously-ran cpu, then choose a possible idle sibling for waking up the task on. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1270079265.7835.8.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:02:02 +02:00
Peter Zijlstra	669c55e9f9	sched: Pre-compute cpumask_weight(sched_domain_span(sd)) Dave reported that his large SPARC machines spend lots of time in hweight64(), try and optimize some of those needless cpumask_weight() invocations (esp. with the large offstack cpumasks these are very expensive indeed). Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:02:02 +02:00
Peter Zijlstra	74f5187ac8	sched: Cure load average vs NO_HZ woes Chase reported that due to us decrementing calc_load_task prematurely (before the next LOAD_FREQ sample), the load average could be scewed by as much as the number of CPUs in the machine. This patch, based on Chase's patch, cures the problem by keeping the delta of the CPU going into NO_HZ idle separately and folding that in on the next LOAD_FREQ update. This restores the balance and we get strict LOAD_FREQ period samples. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1271934490.1776.343.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:02:02 +02:00
Benjamin Herrenschmidt	4b40221048	mutex: Don't spin when the owner CPU is offline or other weird cases Due to recent load-balancer changes that delay the task migration to the next wakeup, the adaptive mutex spinning ends up in a live lock when the owner's CPU gets offlined because the cpu_online() check lives before the owner running check. This patch changes mutex_spin_on_owner() to return 0 (don't spin) in any case where we aren't sure about the owner struct validity or CPU number, and if the said CPU is offline. There is no point going back & re-evaluate spinning in corner cases like that, let's just go to sleep. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1271212509.13059.135.camel@pasglop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-23 11:00:28 +02:00
Jiri Kosina	6c9468e9eb	Merge branch 'master' into for-next	2010-04-23 02:08:44 +02:00
David Howells	e134d200d5	CRED: Fix a race in creds_are_invalid() in credentials debugging creds_are_invalid() reads both cred->usage and cred->subscribers and then compares them to make sure the number of processes subscribed to a cred struct never exceeds the refcount of that cred struct. The problem is that this can cause a race with both copy_creds() and exit_creds() as the two counters, whilst they are of atomic_t type, are only atomic with respect to themselves, and not atomic with respect to each other. This means that if creds_are_invalid() can read the values on one CPU whilst they're being modified on another CPU, and so can observe an evolving state in which the subscribers count now is greater than the usage count a moment before. Switching the order in which the counts are read cannot help, so the thing to do is to remove that particular check. I had considered rechecking the values to see if they're in flux if the test fails, but I can't guarantee they won't appear the same, even if they've changed several times in the meantime. Note that this can only happen if CONFIG_DEBUG_CREDENTIALS is enabled. The problem is only likely to occur with multithreaded programs, and can be tested by the tst-eintr1 program from glibc's "make check". The symptoms look like: CRED: Invalid credentials CRED: At include/linux/cred.h:240 CRED: Specified credentials: ffff88003dda5878 [real][eff] CRED: ->magic=43736564, put_addr=(null) CRED: ->usage=766, subscr=766 CRED: ->uid = { 0,0,0,0 } CRED: ->gid = { 0,0,0,0 } CRED: ->security is ffff88003d72f538 CRED: ->security {359, 359} ------------[ cut here ]------------ kernel BUG at kernel/cred.c:850! ... RIP: 0010:[<ffffffff81049889>] [<ffffffff81049889>] __invalid_creds+0x4e/0x52 ... Call Trace: [<ffffffff8104a37b>] copy_creds+0x6b/0x23f Note the ->usage=766 and subscr=766. The values appear the same because they've been re-read since the check was made. Reported-by: Roland McGrath <roland@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-22 09:14:29 +10:00
Frederic Weisbecker	cecbca96da	tracing: Dump either the oops's cpu source or all cpus buffers The ftrace_dump_on_oops kernel parameter, sysctl and sysrq let one dump every cpu buffers when an oops or panic happens. It's nice when you have few cpus but it may take ages if have many, plus you miss the real origin of the problem in all the cpu traces. Sometimes, all you need is to dump the cpu buffer that triggered the opps, most of the time it is our main interest. This patch modifies ftrace_dump_on_oops to handle this choice. The ftrace_dump_on_oops kernel parameter, when it comes alone, has the same behaviour than before. But ftrace_dump_on_oops=orig_cpu will only dump the buffer of the cpu that oops'ed. Similarly, sysctl kernel.ftrace_dump_on_oops=1 and echo 1 > /proc/sys/kernel/ftrace_dump_on_oops keep their previous behaviour. But setting 2 jumps into cpu origin dump mode. v2: Fix double setup v3: Fix spelling issues reported by Randy Dunlap v4: Also update __ftrace_dump in the selftests Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>	2010-04-21 23:11:42 +02:00
Ingo Molnar	ac0053fd51	Merge commit 'v2.6.34-rc5' into tracing/core Merge reason: pick up latest -rc's. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-21 09:47:05 +02:00
David Howells	eff30363c0	CRED: Fix double free in prepare_usermodehelper_creds() error handling Patch `570b8fb505`: Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Date: Tue Mar 30 00:04:00 2010 +0100 Subject: CRED: Fix memory leak in error handling attempts to fix a memory leak in the error handling by making the offending return statement into a jump down to the bottom of the function where a kfree(tgcred) is inserted. This is, however, incorrect, as it does a kfree() after doing put_cred() if security_prepare_creds() fails. That will result in a double free if 'error' is jumped to as put_cred() will also attempt to free the new tgcred record by virtue of it being pointed to by the new cred record. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-21 09:20:35 +10:00
Zhang, Yanmin	39447b386c	perf: Enhance perf to allow for guest statistic collection from host Below patch introduces perf_guest_info_callbacks and related register/unregister functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and guest user space. Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2010-04-19 12:35:33 +03:00
Paul E. McKenney	bc293d62b2	rcu: Make RCU lockdep check the lockdep_recursion variable The lockdep facility temporarily disables lockdep checking by incrementing the current->lockdep_recursion variable. Such disabling happens in NMIs and in other situations where lockdep might expect to recurse on itself. This patch therefore checks current->lockdep_recursion, disabling RCU lockdep splats when this variable is non-zero. In addition, this patch removes the "likely()", as suggested by Lai Jiangshan. Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Reported-by: David Miller <davem@davemloft.net> Tested-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: eric.dumazet@gmail.com LKML-Reference: <20100415195039.GA22623@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-19 08:37:19 +02:00
Mike Galbraith	09a40af524	sched: Fix UP update_avg() build warning update_avg() is only used for SMP builds, move it to the nearest SMP block. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1271309399.14779.17.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-15 09:36:47 +02:00
Ingo Molnar	b257c14ceb	Merge branch 'linus' into sched/core Merge reason: merge the latest fixes, update to -rc4. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-15 09:36:16 +02:00
Ingo Molnar	b5a80b7e91	Merge branch 'perf' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core	2010-04-15 09:16:51 +02:00
Divyesh Shah	b6ac23af2c	blkio: fix for modular blk-cgroup build After merging the block tree, 20100414's linux-next build (x86_64 allmodconfig) failed like this: ERROR: "get_gendisk" [block/blk-cgroup.ko] undefined! ERROR: "sched_clock" [block/blk-cgroup.ko] undefined! This happens because the two symbols aren't exported and hence not available when blk-cgroup code is built as a module. I've tried to stay consistent with the use of EXPORT_SYMBOL or EXPORT_SYMBOL_GPL with the other symbols in the respective files. Signed-off-by: Divyesh Shah <dpshah@google.com> Acked-by: Gui Jianfeng <guijianfeng@cn.fujitsu.cn> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-04-15 08:54:59 +02:00
Frederic Weisbecker	95476b64ab	perf: Fix hlist related build error hlist helpers need to be available for all software events, not only trace events. Pull them out outside the ifdef CONFIG_EVENT_TRACING section. Fixes: kernel/perf_event.c:4573: error: implicit declaration of function 'swevent_hlist_put' kernel/perf_event.c:4614: error: implicit declaration of function 'swevent_hlist_get' kernel/perf_event.c:5534: error: implicit declaration of function 'swevent_hlist_release Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1271281338-23491-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-15 01:34:46 +02:00
Masami Hiramatsu	93ccae7a22	tracing/kprobes: Support basic types on dynamic events Support basic types of integer (u8, u16, u32, u64, s8, s16, s32, s64) in kprobe tracer. With this patch, users can specify above basic types on each arguments after ':'. If omitted, the argument type is set as unsigned long (u32 or u64, arch-dependent). e.g. echo 'p account_system_time+0 hardirq_offset=%si:s32' > kprobe_events adds a probe recording hardirq_offset in signed-32bits value on the entry of account_system_time. Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100412171708.3790.18599.stgit@localhost6.localdomain6> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-04-14 17:26:28 -03:00
Frederic Weisbecker	df8290bf7e	perf: Make clock software events consistent with general exclusion rules The cpu/task clock events implement their own version of exclusion on top of exclude_user and exclude_kernel. The result is that when the event triggered in the kernel but we have exclude_kernel set, we try to rewind using task_pt_regs. There are two side effects of this: - we call task_pt_regs even on kernel threads, which doesn't give us the desired result. - if the event occured in the kernel, we shouldn't rewind to the user context. We want to actually ignore the event. get_irq_regs() will always give us the right interrupted context, so use its result and submit it to perf_exclude_context() that knows when an event must be ignored. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu>	2010-04-14 18:20:33 +02:00
Frederic Weisbecker	76e1d9047e	perf: Store active software events in a hashlist Each time a software event triggers, we need to walk through the entire list of events from the current cpu and task contexts to retrieve a running perf event that matches. We also need to check a matching perf event is actually counting. This walk is wasteful and makes the event fast path scaling down with a growing number of events running on the same contexts. To solve this, we store the running perf events in a hashlist to get an immediate access to them against their type:event_id when they trigger. v2: - Fix SWEVENT_HLIST_SIZE definition (and re-learn some basic maths along the way) - Only allocate hlist for online cpus, but keep track of the refcount on offline possible cpus too, so that we allocate it if needed when it becomes online. - Drop the kref use as it's not adapted to our tricks anymore. v3: - Fix bad refcount check (address instead of value). Thanks to Eric Dumazet who spotted this. - While exiting cpu, move the hlist release out of the IPI path to lock the hlist mutex sanely. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu>	2010-04-14 18:20:33 +02:00
Ingo Molnar	b15c7b1cee	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core	2010-04-14 12:15:23 +02:00
Dmitry Torokhov	97f5f0cd8c	Input: implement SysRq as a separate input handler Instead of keeping SysRq support inside of legacy keyboard driver split it out into a separate input handler (filter). This stops most SysRq input events from leaking into evdev clients (some events, such as first SysRq scancode - not keycode - event, are still leaked into both legacy keyboard and evdev). [martinez.javier@gmail.com: fix compile error when CONFIG_MAGIC_SYSRQ is not defined] Signed-off-by: Dmitry Torokhov <dtor@mail.ru>	2010-04-13 23:26:02 -07:00
Thomas Gleixner	6932bf37be	genirq: Remove IRQF_DISABLED from core code Remove all code which is related to IRQF_DISABLED from the core kernel code. IRQF_DISABLED still exists as a flag, but becomes a NOOP and will be removed after a grace period. That way we can easily revert to the previous behaviour by just restoring the core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Linus Torvalds <torvalds@osdl.org> LKML-Reference: <20100326000405.991244690@linutronix.de>	2010-04-13 16:36:40 +02:00
Ingo Molnar	e58aa3d2d0	genirq: Run irq handlers with interrupts disabled Running interrupt handlers with interrupts enabled can cause stack overflows. That has been observed with multiqueue NICs delivering all their interrupts to a single core. We might band aid that somehow by checking the interrupt stacks, but the real safe fix is to run the irq handlers with interrupts disabled. Drivers for whacky hardware still can reenable them in the handler itself, if the need arises. (They do already due to lockdep) The risk of doing this is rather low: - lockdep already enforces this - CONFIG_NOHZ has shaken out the drivers which relied on jiffies updates - time keeping is not longer sensitive to the timer interrupt being delayed Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Linus Torvalds <torvalds@osdl.org> LKML-Reference: <20100326000405.758579387@linutronix.de>	2010-04-13 16:36:40 +02:00
Marc Zyngier	ae731f8d07	genirq: Introduce request_any_context_irq() Now that we enjoy threaded interrupts, we're starting to see irq_chip implementations (wm831x, pca953x) that make use of threaded interrupts for the controller, and nested interrupts for the client interrupt. It all works very well, with one drawback: Drivers requesting an IRQ must now know whether the handler will run in a thread context or not, and call request_threaded_irq() or request_irq() accordingly. The problem is that the requesting driver sometimes doesn't know about the nature of the interrupt, specially when the interrupt controller is a discrete chip (typically a GPIO expander connected over I2C) that can be connected to a wide variety of otherwise perfectly supported hardware. This patch introduces the request_any_context_irq() function that mostly mimics the usual request_irq(), except that it checks whether the irq level is configured as nested or not, and calls the right backend. On success, it also returns either IRQC_IS_HARDIRQ or IRQC_IS_NESTED. [ tglx: Made return value an enum, simplified code and made the export of request_any_context_irq GPL ] Signed-off-by: Marc Zyngier <maz@misterjones.org> Cc: <joachim.eastwood@jotron.com> LKML-Reference: <927ea285bd0c68934ddae1a47e44a9ba@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-13 16:36:39 +02:00
Thomas Gleixner	7c7145f6ac	Merge branch 'linus' into irq/core Reason: Get the upstream IRQF_DISABLED related changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-13 14:12:17 +02:00
John Stultz	6a867a3955	time: Remove xtime_cache With the earlier logarithmic time accumulation patch, xtime will now always be within one "tick" of the current time, instead of possibly half a second off. This removes the need for the xtime_cache value, which always stored the time at the last interrupt, so this patch cleans that up removing the xtime_cache related code. This patch also addresses an issue with an earlier version of this change, where xtime_cache was normalizing xtime, which could in some cases be not valid (ie: tv_nsec == NSEC_PER_SEC). This is fixed by handling the edge case in update_wall_time(). Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Petr Titěra <P.Titera@century.cz> LKML-Reference: <1270589451-30773-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-13 12:43:42 +02:00
Eric Paris	05b90496f2	security: remove dead hook acct Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:19 +10:00
Eric Paris	6307f8fee2	security: remove dead hook task_setgroups Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:18 +10:00
Eric Paris	06ad187e28	security: remove dead hook task_setgid Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:17 +10:00
Eric Paris	43ed8c3b45	security: remove dead hook task_setuid Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:16 +10:00
Eric Paris	0968d0060a	security: remove dead hook cred_commit Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-04-12 12:19:15 +10:00
Jiri Slaby	d88d4050dc	PM / Hibernate: user.c, fix SNAPSHOT_SET_SWAP_AREA handling When CONFIG_DEBUG_BLOCK_EXT_DEVT is set we decode the device improperly by old_decode_dev and it results in an error while hibernating with s2disk. All users already pass the new device number, so switch to new_decode_dev(). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>	2010-04-10 22:28:56 +02:00
Arnd Bergmann	5534ecb2dd	ptrace: kill BKL in ptrace syscall The comment suggests that this usage is stale. There is no bkl in the exec path so if there is a race lurking there, the bkl in ptrace is not going to help in this regard. Overview of the possibility of "accidental" races this bkl might protect: - ptrace_traceme() is protected against task removal and concurrent read/write on current->ptrace as it locks write tasklist_lock. - arch_ptrace_attach() is serialized by ptrace_traceme() against concurrent PTRACE_TRACEME or PTRACE_ATTACH - ptrace_attach() is protected the same way ptrace_traceme() and in turn serializes arch_ptrace_attach() - ptrace_check_attach() does its own well described serializing too. There is no obvious race here. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Roland McGrath <roland@redhat.com>	2010-04-10 15:34:21 +02:00
Linus Torvalds	2aedd192f7	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix sched_getaffinity()	2010-04-08 08:37:05 -07:00
Ingo Molnar	ca7e0c6120	Merge branch 'linus' into perf/core Semantic conflict: arch/x86/kernel/cpu/perf_event_intel_ds.c Merge reason: pick up latest fixes, fix the conflict Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-08 13:37:18 +02:00
Ingo Molnar	c1ab9cab75	Merge branch 'linus' into tracing/core Conflicts: include/linux/module.h kernel/module.c Semantic conflict: include/trace/events/module.h Merge reason: Resolve the conflict with upstream commit `5fbfb18` ("Fix up possibly racy module refcounting") Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-08 10:18:47 +02:00
KAMEZAWA Hiroyuki	a3a2e76c77	mm: avoid null-pointer deref in sync_mm_rss() - We weren't zeroing p->rss_stat[] at fork() - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel threads and was oopsing. - Make __sync_task_rss_stat() static, too. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648 [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)] Reported-by: Troels Liebe Bentsen <tlb@rapanden.dk> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> "Michael S. Tsirkin" <mst@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-04-07 08:38:02 -07:00
Linus Torvalds	94c4fcec01	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Force MSI irq handlers to run with interrupts disabled	2010-04-06 13:03:22 -07:00
Carsten Emde	351b3f7a21	hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME The current version of schedule_hrtimeout() always uses the monotonic clock. Some system calls such as mq_timedsend() and mq_timedreceive(), however, require the use of the wall clock due to the definition of the system call. This patch provides the infrastructure to use schedule_hrtimeout() with a CLOCK_REALTIME timer. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Tested-by: Pradyumna Sampath <pradysam@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Veen <arjan@infradead.org> LKML-Reference: <20100402204331.167439615@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-06 21:50:03 +02:00
Arjan van de Ven	3bbb9ec946	timers: Introduce the concept of timer slack for legacy timers While HR timers have had the concept of timer slack for quite some time now, the legacy timers lacked this concept, and had to make do with round_jiffies() and friends. Timer slack is important for power management; grouping timers reduces the number of wakeups which in turn reduces power consumption. This patch introduces timer slack to the legacy timers using the following pieces: * A slack field in the timer struct * An api (set_timer_slack) that callers can use to set explicit timer slack * A default slack of 0.4% of the requested delay for callers that do not set any explicit slack * Rounding code that is part of mod_timer() that tries to group timers around jiffies values every 'power of two' (so quick timers will group around every 2, but longer timers will group around every 4, 8, 16, 32 etc) Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: johnstul@us.ibm.com Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-04-06 21:50:02 +02:00
Anton Blanchard	84fba5ec91	sched: Fix sched_getaffinity() taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with the following error: sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument) This box has 128 threads and 16 bytes is enough to cover it. Commit `cd3d8031eb` (sched: sched_getaffinity(): Allow less than NR_CPUS length) is comparing this 16 bytes agains nr_cpu_ids. Fix it by comparing nr_cpu_ids to the number of bits in the cpumask we pass in. Signed-off-by: Anton Blanchard <anton@samba.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Sharyathi Nagesh <sharyath@in.ibm.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Mike Travis <travis@sgi.com> LKML-Reference: <20100406070218.GM5594@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-06 10:01:35 +02:00
Nick Piggin	5fbfb18d7a	Fix up possibly racy module refcounting Module refcounting is implemented with a per-cpu counter for speed. However there is a race when tallying the counter where a reference may be taken by one CPU and released by another. Reference count summation may then see the decrement without having seen the previous increment, leading to lower than expected count. A module which never has its actual reference drop below 1 may return a reference count of 0 due to this race. Module removal generally runs under stop_machine, which prevents this race causing bugs due to removal of in-use modules. However there are other real bugs in module.c code and driver code (module_refcount is exported) where the callers do not run under stop_machine. Fix this by maintaining running per-cpu counters for the number of module refcount increments and the number of refcount decrements. The increments are tallied after the decrements, so any decrement seen will always have its corresponding increment counted. The final refcount is the difference of the total increments and decrements, preventing a low-refcount from being returned. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-04-05 19:50:02 -07:00
Frederic Weisbecker	bd6d29c25b	lockstat: Make lockstat counting per cpu Locking statistics are implemented using global atomic variables. This is usually fine unless some path write them very often. This is the case for the function and function graph tracers that disable irqs for each entry saved (except if the function tracer is in preempt disabled only mode). And calls to local_irq_save/restore() increment hardirqs_on_events and hardirqs_off_events stats (or similar stats for redundant versions). Incrementing these global vars for each function ends up in too much cache bouncing if lockstats are enabled. To solve this, implement the debug_atomic_*() operations using per cpu vars. -v2: Use per_cpu() instead of get_cpu_var() to fetch the desired cpu vars on debug_atomic_read() -v3: Store the stats in a structure. No need for local_t as we are NMI/irq safe. -v4: Fix tons of build errors. I thought I had tested it but I probably forgot to select the relevant config. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1270505417-8144-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-04-06 00:15:37 +02:00
Eric Paris	449cedf099	audit: preface audit printk with audit There have been a number of reports of people seeing the message: "name_count maxed, losing inode data: dev=00:05, inode=3185" in dmesg. These usually lead to people reporting problems to the filesystem group who are in turn clueless what they mean. Eventually someone finds me and I explain what is going on and that these come from the audit system. The basics of the problem is that the audit subsystem never expects a single syscall to 'interact' (for some wish washy meaning of interact) with more than 20 inodes. But in fact some operations like loading kernel modules can cause changes to lots of inodes in debugfs. There are a couple real fixes being bandied about including removing the fixed compile time limit of 20 or not auditing changes in debugfs (or both) but neither are small and obvious so I am not sending them for immediate inclusion (I hope Al forwards a real solution next devel window). In the meantime this patch simply adds 'audit' to the beginning of the crap message so if a user sees it, they come blame me first and we can talk about what it means and make sure we understand all of the reasons it can happen and make sure this gets solved correctly in the long run. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-04-05 13:19:45 -07:00
Linus Torvalds	b66696e3c0	Merge branch 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc * 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: eeepc-wmi: include slab.h staging/otus: include slab.h from usbdrv.h percpu: don't implicitly include slab.h from percpu.h kmemcheck: Fix build errors due to missing slab.h include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h iwlwifi: don't include iwl-dev.h from iwl-devtrace.h x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h Fix up trivial conflicts in include/linux/percpu.h due to is_kernel_percpu_address() having been introduced since the slab.h cleanup with the percpu_up.c splitup.	2010-04-05 09:39:11 -07:00
Linus Torvalds	9e74e7c81a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: module: add stub for is_module_percpu_address percpu, module: implement and use is_kernel/module_percpu_address() module: encapsulate percpu handling better and record percpu_size	2010-04-05 09:16:37 -07:00
Lai Jiangshan	aa27497c2f	tracing: Fix uninitialized variable of tracing/trace output Because a local variable is not initialized, I got these when I did 'cat tracing/trace'. (not trace_pipe): CPU:0 [LOST 18446744071579453134 EVENTS] ps-3099 [000] 560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock CPU:0 [LOST 18446744071579453134 EVENTS] ps-3099 [000] 560.770221: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock CPU:0 [LOST 18446612133255294080 EVENTS] ps-3099 [000] 560.770221: lock_acquire: ffff880030865010 &(&dentry->d_lock)->rlock CPU:0 [LOST 18446744071579453134 EVENTS] ps-3099 [000] 560.770222: lock_release: ffff880030865010 &(&dentry->d_lock)->rlock CPU:0 [LOST 18446744071579453134 EVENTS] ps-3099 [000] 560.770222: lock_release: ffffffff816cfb98 dcache_lock See peek_next_entry(), it does not set *lost_events when we 'cat tracing/trace' Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4BB9A929.2000303@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-05 11:01:22 -04:00
Tejun Heo	336f5899d2	Merge branch 'master' into export-slabh	2010-04-05 11:37:28 +09:00
Linus Torvalds	8ce42c8b7f	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Always build the powerpc perf_arch_fetch_caller_regs version perf: Always build the stub perf_arch_fetch_caller_regs version perf, probe-finder: Build fix on Debian perf/scripts: Tuple was set from long in both branches in python_process_event() perf: Fix 'perf sched record' deadlock perf, x86: Fix callgraphs of 32-bit processes on 64-bit kernels perf, x86: Fix AMD hotplug & constraint initialization x86: Move notify_cpu_starting() callback to a later stage x86,kgdb: Always initialize the hw breakpoint attribute perf: Use hot regs with software sched switch/migrate events perf: Correctly align perf event tracing buffer	2010-04-04 12:13:10 -07:00
Linus Torvalds	0121b0c771	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock sched: Fix proc_sched_set_task()	2010-04-04 12:12:31 -07:00
Linus Torvalds	a8941b0ed0	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ring-buffer: Add missing unlock tracing: Fix lockdep warning in global_clock()	2010-04-04 12:12:19 -07:00
Arnd Bergmann	3326c1ceee	perf_event: Make perf fd non seekable Perf_event does not need seeking, so prevent it in order to get rid of default_llseek, which uses the BKL. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> [drop the nonseekable_open, not needed for anon inodes] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-04-04 15:27:58 +02:00
Ingo Molnar	22a4e4c435	Merge branch 'perf/urgent' into perf/core Conflicts: tools/perf/Makefile Merge reason: resolve the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-03 18:17:55 +02:00
Frederic Weisbecker	26d80aa782	perf: Always build the stub perf_arch_fetch_caller_regs version Now that software events use perf_arch_fetch_caller_regs() too, we need the stub version to be always built in for archs that don't implement it. Fixes the following build error in PARISC: kernel/built-in.o: In function `perf_event_task_sched_out': (.text.perf_event_task_sched_out+0x54): undefined reference to `perf_arch_fetch_caller_regs' Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org>	2010-04-03 12:22:05 +02:00
Linus Torvalds	5e123e5d9b	Merge branch 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb: Turn off tracing while in the debugger kgdb: use atomic_inc and atomic_dec instead of atomic_set kgdb: eliminate kgdb_wait(), all cpus enter the same way kgdbts,sh: Add in breakpoint pc offset for superh kgdb: have ebin2mem call probe_kernel_write once	2010-04-02 19:45:05 -07:00
Linus Torvalds	24b99d1576	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: Freezer: Fix buggy resume test for tasks frozen with cgroup freezer Freezer: Only show the state of tasks refusing to freeze	2010-04-02 19:44:42 -07:00
Jason Wessel	4da75b9cea	kgdb: Turn off tracing while in the debugger The kernel debugger should turn off kernel tracing any time the debugger is active and restore it on resume. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org>	2010-04-02 14:58:19 -05:00
Jason Wessel	ae6bf53e02	kgdb: use atomic_inc and atomic_dec instead of atomic_set Memory barriers should be used for the kgdb cpu synchronization. The atomic_set() does not imply a memory barrier. Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-04-02 14:58:18 -05:00
Jason Wessel	62fae31219	kgdb: eliminate kgdb_wait(), all cpus enter the same way This is a kgdb architectural change to have all the cpus (master or slave) enter the same function. A cpu that hits an exception (wants to be the master cpu) will call kgdb_handle_exception() from the trap handler and then invoke a kgdb_roundup_cpu() to synchronize the other cpus and bring them into the kgdb_handle_exception() as well. A slave cpu will enter kgdb_handle_exception() from the kgdb_nmicallback() and set the exception state to note that the processor is a slave. Previously the salve cpu would have called kgdb_wait(). This change allows the debug core to change cpus without resuming the system in order to inspect arch specific cpu information. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-04-02 14:58:18 -05:00
Jason Wessel	a0279bd580	kgdb: have ebin2mem call probe_kernel_write once Rather than call probe_kernel_write() one byte at a time, process the whole buffer locally and pass the entire result in one go. This way, architectures that need to do special handling based on the length can do so, or we only end up calling memcpy() once. [sonic.zhang@analog.com: Reported original problem and preliminary patch] Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Sonic Zhang <sonic.zhang@analog.com> Signed-off-by: Mike Frysinger <vapier@gentoo.org>	2010-04-02 14:58:17 -05:00
Peter Zijlstra	371fd7e7a5	sched: Add enqueue/dequeue flags In order to reduce the dependency on TASK_WAKING rework the enqueue interface to support a proper flags field. Replace the int wakeup, bool head arguments with an int flags argument and create the following flags: ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task, ENQUEUE_WAKING - the enqueue has relative vruntime due to having sched_class::task_waking() called, ENQUEUE_HEAD - the waking task should be places on the head of the priority queue (where appropriate). For symmetry also convert sched_class::dequeue() to a flags scheme. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:05 +02:00
Peter Zijlstra	cc87f76a60	sched: Fix nr_uninterruptible count The cpuload calculation in calc_load_account_active() assumes rq->nr_uninterruptible will not change on an offline cpu after migrate_nr_uninterruptible(). However the recent migrate on wakeup changes broke that and would result in decrementing the offline cpu's rq->nr_uninterruptible. Fix this by accounting the nr_uninterruptible on the waking cpu. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:04 +02:00
Peter Zijlstra	65cc8e4859	sched: Optimize task_rq_lock() Now that we hold the rq->lock over set_task_cpu() again, we can do away with most of the TASK_WAKING checks and reduce them again to set_cpus_allowed_ptr(). Removes some conditionals from scheduling hot-paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:04 +02:00
Peter Zijlstra	0017d73509	sched: Fix TASK_WAKING vs fork deadlock Oleg noticed a few races with the TASK_WAKING usage on fork. - since TASK_WAKING is basically a spinlock, it should be IRQ safe - since we set TASK_WAKING () without holding rq->lock it could be there still is a rq->lock holder, thereby not actually providing full serialization. () in fact we clear PF_STARTING, which in effect enables TASK_WAKING. Cure the second issue by not setting TASK_WAKING in sched_fork(), but only temporarily in wake_up_new_task() while calling select_task_rq(). Cure the first by holding rq->lock around the select_task_rq() call, this will disable IRQs, this however requires that we push down the rq->lock release into select_task_rq_fair()'s cgroup stuff. Because select_task_rq_fair() still needs to drop the rq->lock we cannot fully get rid of TASK_WAKING. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:03 +02:00
Oleg Nesterov	9084bb8246	sched: Make select_fallback_rq() cpuset friendly Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems with select_fallback_rq(). It can be called from any context and can't use any cpuset locks including task_lock(). It is called when the task doesn't have online cpus in ->cpus_allowed but ttwu/etc must be able to find a suitable cpu. I am not proud of this patch. Everything which needs such a fat comment can't be good even if correct. But I'd prefer to not change the locking rules in the code I hardly understand, and in any case I believe this simple change make the code much more correct compared to deadlocks we currently have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091027.GA9155@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:03 +02:00
Oleg Nesterov	6a1bdc1b57	sched: _cpu_down(): Don't play with current->cpus_allowed _cpu_down() changes the current task's affinity and then recovers it at the end. The problems are well known: we can't restore old_allowed if it was bound to the now-dead-cpu, and we can race with the userspace which can change cpu-affinity during unplug. _cpu_down() should not play with current->cpus_allowed at all. Instead, take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable() removes the dying cpu from cpu_online_mask. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091023.GA9148@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:03 +02:00
Oleg Nesterov	30da688ef6	sched: sched_exec(): Remove the select_fallback_rq() logic sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless. This can race with other CPUs updating our ->cpus_allowed, and this looks meaningless to me. The task is current and running, it must have online cpus in ->cpus_allowed, the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu, this likely means we raced with set_cpus_allowed() which was called for reason, why should sched_exec() retry and call ->select_task_rq() again? Change the code to call sched_class->select_task_rq() directly and do nothing if the returned cpu is wrong after re-checking under rq->lock. From now task_struct->cpus_allowed is always stable under TASK_WAKING, select_fallback_rq() is always called under rq-lock or the caller or the caller owns TASK_WAKING (select_task_rq). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091019.GA9141@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:02 +02:00
Oleg Nesterov	c1804d547d	sched: move_task_off_dead_cpu(): Remove retry logic The previous patch preserved the retry logic, but it looks unneeded. __migrate_task() can only fail if we raced with migration after we dropped the lock, but in this case the caller of set_cpus_allowed/etc must initiate migration itself if ->on_rq == T. We already fixed p->cpus_allowed, the changes in active/online masks must be visible to racer, it should migrate the task to online cpu correctly. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091014.GA9138@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:02 +02:00
Oleg Nesterov	1445c08d06	sched: move_task_off_dead_cpu(): Take rq->lock around select_fallback_rq() move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed lockless. We can race with set_cpus_allowed() running in parallel. Change it to take rq->lock around select_fallback_rq(). Note that it is not trivial to move this spin_lock() into select_fallback_rq(), we must recheck the task was not migrated after we take the lock and other callers do not need this lock. To avoid the races with other callers of select_fallback_rq() which rely on TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise. The owner of TASK_WAKING must update ->cpus_allowed and choose the correct CPU anyway, and the subsequent __migrate_task() is just meaningless because p->se.on_rq must be false. Alternatively, we could change select_task_rq() to take rq->lock right after it calls sched_class->select_task_rq(), but this looks a bit ugly. Also, change it to not assume irqs are disabled and absorb __migrate_task_irq(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091010.GA9131@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:01 +02:00
Oleg Nesterov	897f0b3c3f	sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code This patch just states the fact the cpusets/cpuhotplug interaction is broken and removes the deadlockable code which only pretends to work. - cpuset_lock() doesn't really work. It is needed for cpuset_cpus_allowed_locked() but we can't take this lock in try_to_wake_up()->select_fallback_rq() path. - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take cpuset_lock() and hangs forever because CPU is already dead and thus T can't be scheduled. - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock() which is not irq-safe, but try_to_wake_up() can be called from irq. Kill them, and change select_fallback_rq() to use cpu_possible_mask, like we currently do without CONFIG_CPUSETS. Also, with or without this patch, with or without CONFIG_CPUSETS, the callers of select_fallback_rq() can race with each other or with set_cpus_allowed() pathes. The subsequent patches try to to fix these problems. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100315091003.GA9123@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:01 +02:00
Li Zefan	32bd7eb5a7	sched: Remove remaining USER_SCHED code This is left over from commit `7c9414385e` ("sched: Remove USER_SCHED"") Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Howells <dhowells@redhat.com> LKML-Reference: <4BA9A05F.7010407@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:12:00 +02:00
Oleg Nesterov	47a70985e5	sched: set_cpus_allowed_ptr(): Don't use rq->migration_thread after unlock Trivial typo fix. rq->migration_thread can be NULL after task_rq_unlock(), this is why we have "mt" which should be used instead. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100330165829.GA18284@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:11:05 +02:00
Mike Galbraith	269484a492	sched: Fix proc_sched_set_task() Latencytop clearing sum_exec_runtime via proc_sched_set_task() breaks task_times(). Other places in kernel use nvcsw and nivcsw, which are being cleared as well, Clear task statistics only. Reported-by: Török Edwin <edwintorok@gmail.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1269940193.19286.14.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:06:40 +02:00
Ingo Molnar	c9494727cf	Merge branch 'linus' into sched/core Merge reason: update to latest upstream Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 20:03:08 +02:00
Ingo Molnar	ec5e61aabe	Merge branch 'perf/urgent' into perf/core Conflicts: arch/x86/kernel/cpu/perf_event.c Merge reason: Resolve the conflict, pick up fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 19:38:10 +02:00
Mike Galbraith	8bb39f9aa0	perf: Fix 'perf sched record' deadlock perf sched record can deadlock a box should the holder of handle->data->lock take an interrupt, and then attempt to acquire an rq lock held by a CPU trying to acquire the same lock. Disable interrupts. CPU0 CPU1 sched event with rq->lock held grab handle->data->lock spin on handle->data->lock interrupt try to grab rq->lock Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1269598293.6174.8.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-04-02 19:30:05 +02:00
Ingo Molnar	50d11d190a	Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/urgent	2010-04-02 19:29:17 +02:00
Frederic Weisbecker	e49a5bd381	perf: Use hot regs with software sched switch/migrate events Scheduler's task migration events don't work because they always pass NULL regs perf_sw_event(). The event hence gets filtered in perf_swevent_add(). Scheduler's context switches events use task_pt_regs() to get the context when the event occured which is a wrong thing to do as this won't give us the place in the kernel where we went to sleep but the place where we left userspace. The result is even more wrong if we switch from a kernel thread. Use the hot regs snapshot for both events as they belong to the non-interrupt/exception based events family. Unlike page faults or so that provide the regs matching the exact origin of the event, we need to save the current context. This makes the task migration event working and fix the context switch callchains and origin ip. Example: perf record -a -e cs Before: 10.91% ksoftirqd/0 0 [k] 0000000000000000 \| --- (nil) perf_callchain perf_prepare_sample __perf_event_overflow perf_swevent_overflow perf_swevent_add perf_swevent_ctx_event do_perf_sw_event __perf_sw_event perf_event_task_sched_out schedule run_ksoftirqd kthread kernel_thread_helper After: 23.77% hald-addon-stor [kernel.kallsyms] [k] schedule \| --- schedule \| \|--60.00%-- schedule_timeout \| wait_for_common \| wait_for_completion \| blk_execute_rq \| scsi_execute \| scsi_execute_req \| sr_test_unit_ready \| \| \| \|--66.67%-- sr_media_change \| \| media_changed \| \| cdrom_media_changed \| \| sr_block_media_changed \| \| check_disk_change \| \| cdrom_open v2: Always build perf_arch_fetch_caller_regs() now that software events need that too. They don't need it from modules, unlike trace events, so we keep the EXPORT_SYMBOL in trace_event_perf.c Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net>	2010-04-01 08:26:31 +02:00
Frederic Weisbecker	eb1e79611c	perf: Correctly align perf event tracing buffer The trace event buffer used by perf to record raw sample events is typed as an array of char and may then not be aligned to 8 by alloc_percpu(). But we need it to be aligned to 8 in sparc64 because we cast this buffer into a random structure type built by the TRACE_EVENT() macro to store the traces. So if a random 64 bits field is accessed inside, it may be not under an expected good alignment. Use an array of long instead to force the appropriate alignment, and perform a compile time check to ensure the size in byte of the buffer is a multiple of sizeof(long) so that its actual size doesn't get shrinked under us. This fixes unaligned accesses reported while using perf lock in sparc 64. Suggested-by: David Miller <davem@davemloft.net> Suggested-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Steven Rostedt <rostedt@goodmis.org>	2010-04-01 08:26:30 +02:00
Steven Rostedt	ff0ff84a07	ring-buffer: Add lost event count to end of sub buffer Currently, binary readers of the ring buffer only know where events were lost, but not how many events were lost at that location. This information is available, but it would require adding another field to the sub buffer header to include it. But when a event can not fit at the end of a sub buffer, it is written to the next sub buffer. This means there is a good chance that the buffer may have room to hold this counter. If it does, write the counter at the end of the sub buffer and set another flag in the data size field that states that this information exists. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:57:08 -04:00
Steven Rostedt	bc21b47842	tracing: Show the lost events in the trace_pipe output Now that the ring buffer can keep track of where events are lost. Use this information to the output of trace_pipe: hackbench-3588 [001] 1326.701660: lock_acquire: ffffffff816591e0 read rcu_read_lock hackbench-3588 [001] 1326.701661: lock_acquire: ffff88003f4091f0 &(&dentry->d_lock)->rlock hackbench-3588 [001] 1326.701664: lock_release: ffff88003f4091f0 &(&dentry->d_lock)->rlock CPU:1 [LOST 673 EVENTS] hackbench-3588 [001] 1326.702711: kmem_cache_free: call_site=ffffffff81102b85 ptr=ffff880026d96738 hackbench-3588 [001] 1326.702712: lock_release: ffff88003e1480a8 &mm->mmap_sem hackbench-3588 [001] 1326.702713: lock_acquire: ffff88003e1480a8 &mm->mmap_sem Even works with the function graph tracer: 2) ! 170.098 us \| } 2) 4.036 us \| rcu_irq_exit(); 2) 3.657 us \| idle_cpu(); 2) ! 190.301 us \| } CPU:2 [LOST 2196 EVENTS] 2) 0.853 us \| } /* cancel_dirty_page */ 2) \| remove_from_page_cache() { 2) 1.578 us \| _raw_spin_lock_irq(); 2) \| __remove_from_page_cache() { Note, it does not work with the iterator "trace" file, since it requires the use of consuming the page from the ring buffer to determine how many events were lost, which the iterator does not do. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:57:06 -04:00
Steven Rostedt	66a8cb95ed	ring-buffer: Add place holder recording of dropped events Currently, when the ring buffer drops events, it does not record the fact that it did so. It does inform the writer that the event was dropped by returning a NULL event, but it does not put in any place holder where the event was dropped. This is not a trivial thing to add because the ring buffer mostly runs in overwrite (flight recorder) mode. That is, when the ring buffer is full, new data will overwrite old data. In a produce/consumer mode, where new data is simply dropped when the ring buffer is full, it is trivial to add the placeholder for dropped events. When there's more room to write new data, then a special event can be added to notify the reader about the dropped events. But in overwrite mode, any new write can overwrite events. A place holder can not be inserted into the ring buffer since there never may be room. A reader could also come in at anytime and miss the placeholder. Luckily, the way the ring buffer works, the read side can find out if events were lost or not, and how many events. Everytime a write takes place, if it overwrites the header page (the next read) it updates a "overrun" variable that keeps track of the number of lost events. When a reader swaps out a page from the ring buffer, it can record this number, perfom the swap, and then check to see if the number changed, and take the diff if it has, which would be the number of events dropped. This can be stored by the reader and returned to callers of the reader. Since the reader page swap will fail if the writer moved the head page since the time the reader page set up the swap, this gives room to record the overruns without worrying about races. If the reader sets up the pages, records the overrun, than performs the swap, if the swap succeeds, then the overrun variable has not been updated since the setup before the swap. For binary readers of the ring buffer, a flag is set in the header of each sub page (sub buffer) of the ring buffer. This flag is embedded in the size field of the data on the sub buffer, in the 31st bit (the size can be 32 or 64 bits depending on the architecture), but only 27 bits needs to be used for the actual size (less actually). We could add a new field in the sub buffer header to also record the number of events dropped since the last read, but this will change the format of the binary ring buffer a bit too much. Perhaps this change can be made if the information on the number of events dropped is considered important enough. Note, the notification of dropped events is only used by consuming reads or peeking at the ring buffer. Iterating over the ring buffer does not keep this information because the necessary data is only available when a page swap is made, and the iterator does not swap out pages. Cc: Robert Richter <robert.richter@amd.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "Luis Claudio R. Goncalves" <lclaudio@uudg.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:57:04 -04:00
Steven Rostedt	eb0c53771f	tracing: Fix compile error in module tracepoints when MODULE_UNLOAD not set If modules are configured in the build but unloading of modules is not, then the refcnt is not defined. Place the get/put module tracepoints under CONFIG_MODULE_UNLOAD since it references this field in the module structure. As a side-effect, this patch also reduces the code when MODULE_UNLOAD is not set, because these unused tracepoints are not created. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:56:59 -04:00
Li Zefan	ae832d1e03	tracing: Remove side effect from module tracepoints that caused a GPF Remove the @refcnt argument, because it has side-effects, and arguments with side-effects are not skipped by the jump over disabled instrumentation and are executed even when the tracepoint is disabled. This was also causing a GPF as found by Randy Dunlap: Subject: 2.6.33 GP fault only when built with tracing LKML-Reference: <4BA2B69D.3000309@oracle.com> Note, the current 2.6.34-rc has a fix for the actual cause of the GPF, but this fixes one of its triggers. Tested-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BA97FA7.6040406@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-31 22:56:58 -04:00
Thomas Gleixner	753649dbc4	genirq: Force MSI irq handlers to run with interrupts disabled Network folks reported that directing all MSI-X vectors of their multi queue NICs to a single core can cause interrupt stack overflows when enough interrupts fire at the same time. This is caused by the fact that we run interrupt handlers by default with interrupts enabled unless the driver reuqests the interrupt with the IRQF_DISABLED set. The NIC handlers do not set this flag, so simultaneous interrupts can nest unlimited and cause the stack overflow. The only safe counter measure is to run the interrupt handlers with interrupts disabled. We can't switch to this mode in general right now, but it is safe to do so for MSI interrupts. Force IRQF_DISABLED for MSI interrupt handlers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Linus Torvalds <torvalds@osdl.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: stable@kernel.org	2010-03-31 15:48:38 +02:00
Linus Torvalds	246750ffa1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: CRED: Fix memory leak in error handling	2010-03-30 07:26:30 -07:00
Linus Torvalds	be3fd3cc7c	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Do not free zero sized per cpu areas x86: Make sure free_init_pages() frees pages on page boundary x86: Make smp_locks end with page alignment	2010-03-30 07:22:38 -07:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Mathieu Desnoyers	570b8fb505	CRED: Fix memory leak in error handling Fix a memory leak on an OOM condition in prepare_usermodehelper_creds(). Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-03-30 17:15:38 +11:00
Julia Lawall	292f60c0c4	ring-buffer: Add missing unlock In some error handling cases the lock is not unlocked. The return is converted to a goto, to share the unlock at the end of the function. A simplified version of the semantic patch that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @r exists@ expression E1; identifier f; @@ f (...) { <+... * spin_lock_irq (E1,...); ... when != E1 * return ...; ...+> } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> LKML-Reference: <Pine.LNX.4.64.1003291736440.21896@ask.diku.dk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-29 15:23:24 -04:00
Li Zefan	e36673ec51	tracing: Fix lockdep warning in global_clock() # echo 1 > events/enable # echo global > trace_clock ------------[ cut here ]------------ WARNING: at kernel/lockdep.c:3162 check_flags+0xb2/0x190() ... ---[ end trace 3f86734a89416623 ]--- possible reason: unannotated irqs-on. ... There's no reason to use the raw_local_irq_save() in trace_clock_global. The local_irq_save() version is fine, and does not cause the bug in lockdep. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4BA97FA1.7030606@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-29 15:16:44 -04:00
Ian Campbell	eed63519e3	x86: Do not free zero sized per cpu areas This avoids an infinite loop in free_early_partial(). Add a warning to free_early_partial() to catch future problems. -v5: put back start > end back into WARN_ONCE() -v6: use one line for warning, suggested by Linus -v7: more tests -v8: remove the function name as suggested by Johannes WARN_ONCE() will print out that function name. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Joel Becker <joel.becker@oracle.com> Tested-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Miller <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <1269830604-26214-4-git-send-email-yinghai@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-29 18:55:40 +02:00
David Howells	a53f4f9efa	SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all instances. Change the remaining instances. This makes the debugfs file display the time mark and the owner's description again. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-29 09:14:47 -07:00
Dave Airlie	88be12c440	slow-work: use get_ref wrapper instead of directly calling get_ref Otherwise we can get an oops if the user has no get_ref/put_ref requirement. Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-29 09:13:30 -07:00
Tejun Heo	10fad5e46f	percpu, module: implement and use is_kernel/module_percpu_address() lockdep has custom code to check whether a pointer belongs to static percpu area which is somewhat broken. Implement proper is_kernel/module_percpu_address() and replace the custom code. On UP, percpu variables are regular static variables and can't be distinguished from them. Always return %false on UP. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@redhat.com>	2010-03-29 23:07:12 +09:00
Tejun Heo	259354deaa	module: encapsulate percpu handling better and record percpu_size Better encapsulate module static percpu area handling so that code outsidef of CONFIG_SMP ifdef doesn't deal with mod->percpu directly and add mod->percpu_size and record percpu_size in it. Both percpu fields are compiled out on UP. While at it, mark mod->percpu w/ __percpu. This is to prepare for is_module_percpu_address(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au>	2010-03-29 23:07:12 +09:00
Henrik Kretzschmar	975d260355	padata: Section cleanup This patch removes the __cupinit from padata_cpu_callback(), which is refered by the exportet function padata_alloc(). This could lead to problems if CONFIG_HOTPLUG_CPU is disabled, which should happen very often. WARNING: kernel/built-in.o(.text+0x7ffcb): Section mismatch in reference from the function padata_alloc() to the function .cpuinit.text:padata_cpu_callback() The function padata_alloc() references the function __cpuinit padata_cpu_callback(). This is often because padata_alloc lacks a __cpuinit annotation or the annotation of padata_cpu_callback is wrong. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-03-29 16:15:31 +08:00
Linus Torvalds	b72c40949b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: x86/PCI: truncate _CRS windows with _LEN > _MAX - _MIN + 1 x86/PCI: for host bridge address space collisions, show conflicting resource frv/PCI: remove redundant warnings x86/PCI: remove redundant warnings PCI: don't say we claimed a resource if we failed PCI quirk: Disable MSI on VIA K8T890 systems PCI quirk: RS780/RS880: work around missing MSI initialization PCI quirk: only apply CX700 PCI bus parking quirk if external VT6212L is present PCI: complain about devices that seem to be broken PCI: print resources consistently with %pR PCI: make disabled window printk style match the enabled ones PCI: break out primary/secondary/subordinate for readability PCI: for address space collisions, show conflicting resource resources: add interfaces that return conflict information PCI: cleanup error return for pcix get and set mmrbc functions PCI: fix access of PCI_X_CMD by pcix get and set mmrbc functions PCI: kill off pci_register_set_vga_state() symbol export. PCI: fix return value from pcix_get_max_mmrbc()	2010-03-26 16:34:29 -07:00
Matt Helsley	5a7aadfe2f	Freezer: Fix buggy resume test for tasks frozen with cgroup freezer When the cgroup freezer is used to freeze tasks we do not want to thaw those tasks during resume. Currently we test the cgroup freezer state of the resuming tasks to see if the cgroup is FROZEN. If so then we don't thaw the task. However, the FREEZING state also indicates that the task should remain frozen. This also avoids a problem pointed out by Oren Ladaan: the freezer state transition from FREEZING to FROZEN is updated lazily when userspace reads or writes the freezer.state file in the cgroup filesystem. This means that resume will thaw tasks in cgroups which should be in the FROZEN state if there is no read/write of the freezer.state file to trigger this transition before suspend. NOTE: Another "simple" solution would be to always update the cgroup freezer state during resume. However it's a bad choice for several reasons: Updating the cgroup freezer state is somewhat expensive because it requires walking all the tasks in the cgroup and checking if they are each frozen. Worse, this could easily make resume run in N^2 time where N is the number of tasks in the cgroup. Finally, updating the freezer state from this code path requires trickier locking because of the way locks must be ordered. Instead of updating the freezer state we rely on the fact that lazy updates only manage the transition from FREEZING to FROZEN. We know that a cgroup with the FREEZING state may actually be FROZEN so test for that state too. This makes sense in the resume path even for partially-frozen cgroups -- those that really are FREEZING but not FROZEN. Reported-by: Oren Ladaan <orenl@cs.columbia.edu> Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Cc: stable@kernel.org Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-03-26 23:51:44 +01:00
Xiaotian Feng	4f598458ea	Freezer: Only show the state of tasks refusing to freeze show_state will dump all tasks state, so if freezer failed to freeze any task, kernel will dump all tasks state and flood the dmesg log. This patch makes freezer only show state of tasks refusing to freeze. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-03-26 23:51:13 +01:00
Linus Torvalds	054319b5e2	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: time: Fix accumulation bug triggered by long delay. posix-cpu-timers: Reset expire cache when no timer is running timer stats: Fix del_timer_sync() and try_to_del_timer_sync() clockevents: Sanitize min_delta_ns adjustment and prevent overflows	2010-03-26 15:10:38 -07:00
Linus Torvalds	833961d81f	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align	2010-03-26 15:10:13 -07:00
Linus Torvalds	3cacf42462	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Use proper type in sched_getaffinity() kernel/sched.c: Suppress unused var warning sched: sched_getaffinity(): Allow less than NR_CPUS length	2010-03-26 15:09:59 -07:00
Linus Torvalds	309d1dcb5b	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Move two IRQ functions from .init.text to .text genirq: Protect access to irq_desc->action in can_request_irq() genirq: Prevent oneshot irq thread race	2010-03-26 15:09:06 -07:00
Linus Torvalds	8128f55a0b	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Remove excessive early_res debug output softlockup: Stop spurious softlockup messages due to overflow rcu: Fix local_irq_disable() CONFIG_PROVE_RCU=y false positives rcu: Fix tracepoints & lockdep false positive rcu: Make rcu_read_lock_bh_held() allow for disabled BH	2010-03-26 15:08:31 -07:00
Peter Zijlstra	faa4602e47	x86, perf, bts, mm: Delete the never used BTS-ptrace code Support for the PMU's BTS features has been upstreamed in v2.6.32, but we still have the old and disabled ptrace-BTS, as Linus noticed it not so long ago. It's buggy: TIF_DEBUGCTLMSR is trampling all over that MSR without regard for other uses (perf) and doesn't provide the flexibility needed for perf either. Its users are ptrace-block-step and ptrace-bts, since ptrace-bts was never used and ptrace-block-step can be implemented using a much simpler approach. So axe all 3000 lines of it. That includes the locked_memory() APIs in mm/mlock.c as well. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Markus Metzger <markus.t.metzger@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <20100325135413.938004390@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-26 11:33:55 +01:00
Miao Xie	53feb29767	cpuset: alloc nodemask_t on the heap rather than the stack Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-24 16:31:21 -07:00
Miao Xie	5ab116c934	cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node cpuset_mem_spread_node() returns an offline node, and causes an oops. This patch fixes it by initializing task->mems_allowed to node_states[N_HIGH_MEMORY], and updating task->mems_allowed when doing memory hotplug. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Reported-by: Nick Piggin <npiggin@suse.de> Tested-by: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-24 16:31:21 -07:00
Li Zefan	9d34706f42	cgroups: remove duplicate include commit `e6a1105b` ("cgroups: subsystem module loading interface") and commit `c50cc752` ("sched, cgroups: Fix module export") result in duplicate including of module.h Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-24 16:31:19 -07:00
Henrik Kretzschmar	860652bfb8	genirq: Move two IRQ functions from .init.text to .text Both functions should not be marked as __init, since they be called from modules after the init section is freed. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jkosina@suse.cz> LKML-Reference: <1269431961-5731-1-git-send-email-henne@nachtwindheim.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-24 14:38:23 +01:00
Thomas Gleixner	cc8c3b7843	genirq: Protect access to irq_desc->action in can_request_irq() can_request_irq() accesses and dereferences irq_desc->action w/o holding irq_desc->lock. So action can be freed on another CPU before it's dereferenced. Unlikely, but ... Protect it with desc->lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-24 14:38:23 +01:00
Dimitri Sivanich	92d6b71ab9	genirq: Expose irq_desc->node in proc/irq Expose irq_desc->node as /proc/irq/*/node. This file provides device hardware locality information for apps desiring to include hardware locality in irq mapping decisions. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-24 14:10:03 +01:00
Bjorn Helgaas	66f1207bce	resources: add interfaces that return conflict information request_resource() and insert_resource() only return success or failure, which no information about what existing resource conflicted with the proposed new reservation. This patch adds request_resource_conflict() and insert_resource_conflict(), which return the conflicting resource. Callers may use this for better error messages or to adjust the new resource and retry the request. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-03-23 13:33:50 -07:00
John Stultz	e1292ba164	ntp: Make time_adjust static Now that no arches are accessing time_adjust directly, make it static. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1268968769-19209-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-23 17:19:37 +01:00
John Stultz	830ec0458c	time: Fix accumulation bug triggered by long delay. The logarithmic accumulation done in the timekeeping has some overflow protection that limits the max shift value. That means it will take more then shift loops to accumulate all of the cycles. This causes the shift decrement to underflow, which causes the loop to never exit. The simplest fix would be simply to do a: if (shift) shift--; However that is not optimal, as we know the cycle offset is larger then the interval << shift, the above would make shift drop to zero, then we would be spinning for quite awhile accumulating at interval chunks at a time. Instead, this patch only decreases shift if the offset is smaller then cycle_interval << shift. This makes sure we accumulate using the largest chunks possible without overflowing tick_length, and limits the number of iterations through the loop. This issue was found and reported by Sonic Zhang, who also tested the fix. Many thanks your explanation and testing! Reported-by: Sonic Zhang <sonic.adi@gmail.com> Signed-off-by: John Stultz <johnstul@us.ibm.com> Tested-by: Sonic Zhang <sonic.adi@gmail.com> LKML-Reference: <1268948850-5225-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-23 16:41:01 +01:00
Ingo Molnar	d2f1e15b66	Merge commit 'v2.6.34-rc2' into perf/core Merge reason: Pick up latest perf fixes from upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-22 18:47:01 +01:00
Colin Ian King	8c2eb4805d	softlockup: Stop spurious softlockup messages due to overflow Ensure additions on touch_ts do not overflow. This can occur when the top 32 bits of the TSC reach 0xffffffff causing additions to touch_ts to overflow and this in turn generates spurious softlockup warnings. Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1268994482.1798.6.camel@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-21 19:30:13 +01:00
Steven Rostedt	2271048d1b	ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align The ring buffer uses 4 byte alignment while recording events into the buffer, even on 64bit machines. This saves space when there are lots of events being recorded at 4 byte boundaries. The ring buffer has a zero copy method to write into the buffer, with the reserving of space and then committing it. This may cause problems when writing an 8 byte word into a 4 byte alignment (not 8). For x86 and PPC this is not an issue, but on some architectures this would cause an out-of-alignment exception. This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine if it is OK to use 4 byte alignments on 64 bit machines. If it is not, it forces the ring buffer event header to be 8 bytes and not 4, and will align the length of the data to be 8 byte aligned. This keeps the data payload at 8 byte alignments and will allow these machines to run without issue. The trick to this is that the header can be either 4 bytes or 8 bytes depending on the length of the data payload. The 4 byte header has a length field that supports up to 112 bytes. If the length of the data is more than 112, the length field is set to zero, and the actual length is stored in the next 4 bytes after the header. When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces zero in the 4 byte header forcing the length to be stored in the 4 byte array, even with a small data load. It also forces the length of the data load to be 8 byte aligned. The combination of these two guarantee that the data is always at 8 byte alignment. Tested-by: Frederic Weisbecker <fweisbec@gmail.com> (on sparc64) Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-18 23:11:35 -04:00
Linus Torvalds	f82c37e7bb	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits) perf: Fix unexported generic perf_arch_fetch_caller_regs perf record: Don't try to find buildids in a zero sized file perf: export perf_trace_regs and perf_arch_fetch_caller_regs perf, x86: Fix hw_perf_enable() event assignment perf, ppc: Fix compile error due to new cpu notifiers perf: Make the install relative to DESTDIR if specified kprobes: Calculate the index correctly when freeing the out-of-line execution slot perf tools: Fix sparse CPU numbering related bugs perf_event: Fix oops triggered by cpu offline/online perf: Drop the obsolete profile naming for trace events perf: Take a hot regs snapshot for trace events perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot perf/x86-64: Use frame pointer to walk on irq and process stacks lockdep: Move lock events under lockdep recursion protection perf report: Print the map table just after samples for which no map was found perf report: Add multiple event support perf session: Change perf_session post processing functions to take histogram tree perf session: Add storage for seperating event types in report perf session: Change add_hist_entry to take the tree root instead of session perf record: Add ID and to recorded event data when recording multiple events ...	2010-03-18 16:52:46 -07:00
Frederic Weisbecker	dcd5c1662d	perf: Fix unexported generic perf_arch_fetch_caller_regs perf_arch_fetch_caller_regs() is exported for the overriden x86 version, but not for the generic weak version. As a general rule, weak functions should not have their symbol exported in the same file they are defined. So let's export it on trace_event_perf.c as it is used by trace events only. This fixes: ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined! ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined! -v2: And also only build it if trace events are enabled. -v3: Fix changelog mistake Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-17 12:26:49 +01:00
KOSAKI Motohiro	8bc037fb89	sched: Use proper type in sched_getaffinity() Using the proper type fixes the following compiler warning: kernel/sched.c:4850: warning: comparison of distinct pointer types lacks a cast Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: torvalds@linux-foundation.org Cc: travis@sgi.com Cc: peterz@infradead.org Cc: drepper@redhat.com Cc: rja@sgi.com Cc: sharyath@in.ibm.com Cc: steiner@sgi.com LKML-Reference: <20100317090046.4C79.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-17 10:48:49 +01:00
Thomas Weber	8839316121	Fix typos in comments [Ss]ytem => [Ss]ystem udpate => update paramters => parameters orginal => original Signed-off-by: Thomas Weber <swirl@gmx.li> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-03-16 11:47:56 +01:00
Andrew Morton	c890692bf3	kernel/sched.c: Suppress unused var warning On UP: kernel/sched.c: In function 'wake_up_new_task': kernel/sched.c:2631: warning: unused variable 'cpu' Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-16 11:13:42 +01:00
Dan Carpenter	6427462bfa	sched: Remove some dead code This was left over from "7c9414385e sched: Remove USER_SCHED" Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Dhaval Giani <dhaval.giani@gmail.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> LKML-Reference: <20100315082148.GD18181@bicker> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-16 11:05:44 +01:00
Paul E. McKenney	e3818b8dce	rcu: Make rcu_read_lock_bh_held() allow for disabled BH Disabling BH can stand in for rcu_read_lock_bh(), and this patch updates rcu_read_lock_bh_held() to allow for this. In order to avoid include-file hell, this function is moved out of line to kernel/rcupdate.c. This fixes a false positive RCU warning. Reported-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20100316000343.GA25857@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-16 09:57:49 +01:00
Frederic Weisbecker	1d199b1ad6	perf: Fix unexported generic perf_arch_fetch_caller_regs perf_arch_fetch_caller_regs() is exported for the overriden x86 version, but not for the generic weak version. As a general rule, weak functions should not have their symbol exported in the same file they are defined. So let's export it on trace_event_perf.c as it is used by trace events only. This fixes: ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined! ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined! -v2: And also only build it if trace events are enabled. -v3: Fix changelog mistake Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-16 09:27:27 +01:00
KOSAKI Motohiro	cd3d8031eb	sched: sched_getaffinity(): Allow less than NR_CPUS length [ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ] Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons. Unfortunately, glibc sched interface has the following definition: # define __CPU_SETSIZE 1024 # define __NCPUBITS (8 * sizeof (__cpu_mask)) typedef unsigned long int __cpu_mask; typedef struct { __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS]; } cpu_set_t; It mean, if NR_CPUS is bigger than 1024, cpu_set_t makes an ABI issue ... More recently, Sharyathi Nagesh reported following test program makes misterious syscall failure: ----------------------------------------------------------------------- #define _GNU_SOURCE #include<stdio.h> #include<errno.h> #include<sched.h> int main() { cpu_set_t set; if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0) printf("\n Call is failing with:%d", errno); } ----------------------------------------------------------------------- Because the kernel assumes len argument of sched_getaffinity() is bigger than NR_CPUS. But now it is not correct. Now we are faced with the following annoying dilemma, due to the limitations of the glibc interface built in years ago: (1) if we change glibc's __CPU_SETSIZE definition, we lost binary compatibility of _all_ application. (2) if we don't change it, we also lost binary compatibility of Sharyathi's use case. Then, I would propse to change the rule of the len argument of sched_getaffinity(). Old: len should be bigger than NR_CPUS New: len should be bigger than maximum possible cpu id This creates the following behavior: (A) In the real 4096 cpus machine, the above test program still return -EINVAL. (B) NR_CPUS=4096 but the machine have less than 1024 cpus (almost all machines in the world), the above can run successfully. Fortunatelly, BIG SGI machine is mainly used for HPC use case. It means they can rebuild their programs. IOW we hope they are not annoyed by this issue ... Reported-by: Sharyathi Nagesh <sharyath@in.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Mike Travis <travis@sgi.com> LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-15 08:28:44 +01:00
Ingo Molnar	12b8aeee3e	Merge branch 'linus' into timers/core Conflicts: Documentation/feature-removal-schedule.txt Merge reason: Resolve the conflict, update to upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-15 08:17:33 +01:00
Linus Torvalds	80a186074e	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix pick_next_highest_task_rt() for cgroups sched: Cleanup: remove unused variable in try_to_wake_up() x86: Fix sched_clock_cpu for systems with unsynchronized TSC	2010-03-13 14:46:18 -08:00
Linus Torvalds	4e3eaddd14	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: locking: Make sparse work with inline spinlocks and rwlocks x86/mce: Fix RCU lockdep splats rcu: Increase RCU CPU stall timeouts if PROVE_RCU ftrace: Replace read_barrier_depends() with rcu_dereference_raw() rcu: Suppress RCU lockdep warnings during early boot rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare() rcu: Suppress __mpol_dup() false positive from RCU lockdep rcu: Make rcu_read_lock_sched_held() handle !PREEMPT rcu: Add control variables to lockdep_rcu_dereference() diagnostics rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU rcu: Use wrapper function instead of exporting tasklist_lock sched, rcu: Fix rcu_dereference() for RCU-lockdep rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU x86/gart: Unexport gart_iommu_aperture Fix trivial conflicts in kernel/trace/ftrace.c	2010-03-13 14:43:01 -08:00
Linus Torvalds	8655e7e3dd	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Do not record user stack trace from NMI context tracing: Disable buffer switching when starting or stopping trace tracing: Use same local variable when resetting the ring buffer function-graph: Init curr_ret_stack with ret_stack ring-buffer: Move disabled check into preempt disable section function-graph: Add tracing_thresh support to function_graph tracer tracing: Update the comm field in the right variable in update_max_tr function-graph: Use comment notation for func names of dangling '}' function-graph: Fix unused reference to ftrace_set_func() tracing: Fix warning in s_next of trace file ops tracing: Include irqflags headers from trace clock	2010-03-13 14:40:50 -08:00
Linus Torvalds	9fdfbc2bff	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Provide generic perf_sample_data initialization MAINTAINERS: Add Arnaldo as tools/perf/ co-maintainer perf trace: Don't use pager if scripting perf trace/scripting: Remove extraneous header read perf, ARM: Modify kuser rmb() call to compile for Thumb-2 x86/stacktrace: Don't dereference bad frame pointers perf archive: Don't try to collect files without a build-id perf_events, x86: Fixup fixed counter constraints perf, x86: Restrict the ANY flag perf, x86: rename macro in ARCH_PERFMON_EVENTSEL_ENABLE perf, x86: add some IBS macros to perf_event.h perf, x86: make IBS macros available in perf_event.h hw-breakpoints: Remove stub unthrottle callback x86/hw-breakpoints: Remove the name field perf: Remove pointless breakpoint union perf lock: Drop the buffers multiplexing dependency perf lock: Fix and add misc documentally things percpu: Add __percpu sparse annotations to hw_breakpoint	2010-03-13 14:39:42 -08:00
Steven Rostedt	b6345879cc	tracing: Do not record user stack trace from NMI context A bug was found with Li Zefan's ftrace_stress_test that caused applications to segfault during the test. Placing a tracing_off() in the segfault code, and examining several traces, I found that the following was always the case. The lock tracer was enabled (lockdep being required) and userstack was enabled. Testing this out, I just enabled the two, but that was not good enough. I needed to run something else that could trigger it. Running a load like hackbench did not work, but executing a new program would. The following would trigger the segfault within seconds: # echo 1 > /debug/tracing/options/userstacktrace # echo 1 > /debug/tracing/events/lock/enable # while :; do ls > /dev/null ; done Enabling the function graph tracer and looking at what was happening I finally noticed that all cashes happened just after an NMI. 1) \| copy_user_handle_tail() { 1) \| bad_area_nosemaphore() { 1) \| __bad_area_nosemaphore() { 1) \| no_context() { 1) \| fixup_exception() { 1) 0.319 us \| search_exception_tables(); 1) 0.873 us \| } [...] 1) 0.314 us \| __rcu_read_unlock(); 1) 0.325 us \| native_apic_mem_write(); 1) 0.943 us \| } 1) 0.304 us \| rcu_nmi_exit(); [...] 1) 0.479 us \| find_vma(); 1) \| bad_area() { 1) \| __bad_area() { After capturing several traces of failures, all of them happened after an NMI. Curious about this, I added a trace_printk() to the NMI handler to read the regs->ip to see where the NMI happened. In which I found out it was here: ffffffff8135b660 <page_fault>: ffffffff8135b660: 48 83 ec 78 sub $0x78,%rsp ffffffff8135b664: e8 97 01 00 00 callq ffffffff8135b800 <error_entry> What was happening is that the NMI would happen at the place that a page fault occurred. It would call rcu_read_lock() which was traced by the lock events, and the user_stack_trace would run. This would trigger a page fault inside the NMI. I do not see where the CR2 register is saved or restored in NMI handling. This means that it would corrupt the page fault handling that the NMI interrupted. The reason the while loop of ls helped trigger the bug, was that each execution of ls would cause lots of pages to be faulted in, and increase the chances of the race happening. The simple solution is to not allow user stack traces in NMI context. After this patch, I ran the above "ls" test for a couple of hours without any issues. Without this patch, the bug would trigger in less than a minute. Cc: stable@kernel.org Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:31:49 -05:00
Steven Rostedt	a2f8071428	tracing: Disable buffer switching when starting or stopping trace When the trace iterator is read, tracing_start() and tracing_stop() is called to stop tracing while the iterator is processing the trace output. These functions disable both the standard buffer and the max latency buffer. But if the wakeup tracer is running, it can switch these buffers between the two disables: buffer = global_trace.buffer; if (buffer) ring_buffer_record_disable(buffer); <<<--------- swap happens here buffer = max_tr.buffer; if (buffer) ring_buffer_record_disable(buffer); What happens is that we disabled the same buffer twice. On tracing_start() we can enable the same buffer twice. All ring_buffer_record_disable() must be matched with a ring_buffer_record_enable() or the buffer can be disable permanently, or enable prematurely, and cause a bug where a reset happens while a trace is commiting. This patch protects these two by taking the ftrace_max_lock to prevent a switch from occurring. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:30:21 -05:00
Steven Rostedt	283740c619	tracing: Use same local variable when resetting the ring buffer In the ftrace code that resets the ring buffer it references the buffer with a local variable, but then uses the tr->buffer as the parameter to reset. If the wakeup tracer is running, which can switch the tr->buffer with the max saved buffer, this can break the requirement of disabling the buffer before the reset. buffer = tr->buffer; ring_buffer_record_disable(buffer); synchronize_sched(); __tracing_reset(tr->buffer, cpu); If the tr->buffer is swapped, then the reset is not happening to the buffer that was disabled. This will cause the ring buffer to fail. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:29:20 -05:00
Steven Rostedt	ea14eb7140	function-graph: Init curr_ret_stack with ret_stack If the graph tracer is active, and a task is forked but the allocating of the processes graph stack fails, it can cause crash later on. This is due to the temporary stack being NULL, but the curr_ret_stack variable is copied from the parent. If it is not -1, then in ftrace_graph_probe_sched_switch() the following: for (index = next->curr_ret_stack; index >= 0; index--) next->ret_stack[index].calltime += timestamp; Will cause a kernel OOPS. Found with Li Zefan's ftrace_stress_test. Cc: stable@kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:28:02 -05:00
Lai Jiangshan	52fbe9cde7	ring-buffer: Move disabled check into preempt disable section The ring buffer resizing and resetting relies on a schedule RCU action. The buffers are disabled, a synchronize_sched() is called and then the resize or reset takes place. But this only works if the disabling of the buffers are within the preempt disabled section, otherwise a window exists that the buffers can be written to while a reset or resize takes place. Cc: stable@kernel.org Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B949E43.2010906@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-12 20:26:56 -05:00
Linus Torvalds	64d5aea300	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timekeeping: Prevent oops when GENERIC_TIME=n	2010-03-12 16:27:08 -08:00
Linus Torvalds	c32da02342	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits) doc: fix typo in comment explaining rb_tree usage Remove fs/ntfs/ChangeLog doc: fix console doc typo doc: cpuset: Update the cpuset flag file Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed Remove drivers/parport/ChangeLog Remove drivers/char/ChangeLog doc: typo - Table 1-2 should refer to "status", not "statm" tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h devres/irq: Fix devm_irq_match comment Remove reference to kthread_create_on_cpu tree-wide: Assorted spelling fixes tree-wide: fix 'lenght' typo in comments and code drm/kms: fix spelling in error message doc: capitalization and other minor fixes in pnp doc devres: typo fix s/dev/devm/ Remove redundant trailing semicolons from macros fix typo "definetly" -> "definitely" in comment tree-wide: s/widht/width/g typo in comments ... Fix trivial conflict in Documentation/laptops/00-INDEX	2010-03-12 16:04:50 -08:00
Dave Young	2edf5e4980	sysctl extern cleanup: lockdep Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move lockdep extern declarations to linux/lockdep.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	4f0e056fde	sysctl extern cleanup: rtmutex Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move max_lock_depth extern declaration to linux/rtmutex.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	c55b7c3e82	sysctl extern cleanup: acct Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move acct_parm extern declaration to linux/acct.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	15485a4682	sysctl extern cleanup: sg Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move sg_big_buff extern declaration to scsi/sg.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: Doug Gilbert <dgilbert@interlog.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	5ed109103d	sysctl extern cleanup: module Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move modprobe_path extern declaration to linux/kmod.h Move modules_disabled extern declaration to linux/module.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:10 -08:00
Dave Young	e5ab67726f	sysctl extern cleanup: rcu Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move rcutorture_runnable extern declaration to linux/rcupdate.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:53:08 -08:00
Dave Young	d33ed52d57	sysctl extern cleanup: signal Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move print_fatal_signals extern declaration to linux/signal.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:44 -08:00
Dave Young	eb5572fed5	sysctl extern cleanup: C_A_D Extern declarations in sysctl.c should be moved to their own header file, and then include them in relavant .c files. Move C_A_D extern variable declaration to linux/reboot.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:44 -08:00
Alexey Dobriyan	8467005da3	nsproxy: remove INIT_NSPROXY() Remove INIT_NSPROXY(), use C99 initializer. Remove INIT_IPC_NS(), INIT_NET_NS() while I'm at it. Note: headers trim will be done later, now it's quite pointless because results will be invalidated by merge window. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:40 -08:00
Oleg Nesterov	13aa9a6b0f	pid_ns: zap_pid_ns_processes: use SEND_SIG_NOINFO instead of force_sig() zap_pid_ns_processes() uses force_sig(SIGKILL) to ensure SIGKILL will be delivered to sub-namespace inits as well. This is correct, but we are going to change force_sig_info() semantics. See http://bugzilla.kernel.org/show_bug.cgi?id=15395#c31 We can use send_sig_info(SEND_SIG_NOINFO) instead, since `614c517d7c` ("signals: SEND_SIG_NOINFO should be considered as SI_FROMUSER()") SEND_SIG_NOINFO means "from user" and therefore send_signal() will get the correct from_ancestor_ns = T flag. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:40 -08:00
Veaceslav Falico	93c59907c6	copy_signal() cleanup: clean thread_group_cputime_init() Remove unneeded initializations in thread_group_cputime_init() and in posix_cpu_timers_init_group(). They are useless after kmem_cache_zalloc() was used in copy_signal(). Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:39 -08:00
Veaceslav Falico	4dd66e69d4	copy_signal() cleanup: kill taskstats_tgid_init() and acct_init_pacct() Kill unused functions taskstats_tgid_init() and acct_init_pacct() because we don't use them anywhere after using kmem_cache_zalloc() in copy_signal(). Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:39 -08:00
Veaceslav Falico	a56704ef6b	copy_signal() cleanup: use zalloc and remove initializations Use kmem_cache_zalloc() on signal creation and remove unneeded initialization lines in copy_signal(). Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:39 -08:00
Kirill A. Shutemov	a0a4db548e	cgroups: remove events before destroying subsystem state objects Events should be removed after rmdir of cgroup directory, but before destroying subsystem state objects. Let's take reference to cgroup directory dentry to do that. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:37 -08:00
Kirill A. Shutemov	4ab78683c1	cgroups: fix race between userspace and kernelspace Notify userspace about cgroup removing only after rmdir of cgroup directory to avoid race between userspace and kernelspace. eventfd are used to notify about two types of event: - control file-specific, like crossing memory threshold; - cgroup removing. To understand what really happen, userspace can check if the cgroup still exists. To avoid race beetween userspace and kernelspace we have to notify userspace about cgroup removing only after rmdir of cgroup directory. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:37 -08:00
Kirill A. Shutemov	0dea116876	cgroup: implement eventfd-based generic API for notifications This patchset introduces eventfd-based API for notifications in cgroups and implements memory notifications on top of it. It uses statistics in memory controler to track memory usage. Output of time(1) on building kernel on tmpfs: Root cgroup before changes: make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total Non-root cgroup before changes: make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total Root cgroup after changes (0 thresholds): make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total Non-root cgroup after changes (0 thresholds): make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total Root cgroup after changes (1 thresholds, never crossed): make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total Non-root cgroup after changes (1 thresholds, never crossed): make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total This patch: Introduce the write-only file "cgroup.event_control" in every cgroup. To register new notification handler you need: - create an eventfd; - open a control file to be monitored. Callbacks register_event() and unregister_event() must be defined for the control file; - write "<event_fd> <control_fd> <args>" to cgroup.event_control. Interpretation of args is defined by control file implementation; eventfd will be woken up by control file implementation or when the cgroup is removed. To unregister notification handler just close eventfd. If you need notification functionality for a control file you have to implement callbacks register_event() and unregister_event() in the struct cftype. [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Vladislav Buzov <vbuzov@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Alexander Shishkin <virtuoso@slind.org> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:37 -08:00
Li Zefan	b70cc5fdb4	cgroups: clean up cgroup_pidlist_find() a bit Don't call get_pid_ns() before we locate/alloc the ns. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Serge Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Ben Blum	67523c48aa	cgroups: blkio subsystem as module Modify the Block I/O cgroup subsystem to be able to be built as a module. As the CFQ disk scheduler optionally depends on blk-cgroup, config options in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are enhanced to support the new module dependency. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Ben Blum	cf5d5941fd	cgroups: subsystem module unloading Provides support for unloading modular subsystems. This patch adds a new function cgroup_unload_subsys which is to be used for removing a loaded subsystem during module deletion. Reference counting of the subsystems' modules is moved from once (at load time) to once per attached hierarchy (in parse_cgroupfs_options and rebind_subsystems) (i.e., 0 or 1). Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Ben Blum	e6a1105ba0	cgroups: subsystem module loading interface Add interface between cgroups subsystem management and module loading This patch implements rudimentary module-loading support for cgroups - namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a module initcall, and a struct module pointer in struct cgroup_subsys. Several functions that might be wanted by modules have had EXPORT_SYMBOL added to them, but it's unclear exactly which functions want it and which won't. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Ben Blum	aae8aab403	cgroups: revamp subsys array This patch series provides the ability for cgroup subsystems to be compiled as modules both within and outside the kernel tree. This is mainly useful for classifiers and subsystems that hook into components that are already modules. cls_cgroup and blkio-cgroup serve as the example use cases for this feature. It provides an interface cgroup_load_subsys() and cgroup_unload_subsys() which modular subsystems can use to register and depart during runtime. The net_cls classifier subsystem serves as the example for a subsystem which can be converted into a module using these changes. Patch #1 sets up the subsys[] array so its contents can be dynamic as modules appear and (eventually) disappear. Iterations over the array are modified to handle when subsystems are absent, and the dynamic section of the array is protected by cgroup_mutex. Patch #2 implements an interface for modules to load subsystems, called cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module pointer in struct cgroup_subsys. Patch #3 adds a mechanism for unloading modular subsystems, which includes a more advanced rework of the rudimentary reference counting introduced in patch 2. Patch #4 modifies the net_cls subsystem, which already had some module declarations, to be configurable as a module, which also serves as a simple proof-of-concept. Part of implementing patches 2 and 4 involved updating css pointers in each css_set when the module appears or leaves. In doing this, it was discovered that css_sets always remain linked to the dummy cgroup, regardless of whether or not any subsystems are actually bound to it (i.e., not mounted on an actual hierarchy). The subsystem loading and unloading code therefore should keep in mind the special cases where the added subsystem is the only one in the dummy cgroup (and therefore all css_sets need to be linked back into it) and where the removed subsys was the only one in the dummy cgroup (and therefore all css_sets should be unlinked from it) - however, as all css_sets always stay attached to the dummy cgroup anyway, these cases are ignored. Any fix that addresses this issue should also make sure these cases are addressed in the subsystem loading and unloading code. This patch: Make subsys[] able to be dynamically populated to support modular subsystems This patch reworks the way the subsys[] array is used so that subsystems can register themselves after boot time, and enables the internals of cgroups to be able to handle when subsystems are not present or may appear/disappear. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Daisuke Nishimura	d7b9fff711	cgroup: introduce coalesce css_get() and css_put() Current css_get() and css_put() increment/decrement css->refcnt one by one. This patch add a new function __css_get(), which takes "count" as a arg and increment the css->refcnt by "count". And this patch also add a new arg("count") to __css_put() and change the function to decrement the css->refcnt by "count". These coalesce version of __css_get()/__css_put() will be used to improve performance of memcg's moving charge feature later, where instead of calling css_get()/css_put() repeatedly, these new functions will be used. No change is needed for current users of css_get()/css_put(). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Daisuke Nishimura	2468c7234b	cgroup: introduce cancel_attach() Add cancel_attach() operation to struct cgroup_subsys. cancel_attach() can be used when can_attach() operation prepares something for the subsys, but we should rollback what can_attach() operation has prepared if attach task fails after we've succeeded in can_attach(). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:35 -08:00
Christoph Hellwig	5cacdb4add	Add generic sys_olduname() Add generic implementations of the old and really old uname system calls. Note that sh only implements sys_olduname but not sys_oldolduname, but I'm not going to bother with another ifdef for that special case. m32r implemented an old uname but never wired it up, so kill it, too. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:32 -08:00
Christoph Hellwig	e28cbf2293	improve sys_newuname() for compat architectures On an architecture that supports 32-bit compat we need to override the reported machine in uname with the 32-bit value. Instead of doing this separately in every architecture introduce a COMPAT_UTS_MACHINE define in <asm/compat.h> and apply it directly in sys_newuname(). Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:32 -08:00
Christoph Hellwig	baed7fc9b5	Add generic sys_ipc wrapper Add a generic implementation of the ipc demultiplexer syscall. Except for s390 and sparc64 all implementations of the sys_ipc are nearly identical. There are slight differences in the types of the parameters, where mips and powerpc as the only 64-bit architectures with sys_ipc use unsigned long for the "third" argument as it gets casted to a pointer later, while it traditionally is an "int" like most other paramters. frv goes even further and uses unsigned long for all parameters execept for "ptr" which is a pointer type everywhere. The change from int to unsigned long for "third" and back to "int" for the others on frv should be fine due to the in-register calling conventions for syscalls (we already had a similar issue with the generic sys_ptrace), but I'd prefer to have the arch maintainers looks over this in details. Except for that h8300, m68k and m68knommu lack an impplementation of the semtimedop sub call which this patch adds, and various architectures have gets used - at least on i386 it seems superflous as the compat code on x86-64 and ia64 doesn't even bother to implement it. [akpm@linux-foundation.org: add sys_ipc to sys_ni.c] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: James Morris <jmorris@namei.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:32 -08:00
Thomas Gleixner	802702e0c2	timer: Try to survive timer callback preempt_count leak If a timer callback leaks preempt_count we currently assert a BUG(). That makes it unnecessarily hard to retrieve information about the problem especially on laptops and headless stations. There is a decent chance to survive the preempt_count leak by restoring the preempt_count to the value before the callback. That allows in many cases to get valuable information about the root cause of the problem. We carried that fixup in preempt-rt for years and were able to decode such wreckage quite a few times. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Veen <arjan@infradead.org>	2010-03-12 22:40:44 +01:00
Thomas Gleixner	576da126a6	timer: Split out timer function call The ident level is starting to be annoying. More white space than actual code. Split out the timer function call into its own function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:43 +01:00
Uwe Kleine-König	06f71b922c	timer: Print function name for timer callbacks modifying preemption count A function scheduled with a timer must not exit with a different preempt count than it was entered. To make helping users running into the corresponding BUG() easier also print the name of the bad function not only its address. [ tglx: Sanitized printk ] Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: johnstul@us.ibm.com Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:42 +01:00
John Stultz	64ce4c2f52	time: Clean up warp_clock() warp_clock() currently accesses timekeeping internal state directly, which is unnecessary. Convert it to use the proper timekeeping interfaces. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:42 +01:00
Stanislaw Gruszka	c28739375b	cpu-timers: Avoid iterating over all threads in fastpath_timer_check() Spread p->sighand->siglock locking scope to make sure that fastpath_timer_check() never iterates over all threads. Without locking there is small possibility that signal->cputimer will stop running while we write values to signal->cputime_expires. Calling thread_group_cputime() from fastpath_timer_check() is not only bad because it is slow, also it is racy with __exit_signal() which can lead to invalid signal->{s,u}time values. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:41 +01:00
Stanislaw Gruszka	1f169f84d2	cpu-timers: Change SIGEV_NONE timer implementation When user sets up a timer without associated signal and process does not use any other cpu timers and does not exit, tsk->signal->cputimer is enabled and running forever. Avoid running the timer for no reason. I used below program to check patch does not break current user space visible behavior. #include <sys/time.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <time.h> #include <unistd.h> #include <assert.h> void consume_cpu(void) { int i = 0; int count = 0; for(i=0; i<100000000; i++) count++; } int main(void) { int i; struct sigaction act; struct sigevent evt = { }; timer_t tid; struct itimerspec spec = { }; evt.sigev_notify = SIGEV_NONE; assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt, &tid) == 0); spec.it_value.tv_sec = 10; assert(timer_settime(tid, 0, &spec, NULL) == 0); for (i = 0; i < 30; i++) { consume_cpu(); memset(&spec, 0, sizeof(spec)); assert(timer_gettime(tid, &spec) == 0); printf("%lu.%09lu\n", (unsigned long) spec.it_value.tv_sec, (unsigned long) spec.it_value.tv_nsec); } assert(timer_delete(tid) == 0); return 0; } Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:40 +01:00
Stanislaw Gruszka	ae1a78eecc	cpu-timers: Return correct previous timer reload value According POSIX we need to correctly set old timer it_interval value when user request that in timer_settime(). Tested using below program. #include <sys/time.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <assert.h> int main(void) { struct sigaction act; struct sigevent evt = { }; timer_t tid; struct itimerspec spec, u_spec, k_spec; evt.sigev_notify = SIGEV_SIGNAL; evt.sigev_signo = SIGPROF; assert(timer_create(CLOCK_PROCESS_CPUTIME_ID, &evt, &tid) == 0); spec.it_value.tv_sec = 1; spec.it_value.tv_nsec = 2; spec.it_interval.tv_sec = 3; spec.it_interval.tv_nsec = 4; u_spec = spec; assert(timer_settime(tid, 0, &spec, NULL) == 0); spec.it_value.tv_sec = 5; spec.it_value.tv_nsec = 6; spec.it_interval.tv_sec = 7; spec.it_interval.tv_nsec = 8; assert(timer_settime(tid, 0, &spec, &k_spec) == 0); #define PRT(val) printf(#val ":\t%d/%d\n", (int) u_spec.val, (int) k_spec.val) PRT(it_value.tv_sec); PRT(it_value.tv_nsec); PRT(it_interval.tv_sec); PRT(it_interval.tv_nsec); return 0; } Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:40 +01:00
Stanislaw Gruszka	5eb9aa6414	cpu-timers: Cleanup arm_timer() Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:39 +01:00
Stanislaw Gruszka	f55db60904	cpu-timers: Simplify RLIMIT_CPU handling Let always set signal->cputime_expires expiration cache when setting new itimer, POSIX 1.b timer, and RLIMIT_CPU. Since we are initializing prof_exp expiration cache during fork(), this allows to remove "RLIMIT_CPU != inf" check from fastpath_timer_check() and do some other cleanups. Checked against regression using test cases from: http://marc.info/?l=linux-kernel&m=123749066504641&w=4 http://marc.info/?l=linux-kernel&m=123811277916642&w=2 Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 22:40:39 +01:00
Stanislaw Gruszka	15365c108e	posix-cpu-timers: Reset expire cache when no timer is running When a process deletes cpu timer or a timer expires we do not clear the expiration cache sig->cputimer_expires. As a result the fastpath_timer_check() which prevents us to loop over all threads in case no timer is active is not working and we run the slow path needlessly on every tick. Zero sig->cputimer_expires in stop_process_timers(). Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Spencer Candland <spencer@bluehost.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 19:12:18 +01:00
Andrew Morton	829b6c1ef4	timer stats: Fix del_timer_sync() and try_to_del_timer_sync() These functions forgot to run timer_stats_timer_clear_start_info(). It's unobvious what effect this has and whether it matters much - we won't be printing it out anyway if the timer's detached. Untested, just an Ingo trollpatch. [ Nevertheless correct - tglx ] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 19:11:29 +01:00
Thomas Gleixner	80a05b9ffa	clockevents: Sanitize min_delta_ns adjustment and prevent overflows The current logic which handles clock events programming failures can increase min_delta_ns unlimited and even can cause overflows. Sanitize it by: - prevent zero increase when min_delta_ns == 1 - limiting min_delta_ns to a jiffie - bail out if the jiffie limit is hit - add retries stats for /proc/timer_list so we can gather data Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-12 19:10:29 +01:00
Ingo Molnar	937779db13	Merge branch 'perf/urgent' into perf/core Merge reason: We want to queue up a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-12 10:20:59 +01:00
Mike Galbraith	beac4c7e4a	sched: Remove AFFINE_WAKEUPS feature Disabling affine wakeups is too horrible to contemplate. Remove the feature flag. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301890.6785.50.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:53 +01:00
Mike Galbraith	13814d42e4	sched: Remove ASYM_GRAN feature This features has been enabled for quite a while, after testing showed that easing preemption for light tasks was harmful to high priority threads. Remove the feature flag. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301675.6785.44.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:53 +01:00
Mike Galbraith	c6ee36c423	sched: Remove SYNC_WAKEUPS feature Sync wakeups are critical functionality with a long history. Remove it, we don't need the branch or icache footprint. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301817.6785.47.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:53 +01:00
Mike Galbraith	f2e74eeac0	sched: Remove WAKEUP_SYNC feature This feature never earned its keep, remove it. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301591.6785.42.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:52 +01:00
Mike Galbraith	5ca9880c6f	sched: Remove FAIR_SLEEPERS feature Our preemption model relies too heavily on sleeper fairness to disable it without dire consequences. Remove the feature, and save a branch or two. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301520.6785.40.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:52 +01:00
Mike Galbraith	6bc6cf2b61	sched: Remove NORMALIZED_SLEEPER This feature hasn't been enabled in a long time, remove effectively dead code. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301447.6785.38.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:52 +01:00
Mike Galbraith	8b911acdf0	sched: Fix select_idle_sibling() Don't bother with selection when the current cpu is idle. Recent load balancing changes also make it no longer necessary to check wake_affine() success before returning the selected sibling, so we now always use it. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301369.6785.36.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:51 +01:00
Mike Galbraith	21406928af	sched: Tweak sched_latency and min_granularity Allow LAST_BUDDY to kick in sooner, improving cache utilization as soon as a second buddy pair arrives on scene. The cost is latency starting to climb sooner, the tbenefit for tbench 8 on my Q6600 box is ~2%. No detrimental effects noted in normal idesktop usage. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301285.6785.34.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:51 +01:00
Mike Galbraith	a64692a3af	sched: Cleanup/optimize clock updates Now that we no longer depend on the clock being updated prior to enqueueing on migratory wakeup, we can clean up a bit, placing calls to update_rq_clock() exactly where they are needed, ie on enqueue, dequeue and schedule events. In the case of a freshly enqueued task immediately preempting, we can skip the update during preemption, as the clock was just updated by the enqueue event. We also save an unneeded call during a migratory wakeup by not updating the previous runqueue, where update_curr() won't be invoked. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301199.6785.32.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:50 +01:00
Mike Galbraith	e12f31d3e5	sched: Remove avg_overlap Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy was detrimentally affected by cross-cpu wakeups, this because we are missing the necessary call to update_curr(). This can't be fixed without increasing overhead in our already too fat fastpath. Additionally, with recent load balancing changes making us prefer to place tasks in an idle cache domain (which is good for compute bound loads), communicating tasks suffer when a sync wakeup, which would enable affine placement, is turned into a non-sync wakeup by SYNC_LESS. With one task on the runqueue, wake_affine() rejects the affine wakeup request, leaving the unfortunate where placed, taking frequent cache misses. Remove it, and recover some fastpath cycles. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301121.6785.30.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:50 +01:00
Mike Galbraith	b42e0c41a4	sched: Remove avg_wakeup Testing the load which led to this heuristic (nfs4 kbuild) shows that it has outlived it's usefullness. With intervening load balancing changes, I cannot see any difference with/without, so recover there fastpath cycles. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301062.6785.29.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:50 +01:00
Mike Galbraith	39c0cbe215	sched: Rate-limit nohz Entering nohz code on every micro-idle is costing ~10% throughput for netperf TCP_RR when scheduling cross-cpu. Rate limiting entry fixes this, but raises ticks a bit. On my Q6600, an idle box goes from ~85 interrupts/sec to 128. The higher the context switch rate, the more nohz entry costs. With this patch and some cycle recovery patches in my tree, max cross cpu context switch rate is improved by ~16%, a large portion of which of which is this ratelimiting. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268301003.6785.28.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 18:32:49 +01:00
eranian@google.com	9b33fa6ba0	perf_events: Improve task_sched_in() This patch is an optimization in perf_event_task_sched_in() to avoid scheduling the events twice in a row. Without it, the perf_disable()/perf_enable() pair is invoked twice, thereby pinned events counts while scheduling flexible events and we go throuh hw_perf_enable() twice. By encapsulating, the whole sequence into perf_disable()/perf_enable() we ensure, hw_perf_enable() is going to be invoked only once because of the refcount protection. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268288765-5326-1-git-send-email-eranian@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 15:23:28 +01:00
Lucas De Marchi	41acab8851	sched: Implement group scheduler statistics in one struct Put all statistic fields of sched_entity in one struct, sched_statistics, and embed it into sched_entity. This change allows to memset the sched_statistics to 0 when needed (for instance when forking), avoiding bugs of non initialized fields. Signed-off-by: Lucas De Marchi <lucas.de.marchi@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1268275065-18542-1-git-send-email-lucas.de.marchi@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 15:22:28 +01:00
Peter Zijlstra	3d07467b7a	sched: Fix pick_next_highest_task_rt() for cgroups Since pick_next_highest_task_rt() already iterates all the cgroups and is really only interested in tasks, skip over the !task entries. Reported-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Dhaval Giani <dhaval.giani@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 15:21:50 +01:00
Xiao Guangrong	639fe4b12f	perf: export perf_trace_regs and perf_arch_fetch_caller_regs Export perf_trace_regs and perf_arch_fetch_caller_regs since module will use these. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> [ use EXPORT_PER_CPU_SYMBOL_GPL() ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4B989C1B.2090407@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 15:21:29 +01:00
Masami Hiramatsu	83ff56f46a	kprobes: Calculate the index correctly when freeing the out-of-line execution slot From : Ananth N Mavinakayanahalli <ananth@in.ibm.com> When freeing the instruction slot, the arithmetic to calculate the index of the slot in the page needs to account for the total size of the instruction on the various architectures. Calculate the index correctly when freeing the out-of-line execution slot. Reported-by: Sachin Sant <sachinp@in.ibm.com> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> LKML-Reference: <4B9667AB.9050507@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 14:06:16 +01:00
Dan Carpenter	ab3b3aa5dd	sched: Cleanup: remove unused variable in try_to_wake_up() We haven't used the "orig_rq" variable since `055a00865d` "Fix/add missing update_rq_clock() calls" Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: efault@gmx.de LKML-Reference: <20100306111752.GL4958@bicker> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 13:59:59 +01:00
Ingo Molnar	915a0b575f	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent	2010-03-11 13:39:33 +01:00
Paul E. McKenney	007b09243b	rcu: Increase RCU CPU stall timeouts if PROVE_RCU CONFIG_PROVE_RCU imposes additional overhead on the kernel, so increase the RCU CPU stall timeouts in an attempt to allow for this effect. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267830207-9474-2-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 13:38:01 +01:00
Paul E. McKenney	3f379b03fb	ftrace: Replace read_barrier_depends() with rcu_dereference_raw() Replace the calls to read_barrier_depends() in ftrace_list_func() with rcu_dereference_raw() to improve readability. The reason that we use rcu_dereference_raw() here is that removed entries are never freed, instead they are simply leaked. This is one of a very few cases where use of rcu_dereference_raw() is the long-term right answer. And I don't yet know of any others. ;-) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267830207-9474-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 13:38:01 +01:00
Paul Mackerras	220b140b52	perf_event: Fix oops triggered by cpu offline/online Anton Blanchard found that he could reliably make the kernel hit a BUG_ON in the slab allocator by taking a cpu offline and then online while a system-wide perf record session was running. The reason is that when the cpu comes up, we completely reinitialize the ctx field of the struct perf_cpu_context for the cpu. If there is a system-wide perf record session running, then there will be a struct perf_event that has a reference to the context, so its refcount will be 2. (The perf_event has been removed from the context's group_entry and event_entry lists by perf_event_exit_cpu(), but that doesn't remove the perf_event's reference to the context and doesn't decrement the context's refcount.) When the cpu comes up, perf_event_init_cpu() gets called, and it calls __perf_event_init_context() on the cpu's context. That resets the refcount to 1. Then when the perf record session finishes and the perf_event is closed, the refcount gets decremented to 0 and the context gets kfreed after an RCU grace period. Since the context wasn't kmalloced -- it's part of a per-cpu variable -- bad things happen. In fact we don't need to completely reinitialize the context when the cpu comes up. It's sufficient to initialize the context once at boot, but we need to do it for all possible cpus. This moves the context initialization to happen at boot time. With this, we don't trash the refcount and the context never gets kfreed, and we don't hit the BUG_ON. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Tested-by: Anton Blanchard <anton@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-11 12:43:51 +01:00
Thomas Gleixner	0b1adaa031	genirq: Prevent oneshot irq thread race Lars-Peter pointed out that the oneshot threaded interrupt handler code has the following race: CPU0 CPU1 hande_level_irq(irq X) mask_ack_irq(irq X) handle_IRQ_event(irq X) wake_up(thread_handler) thread handler(irq X) runs finalize_oneshot(irq X) does not unmask due to !(desc->status & IRQ_MASKED) return from irq does not unmask due to (desc->status & IRQ_ONESHOT) This leaves the interrupt line masked forever. The reason for this is the inconsistent handling of the IRQ_MASKED flag. Instead of setting it in the mask function the oneshot support sets the flag after waking up the irq thread. The solution for this is to set/clear the IRQ_MASKED status whenever we mask/unmask an interrupt line. That's the easy part, but that cleanup opens another race: CPU0 CPU1 hande_level_irq(irq) mask_ack_irq(irq) handle_IRQ_event(irq) wake_up(thread_handler) thread handler(irq) runs finalize_oneshot_irq(irq) unmask(irq) irq triggers again handle_level_irq(irq) mask_ack_irq(irq) return from irq due to IRQ_INPROGRESS return from irq does not unmask due to (desc->status & IRQ_ONESHOT) This requires that we synchronize finalize_oneshot_irq() with the primary handler. If IRQ_INPROGESS is set we wait until the primary handler on the other CPU has returned before unmasking the interrupt line again. We probably have never seen that problem because it does not happen on UP and on SMP the irqbalancer protects us by pinning the primary handler and the thread to the same CPU. Reported-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2010-03-10 17:45:14 +01:00
Frederic Weisbecker	97d5a22005	perf: Drop the obsolete profile naming for trace events Drop the obsolete "profile" naming used by perf for trace events. Perf can now do more than simple events counting, so generalize the API naming. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com>	2010-03-10 14:47:18 +01:00
Frederic Weisbecker	c530665c31	perf: Take a hot regs snapshot for trace events We are taking a wrong regs snapshot when a trace event triggers. Either we use get_irq_regs(), which gives us the interrupted registers if we are in an interrupt, or we use task_pt_regs() which gives us the state before we entered the kernel, assuming we are lucky enough to be no kernel thread, in which case task_pt_regs() returns the initial set of regs when the kernel thread was started. What we want is different. We need a hot snapshot of the regs, so that we can get the instruction pointer to record in the sample, the frame pointer for the callchain, and some other things. Let's use the new perf_fetch_caller_regs() for that. Comparison with perf record -e lock: -R -a -f -g Before: perf [kernel] [k] __do_softirq \| --- __do_softirq \| \|--55.16%-- __open \| --44.84%-- __write_nocancel After: perf [kernel] [k] perf_tp_event \| --- perf_tp_event \| \|--41.07%-- lock_acquire \| \| \| \|--39.36%-- _raw_spin_lock \| \| \| \| \| \|--7.81%-- hrtimer_interrupt \| \| \| smp_apic_timer_interrupt \| \| \| apic_timer_interrupt The old case was producing unreliable callchains. Now having right frame and instruction pointers, we have the trace we want. Also syscalls and kprobe events already have the right regs, let's use them instead of wasting a retrieval. v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs() Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Archs <linux-arch@vger.kernel.org>	2010-03-10 14:40:38 +01:00
Frederic Weisbecker	5331d7b846	perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot Events that trigger overflows by interrupting a context can use get_irq_regs() or task_pt_regs() to retrieve the state when the event triggered. But this is not the case for some other class of events like trace events as tracepoints are executed in the same context than the code that triggered the event. It means we need a different api to capture the regs there, namely we need a hot snapshot to get the most important informations for perf: the instruction pointer to get the event origin, the frame pointer for the callchain, the code segment for user_mode() tests (we always use __KERNEL_CS as trace events always occur from the kernel) and the eflags for further purposes. v2: rename perf_save_regs to perf_fetch_caller_regs as per Masami's suggestion. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Archs <linux-arch@vger.kernel.org>	2010-03-10 14:39:35 +01:00
Frederic Weisbecker	db2c4c7791	lockdep: Move lock events under lockdep recursion protection There are rcu locked read side areas in the path where we submit a trace event. And these rcu_read_(un)lock() trigger lock events, which create recursive events. One pair in do_perf_sw_event: __lock_acquire \| \|--96.11%-- lock_acquire \| \| \| \|--27.21%-- do_perf_sw_event \| \| perf_tp_event \| \| \| \| \| \|--49.62%-- ftrace_profile_lock_release \| \| \| lock_release \| \| \| \| \| \| \| \|--33.85%-- _raw_spin_unlock Another pair in perf_output_begin/end: __lock_acquire \|--23.40%-- perf_output_begin \| \| __perf_event_overflow \| \| perf_swevent_overflow \| \| perf_swevent_add \| \| perf_swevent_ctx_event \| \| do_perf_sw_event \| \| perf_tp_event \| \| \| \| \| \|--55.37%-- ftrace_profile_lock_acquire \| \| \| lock_acquire \| \| \| \| \| \| \| \|--37.31%-- _raw_spin_lock The problem is not that much the trace recursion itself, as we have a recursion protection already (though it's always wasteful to recurse). But the trace events are outside the lockdep recursion protection, then each lockdep event triggers a lock trace, which will trigger two other lockdep events. Here the recursive lock trace event won't be taken because of the trace recursion, so the recursion stops there but lockdep will still analyse these new events: To sum up, for each lockdep events we have: lock_*() \| trace lock_acquire \| ----- rcu_read_lock() \| \| \| lock_acquire() \| \| \| trace_lock_acquire() (stopped) \| \| \| lockdep analyze \| ----- rcu_read_unlock() \| lock_release \| trace_lock_release() (stopped) \| lockdep analyze And you can repeat the above two times as we have two rcu read side sections when we submit an event. This is fixed in this patch by moving the lock trace event under the lockdep recursion protection. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com>	2010-03-10 14:26:07 +01:00
Peter Zijlstra	d4944a0666	perf: Provide better condition for event rotation Try to avoid useless rotation and PMU disables. [ Could be improved by keeping a nr_runnable count to better account for the < PERF_STAT_INACTIVE counters ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: paulus@samba.org Cc: eranian@google.com Cc: robert.richter@amd.com Cc: fweisbec@gmail.com LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-10 13:22:36 +01:00
Peter Zijlstra	32975a4f11	perf: Optimize perf_disable Currently we always call hw_perf_disable(), even if its already disabled, this seems superflous, esp. since it cannot be made NMI safe (see further patches). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: eranian@google.com Cc: robert.richter@amd.com Cc: fweisbec@gmail.com Cc: Arnaldo Carvalho de Melo <acme@infradead.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-10 13:22:25 +01:00
Peter Zijlstra	3f6da39053	perf: Rework and fix the arch CPU-hotplug hooks Remove the hw_perf_event_*() hotplug hooks in favour of per PMU hotplug notifiers. This has the advantage of reducing the static weak interface as well as exposing all hotplug actions to the PMU. Use this to fix x86 hotplug usage where we did things in ONLINE which should have been done in UP_PREPARE or STARTING. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mundt <lethal@linux-sh.org> Cc: paulus@samba.org Cc: eranian@google.com Cc: robert.richter@amd.com Cc: fweisbec@gmail.com Cc: Arnaldo Carvalho de Melo <acme@infradead.org> LKML-Reference: <20100305154128.736225361@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-10 13:22:24 +01:00
Peter Zijlstra	dc1d628a67	perf: Provide generic perf_sample_data initialization This makes it easier to extend perf_sample_data and fixes a bug on arm and sparc, which failed to set ->raw to NULL, which can cause crashes when combined with PERF_SAMPLE_RAW. It also optimizes PowerPC and tracepoint, because the struct initialization is forced to zero out the whole structure. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Jean Pihet <jpihet@mvista.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Jamie Iles <jamie.iles@picochip.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: stable@kernel.org LKML-Reference: <20100304140100.315416040@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-10 13:22:23 +01:00
Ingo Molnar	548b841669	Merge commit 'v2.6.34-rc1' into perf/urgent Conflicts: tools/perf/util/probe-event.c Merge reason: Pick up -rc1 and resolve the conflict as well. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-09 17:11:53 +01:00
Jiri Kosina	318ae2edc3	Merge branch 'for-next' into for-linus Conflicts: Documentation/filesystems/proc.txt arch/arm/mach-u300/include/mach/debug-macro.S drivers/net/qlge/qlge_ethtool.c drivers/net/qlge/qlge_main.c drivers/net/typhoon.c	2010-03-08 16:55:37 +01:00
Eric W. Biederman	361795b1eb	sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on module dynamic attributes A little more whack-a-mole annotating the dynamic sysfs attributes. I had everything built into my earlier test kernel, and so I missed these. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:51 -08:00
Eric W. Biederman	a07e4156a2	sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on dynamic attributes These are the non-static sysfs attributes that exist on my test machine. Fix them to use sysfs_attr_init or sysfs_bin_attr_init as appropriate. It simply requires making a sysfs attribute present to see this. So this is a little bit tedious but otherwise not too bad. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:51 -08:00
Emese Revfy	52cf25d0ab	Driver core: Constify struct sysfs_ops in struct kobj_type Constify struct sysfs_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification: * prevents modification of data that is shared (referenced) by many other structure instances at runtime * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime * potentially better optimized code as the compiler can assume that the const data cannot be changed * the compiler/linker move const data into .rodata and therefore exclude them from false sharing Signed-off-by: Emese Revfy <re.emese@gmail.com> Acked-by: David Teigland <teigland@redhat.com> Acked-by: Matt Domsch <Matt_Domsch@dell.com> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Acked-by: Hans J. Koch <hjk@linutronix.de> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Jens Axboe <jens.axboe@oracle.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:49 -08:00
Emese Revfy	9cd43611cc	kobject: Constify struct kset_uevent_ops Constify struct kset_uevent_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification: * prevents modification of data that is shared (referenced) by many other structure instances at runtime * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime * potentially better optimized code as the compiler can assume that the const data cannot be changed * the compiler/linker move const data into .rodata and therefore exclude them from false sharing Signed-off-by: Emese Revfy <re.emese@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:49 -08:00
Andi Kleen	c9be0a36f9	sysdev: Pass attribute in sysdev_class attributes show/store Passing the attribute to the low level IO functions allows all kinds of cleanups, by sharing low level IO code without requiring an own function for every piece of data. Also drivers can extend the attributes with own data fields and use that in the low level function. Similar to sysdev_attributes and normal attributes. This is a tree-wide sweep, converting everything in one go. No functional changes in this patch other than passing the new argument everywhere. Tested on x86, the non x86 parts are uncompiled. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:47 -08:00
Daisuke HATAYAMA	8d9032bbe4	elf coredump: add extended numbering support The current ELF dumper implementation can produce broken corefiles if program headers exceed 65535. This number is determined by the number of vmas which the process have. In particular, some extreme programs may use more than 65535 vmas. (If you google max_map_count, you can find some users facing this problem.) This kind of program never be able to generate correct coredumps. This patch implements ``extended numbering'' that uses sh_info field of the first section header instead of e_phnum field in order to represent upto 4294967295 vmas. This is supported by AMD64-ABI(http://www.x86-64.org/documentation.html) and Solaris(http://docs.sun.com/app/docs/doc/817-1984/). Of course, we are preparing patches for gdb and binutils. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:46 -08:00
Daisuke HATAYAMA	1fcccbac89	elf coredump: replace ELF_CORE_EXTRA_* macros by functions elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding macro for hiding _multiline_ logics in functions. This patch removes #ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions. For architectures not implemeonting ELF_CORE_EXTRA_*, we use weak functions in order to reduce a range of modification. This cleanup is for my next patches, but I think this cleanup itself is worth doing regardless of my firnal purpose. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:45 -08:00
Gustavo F. Padovan	cea83886dd	printk: avoid warning when CONFIG_PRINTK is disabled kernel/printk.c:72: warning: `saved_console_loglevel' defined but not used Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:33 -08:00
Tetsuo Handa	9728e5d6e6	kernel/pid.c: update comment on find_task_by_pid_ns tasklist_lock does protect the task and its pid, it can't go away. The problem is that find_pid_ns() itself is unsafe without rcu lock, it can race with copy_process()->free_pid(any_pid). Protecting copy_process()->free_pid(any_pid) with tasklist_lock would make it possible to call find_task_by_pid_ns() under tasklist safely, but we don't do so because we are trying to get rid of the read_lock sites of tasklist_lock. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:33 -08:00
Anton Blanchard	8aeee85a29	panic: fix panic_timeout accuracy when running on a hypervisor I've had some complaints about panic_timeout being wildly innacurate on shared processor PowerPC partitions (a 3 minute panic_timeout taking 30 minutes). The problem is we loop on mdelay(1) and with a 1ms in 10ms hypervisor timeslice each of these will take 10ms (ie 10x) longer. I expect other platforms with shared processor hypervisors will see the same issue. This patch keeps the old behaviour if we have a panic_blink (only keyboard LEDs right now) and does 1 second mdelays if we don't. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:33 -08:00
Jiri Slaby	78d7d407b6	kernel core: use helpers for rlimits Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in commit `3e10e716ab` ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:33 -08:00
Jiri Slaby	d4bb527438	posix-cpu-timers: cleanup rlimits usage Fetch rlimit (both hard and soft) values only once and work on them. It removes many accesses through sig structure and makes the code cleaner. Mostly a preparation for writable resource limits support. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:32 -08:00
Thiago Farina	f3abd4f953	kernel/exit.c: fix shadows sparse warning kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one kernel/exit.c:1173:21: originally declared here Signed-off-by: Thiago Farina <tfransosi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:32 -08:00
Jaswinder Singh Rajput	9c03c38356	includecheck fix for kernel/params.c Fix the following 'make includecheck' warning: kernel/params.c: linux/string.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Cc: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:32 -08:00
Dan Carpenter	5f1664f92b	splice: comparing unsigned int < 0 "ret" needs to be signed or the error handling for splice_to_pipe() won't work correctly. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Tom Zanussi <zanussi@comcast.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:32 -08:00
Chen Gong	87d5e0236d	kernel/cpu.c: delete deprecated definition in cpu_up() Additional_cpus is only supported for IA64 now. X86_64 should not be included. Signed-off-by: Chen Gong <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:28 -08:00
Rafael J. Wysocki	452aa6999e	mm/pm: force GFP_NOIO during suspend/hibernation and resume There are quite a few GFP_KERNEL memory allocations made during suspend/hibernation and resume that may cause the system to hang, because the I/O operations they depend on cannot be completed due to the underlying devices being suspended. Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in gfp_allowed_mask before suspend/hibernation and restoring the original values of these bits in gfp_allowed_mask durig the subsequent resume. [akpm@linux-foundation.org: fix CONFIG_PM=n linkage] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:26 -08:00
Rik van Riel	5beb493052	mm: change anon_vma linking to fix multi-process server scalability issue The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:26 -08:00
KAMEZAWA Hiroyuki	34e55232e5	mm: avoid false sharing of mm_counter Considering the nature of per mm stats, it's the shared object among threads and can be a cache-miss point in the page fault path. This patch adds per-thread cache for mm_counter. RSS value will be counted into a struct in task_struct and synchronized with mm's one at events. Now, in this patch, the event is the number of calls to handle_mm_fault. Per-thread value is added to mm at each 64 calls. rough estimation with small benchmark on parallel thread (2threads) shows [before] 4.5 cache-miss/faults [after] 4.0 cache-miss/faults Anyway, the most contended object is mmap_sem if the number of threads grows. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:24 -08:00
KAMEZAWA Hiroyuki	d559db086f	mm: clean up mm_counter Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:23 -08:00
Akinobu Mita	984b3f5746	bitops: rename for_each_bit() to for_each_set_bit() Rename for_each_bit to for_each_set_bit in the kernel source tree. To permit for_each_clear_bit(), should that ever be added. The patch includes a macro to map the old for_each_bit() onto the new for_each_set_bit(). This is a (very) temporary thing to ease the migration. [akpm@linux-foundation.org: add temporary for_each_bit()] Suggested-by: Alexey Dobriyan <adobriyan@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-06 11:26:23 -08:00
Tim Bird	0e95017355	function-graph: Add tracing_thresh support to function_graph tracer Add support for tracing_thresh to the function_graph tracer. This version of this feature isolates the checks into new entry and return functions, to avoid adding more conditional code into the main function_graph paths. When the tracing_thresh is set and the function graph tracer is enabled, only the functions that took longer than the time in microseconds that was set in tracing_thresh are recorded. To do this efficiently, only the function exits are recorded: [tracing]# echo 100 > tracing_thresh [tracing]# echo function_graph > current_tracer [tracing]# cat trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # \| \| \| \| \| \| \| 1) ! 119.214 us \| } /* smp_apic_timer_interrupt / 1) <========== \| 0) ! 101.527 us \| } / __rcu_process_callbacks / 0) ! 126.461 us \| } / rcu_process_callbacks / 0) ! 145.111 us \| } / __do_softirq / 0) ! 149.667 us \| } / do_softirq / 0) ! 168.817 us \| } / irq_exit / 0) ! 248.254 us \| } / smp_apic_timer_interrupt */ Also, add support for specifying tracing_thresh on the kernel command line. When used like so: "tracing_thresh=200 ftrace=function_graph" this can be used to analyse system startup. It is important to disable tracing soon after boot, in order to avoid losing the trace data. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Tim Bird <tim.bird@am.sony.com> LKML-Reference: <4B87098B.4040308@am.sony.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-05 21:20:57 -05:00
Arnaldo Carvalho de Melo	1acaa1b2d9	tracing: Update the comm field in the right variable in update_max_tr The latency output showed: # \| task: -3 (uid:0 nice:0 policy:1 rt_prio:99) The comm is missing in the "task:" and it looks like a minus 3 is the output. The correct display should be: # \| task: migration/0-3 (uid:0 nice:0 policy:1 rt_prio:99) The problem is that the comm is being stored in the wrong data structure. The max_tr.data[cpu] is what stores the comm, not the tr->data[cpu]. Before this patch the max_tr.data[cpu]->comm was zeroed and the /debug/trace ended up showing just the '-' sign followed by the pid. Also remove a needless initialization of max_data. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <1267824230-23861-1-git-send-email-acme@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-05 21:12:08 -05:00
Steven Rostedt	a094fe04c7	function-graph: Use comment notation for func names of dangling '}' When a '}' does not have a matching function start, the name is printed within parenthesis. But this makes it confusing between ending '}' and function starts. This patch makes the function name appear in C comment notation. Old view: 3) 1.281 us \| } (might_fault) 3) 3.620 us \| } (filldir) 3) 5.251 us \| } (call_filldir) 3) \| call_filldir() { 3) \| filldir() { New view: 3) 1.281 us \| } /* might_fault / 3) 3.620 us \| } / filldir / 3) 5.251 us \| } / call_filldir */ 3) \| call_filldir() { 3) \| filldir() { Requested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-05 21:11:13 -05:00
Steven Rostedt	801c29fd1f	function-graph: Fix unused reference to ftrace_set_func() The declaration of ftrace_set_func() is at the start of the ftrace.c file and wrapped with a #ifdef CONFIG_FUNCTION_GRAPH condition. If function graph tracing is enabled but CONFIG_DYNAMIC_FTRACE is not, a warning about that function being declared static and unused is given. This really should have been placed within the CONFIG_FUNCTION_GRAPH condition that uses ftrace_set_func(). Moving the declaration down fixes the warning and makes the code cleaner. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-05 21:00:30 -05:00
Linus Torvalds	660f6a360b	Merge branch 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Issue at least one memory barrier in stop_machine_text_poke() perf probe: Correct probe syntax on command line help perf probe: Add lazy line matching support perf probe: Show more lines after last line perf probe: Check function address range strictly in line finder perf probe: Use libdw callback routines perf probe: Use elfutils-libdw for analyzing debuginfo perf probe: Rename probe finder functions perf probe: Fix bugs in line range finder perf probe: Update perf probe document perf probe: Do not show --line option without dwarf support kprobes: Add documents of jump optimization kprobes/x86: Support kprobes jump optimization on x86 x86: Add text_poke_smp for SMP cross modifying code kprobes/x86: Cleanup save/restore registers kprobes/x86: Boost probes when reentering kprobes: Jump optimization sysctl interface kprobes: Introduce kprobes jump optimization kprobes: Introduce generic insn_slot framework kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE	2010-03-05 10:50:22 -08:00
Linus Torvalds	586fac13f8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: padata: Allocate the cpumask for the padata instance crypto: authenc - Move saved IV in front of the ablkcipher request crypto: hash - Fix handling of unaligned buffers crypto: authenc - Use correct ahash complete functions crypto: md5 - Set statesize	2010-03-05 10:47:57 -08:00
Linus Torvalds	0f2cc4ecd8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits) init: Open /dev/console from rootfs mqueue: fix typo "failues" -> "failures" mqueue: only set error codes if they are really necessary mqueue: simplify do_open() error handling mqueue: apply mathematics distributivity on mq_bytes calculation mqueue: remove unneeded info->messages initialization mqueue: fix mq_open() file descriptor leak on user-space processes fix race in d_splice_alias() set S_DEAD on unlink() and non-directory rename() victims vfs: add NOFOLLOW flag to umount(2) get rid of ->mnt_parent in tomoyo/realpath hppfs can use existing proc_mnt, no need for do_kern_mount() in there Mirror MS_KERNMOUNT in ->mnt_flags get rid of useless vfsmount_lock use in put_mnt_ns() Take vfsmount_lock to fs/internal.h get rid of insanity with namespace roots in tomoyo take check for new events in namespace (guts of mounts_poll()) to namespace.c Don't mess with generic_permission() under ->d_lock in hpfs sanitize const/signedness for udf nilfs: sanitize const/signedness in dealing with ->d_name.name ... Fix up fairly trivial (famous last words...) conflicts in drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c	2010-03-04 08:15:33 -08:00
Paul E. McKenney	8d53dd546f	rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare() Change the pair of rcu_dereference() calls in ftrace_perf_buf_prepare() to rcu_dereference_sched(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1267667418-32233-3-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-04 12:07:35 +01:00
Paul E. McKenney	cc5b83a9f8	rcu: Add control variables to lockdep_rcu_dereference() diagnostics Add the values of rcu_scheduler_active() and debug_locks() to the lockdep_rcu_dereference() output to help diagnose RCU lockdep splats that occur shortly after the scheduler starts. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267631219-8713-4-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-04 12:07:32 +01:00
Ingo Molnar	e02c4fd314	Merge branch 'tip/tracing/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent	2010-03-04 11:51:29 +01:00
Ingo Molnar	4f16d4e0c9	Merge branch 'perf/core' into perf/urgent Merge reason: Switch from pre-merge topical split to the post-merge urgent track Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-04 11:47:52 +01:00
Paul E. McKenney	db1466b3e1	rcu: Use wrapper function instead of exporting tasklist_lock Lockdep-RCU commit `d11c563d` exported tasklist_lock, which is not a good thing. This patch instead exports a function that uses lockdep to check whether tasklist_lock is held. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Christoph Hellwig <hch@lst.de> LKML-Reference: <1267631219-8713-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-04 11:46:14 +01:00
Ingo Molnar	0e064caf64	Merge branches 'core/futexes' and 'core/iommu' into core/urgent Merge reason: Switch from topical split to the stabilization track Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-04 11:45:31 +01:00
Steffen Klassert	7478138782	padata: Allocate the cpumask for the padata instance The cpumask of the padata instance was used without allocated. This caused boot crashes if CONFIG_CPUMASK_OFFSTACK is enabled. This patch fixes this by doing proper allocation for this cpumask. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-03-04 13:30:22 +08:00
Linus Torvalds	a27341cd5f	Prioritize synchronous signals over 'normal' signals This makes sure that we pick the synchronous signals caused by a processor fault over any pending regular asynchronous signals sent to use by [t]kill(). This is not strictly required semantics, but it makes it _much_ easier for programs like Wine that expect to find the fault information in the signal stack. Without this, if a non-synchronous signal gets picked first, the delayed asynchronous signal will have its signal context pointing to the new signal invocation, rather than the instruction that caused the SIGSEGV or SIGBUS in the first place. This is not all that pretty, and we're discussing making the synchronous signals more explicit rather than have these kinds of implicit preferences of SIGSEGV and friends. See for example http://bugzilla.kernel.org/show_bug.cgi?id=15395 for some of the discussion. But in the meantime this is a simple and fairly straightforward work-around, and the whole if (x & Y) x &= Y; thing can be compiled into (and gcc does do it) just three instructions: movq %rdx, %rax andl $Y, %eax cmovne %rax, %rdx so it is at least a simple solution to a subtle issue. Reported-and-tested-by: Pavel Vilim <wylda@volny.cz> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-03 19:21:10 -08:00
Al Viro	9643f5d94a	Merge branch 'for-fsnotify' into for-linus	2010-03-03 17:12:40 -05:00
Al Viro	1f707137b5	new helper: iterate_mounts() apply function to vfsmounts in set returned by collect_mounts(), stop if it returns non-zero. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-03 14:07:57 -05:00
Al Viro	2096f759ab	New helper: path_is_under(path1, path2) Analog of is_subdir for vfsmount,dentry pairs, moved from audit_tree.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-03 14:07:55 -05:00
Al Viro	8737c9305b	Switch may_open() and break_lease() to passing O_... ... instead of mixing FMODE_ and O_ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-03-03 13:00:21 -05:00
Linus Torvalds	2a32f2db13	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: resource: Fix broken indentation resource: Fix generic page_is_ram() for partial RAM pages x86, paravirt: Remove kmap_atomic_pte paravirt op. x86, vmi: Disable highmem PTE allocation even when CONFIG_HIGHPTE=y x86, xen: Disable highmem PTE allocation even when CONFIG_HIGHPTE=y	2010-03-03 09:11:02 -08:00
Linus Torvalds	fb7b096d94	Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits) x86: Fix out of order of gsi x86: apic: Fix mismerge, add arch_probe_nr_irqs() again x86, irq: Keep chip_data in create_irq_nr and destroy_irq xen: Remove unnecessary arch specific xen irq functions. smp: Use nr_cpus= to set nr_cpu_ids early x86, irq: Remove arch_probe_nr_irqs sparseirq: Use radix_tree instead of ptrs array sparseirq: Change irq_desc_ptrs to static init: Move radix_tree_init() early irq: Remove unnecessary bootmem code x86: Add iMac9,1 to pci_reboot_dmi_table x86: Convert i8259_lock to raw_spinlock x86: Convert nmi_lock to raw_spinlock x86: Convert ioapic_lock and vector_lock to raw_spinlock x86: Avoid race condition in pci_enable_msix() x86: Fix SCI on IOAPIC != 0 x86, ia32_aout: do not kill argument mapping x86, irq: Move __setup_vector_irq() before the first irq enable in cpu online path x86, irq: Update the vector domain for legacy irqs handled by io-apic x86, irq: Don't block IRQ0_VECTOR..IRQ15_VECTOR's on all cpu's ...	2010-03-03 08:15:37 -08:00
Linus Torvalds	a626b46e17	Merge branch 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits) early_res: Need to save the allocation name in drop_range_partial() sparsemem: Fix compilation on PowerPC early_res: Add free_early_partial() x86: Fix non-bootmem compilation on PowerPC core: Move early_res from arch/x86 to kernel/ x86: Add find_fw_memmap_area Move round_up/down to kernel.h x86: Make 32bit support NO_BOOTMEM early_res: Enhance check_and_double_early_res x86: Move back find_e820_area to e820.c x86: Add find_early_area_size x86: Separate early_res related code from e820.c x86: Move bios page reserve early to head32/64.c sparsemem: Put mem map for one node together. sparsemem: Put usemap for one node together x86: Make 64 bit use early_res instead of bootmem before slab x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA x86: Make early_node_mem get mem > 4 GB if possible x86: Dynamically increase early_res array size x86: Introduce max_early_res and early_res_count ...	2010-03-03 08:15:05 -08:00
Linus Torvalds	0a135ba14d	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: add __percpu sparse annotations to what's left percpu: add __percpu sparse annotations to fs percpu: add __percpu sparse annotations to core kernel subsystems local_t: Remove leftover local.h this_cpu: Remove pageset_notifier this_cpu: Page allocator conversion percpu, x86: Generic inc / dec percpu instructions local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c module: Use this_cpu_xx to dynamically allocate counters local_t: Remove cpu_local_xx macros percpu: refactor the code in pcpu_[de]populate_chunk() percpu: remove compile warnings caused by __verify_pcpu_ptr() percpu: make accessors check for percpu pointer in sparse percpu: add __percpu for sparse. percpu: make access macros universal percpu: remove per_cpu__ prefix.	2010-03-03 07:34:18 -08:00
Denys Vlasenko	3d9a854c2d	Rename .data[.percpu][.XXX] to .data[..percpu][..XXX]. Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Signed-off-by: Michal Marek <mmarek@suse.cz>	2010-03-03 11:26:00 +01:00
Lai Jiangshan	ac91d85456	tracing: Fix warning in s_next of trace file ops This warning in s_next() can be triggered by lseek(): [<c018b3f7>] ? s_next+0x77/0x80 [<c013e3c1>] warn_slowpath_common+0x81/0xa0 [<c018b3f7>] ? s_next+0x77/0x80 [<c013e3fa>] warn_slowpath_null+0x1a/0x20 [<c018b3f7>] s_next+0x77/0x80 [<c01efa77>] traverse+0x117/0x200 [<c01eff13>] seq_lseek+0xa3/0x120 [<c01efe70>] ? seq_lseek+0x0/0x120 [<c01d7081>] vfs_llseek+0x41/0x50 [<c01d8116>] sys_llseek+0x66/0xa0 [<c0102bd0>] sysenter_do_call+0x12/0x26 The iterator "leftover" variable is zeroed in the opening of the trace file. But lseek can call s_start() which will call s_next() without reseting the "leftover" variable back to zero, which might trigger the WARN_ON_ONCE(iter->leftover) that is in s_next(). Cc: stable@kernel.org Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B8CE06A.9090207@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-03-02 21:11:47 -05:00
Linus Torvalds	832d30ca72	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (38 commits) SELinux: Make selinux_kernel_create_files_as() shouldn't just always return 0 TOMOYO: Protect find_task_by_vpid() with RCU. Security: add static to security_ops and default_security_ops variable selinux: libsepol: remove dead code in check_avtab_hierarchy_callback() TOMOYO: Remove __func__ from tomoyo_is_correct_path/domain security: fix a couple of sparse warnings TOMOYO: Remove unneeded parameter. TOMOYO: Use shorter names. TOMOYO: Use enum for index numbers. TOMOYO: Add garbage collector. TOMOYO: Add refcounter on domain structure. TOMOYO: Merge headers. TOMOYO: Add refcounter on string data. TOMOYO: Reduce lines by using common path for addition and deletion. selinux: fix memory leak in sel_make_bools TOMOYO: Extract bitfield syslog: clean up needless comment syslog: use defined constants instead of raw numbers syslog: distinguish between /proc/kmsg and syscalls selinux: allow MLS->non-MLS and vice versa upon policy reload ...	2010-03-02 14:47:24 -08:00
H. Peter Anvin	f41496607e	resource: Fix broken indentation Fix broken indentation in patch `37b99dd537`. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Wu Fengguang <fengguang.wu@intel.com> LKML-Reference: <20100301135551.GA9998@localhost> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-03-02 11:21:09 -08:00
Linus Torvalds	b7f3a209e9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next-2.6: sparc: Support show_unhandled_signals. sparc: use __ratelimit sunxvr500: Additional PCI id for sunxvr500 driver sparc: use asm-generic/scatterlist.h sparc64: If 'slot-names' property exist, create sysfs PCI slot information. sparc: remove trailing space in messages sparc: remove redundant return statements	2010-03-02 07:56:44 -08:00
Linus Torvalds	6d6b89bd2e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1341 commits) virtio_net: remove forgotten assignment be2net: fix tx completion polling sis190: fix cable detect via link status poll net: fix protocol sk_buff field bridge: Fix build error when IGMP_SNOOPING is not enabled bnx2x: Tx barriers and locks scm: Only support SCM_RIGHTS on unix domain sockets. vhost-net: restart tx poll on sk_sndbuf full vhost: fix get_user_pages_fast error handling vhost: initialize log eventfd context pointer vhost: logging thinko fix wireless: convert to use netdev_for_each_mc_addr ethtool: do not set some flags, if others failed ipoib: returned back addrlen check for mc addresses netlink: Adding inode field to /proc/net/netlink axnet_cs: add new id bridge: Make IGMP snooping depend upon BRIDGE. bridge: Add multicast count/interval sysfs entries bridge: Add hash elasticity/max sysfs entries bridge: Add multicast_snooping sysfs toggle ... Trivial conflicts in Documentation/feature-removal-schedule.txt	2010-03-02 07:55:08 -08:00
Peter Zijlstra	320ebf09cb	perf, x86: Restrict the ANY flag The ANY flag can show SMT data of another task (like 'top'), so we want to disable it when system-wide profiling is disabled. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-02 15:06:46 +01:00
Peter Zijlstra	5671a10e2b	nmi_watchdog: Tell the world we're active Because I was wondering why perf stat wasn't working as expected.. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-02 15:02:05 +01:00
john stultz	ad6759fbf3	timekeeping: Prevent oops when GENERIC_TIME=n Aaro Koskinen reported an issue in kernel.org bugzilla #15366, where on non-GENERIC_TIME systems, accessing /sys/devices/system/clocksource/clocksource0/current_clocksource results in an oops. It seems the timekeeper/clocksource rework missed initializing the curr_clocksource value in the !GENERIC_TIME case. Thanks to Aaro for reporting and diagnosing the issue as well as testing the fix! Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: stable@kernel.org LKML-Reference: <1267475683.4216.61.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-03-02 09:22:25 +01:00
Yinghai Lu	dce46a04d5	early_res: Need to save the allocation name in drop_range_partial() During free_early_partial(), reserve_early_without_check() could end extending the early_res area from __check_and_double_early_res(); as a result, the location of the name for the current reservation could change. Therefore, we need to save a local copy of the name. [ hpa: rewrote comment and checkin description ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4B8C7C94.7070000@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-03-01 23:23:02 -08:00
Wu Fengguang	37b99dd537	resource: Fix generic page_is_ram() for partial RAM pages The System RAM walk shall skip partial RAM pages and avoid calling func() on them. So that page_is_ram() return 0 for a partial RAM page. In particular, it shall not call func() with len=0. This fixes a boot time bug reported by Sachin and root caused by Thomas: > >>> WARNING: at arch/x86/mm/ioremap.c:111 __ioremap_caller+0x169/0x2f1() > >>> Hardware name: BladeCenter LS21 -[79716AA]- > >>> Modules linked in: > >>> Pid: 0, comm: swapper Not tainted 2.6.33-git6-autotest #1 > >>> Call Trace: > >>> [<ffffffff81047cff>] ? __ioremap_caller+0x169/0x2f1 > >>> [<ffffffff81063b7d>] warn_slowpath_common+0x77/0xa4 > >>> [<ffffffff81063bb9>] warn_slowpath_null+0xf/0x11 > >>> [<ffffffff81047cff>] __ioremap_caller+0x169/0x2f1 > >>> [<ffffffff813747a3>] ? acpi_os_map_memory+0x12/0x1b > >>> [<ffffffff81047f10>] ioremap_nocache+0x12/0x14 > >>> [<ffffffff813747a3>] acpi_os_map_memory+0x12/0x1b > >>> [<ffffffff81282fa0>] acpi_tb_verify_table+0x29/0x5b > >>> [<ffffffff812827f0>] acpi_load_tables+0x39/0x15a > >>> [<ffffffff8191c8f8>] acpi_early_init+0x60/0xf5 > >>> [<ffffffff818f2cad>] start_kernel+0x397/0x3a7 > >>> [<ffffffff818f2295>] x86_64_start_reservations+0xa5/0xa9 > >>> [<ffffffff818f237a>] x86_64_start_kernel+0xe1/0xe8 > >>> ---[ end trace 4eaa2a86a8e2da22 ]--- > >>> ioremap reserve_memtype failed -22 The return code is -EINVAL, so it failed in the is_ram check, which is not too surprising > BIOS-provided physical RAM map: > BIOS-e820: 0000000000000000 - 000000000009c000 (usable) > BIOS-e820: 000000000009c000 - 00000000000a0000 (reserved) > BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) > BIOS-e820: 0000000000100000 - 00000000cffa3900 (usable) > BIOS-e820: 00000000cffa3900 - 00000000cffa7400 (ACPI data) The ACPI data is not starting on a page boundary and neither does the usable RAM area end on a page boundary. Very useful ! > ACPI: DSDT 00000000cffa3900 036CE (v01 IBM SERLEWIS 00001000 INTL 20060912) ACPI is trying to map DSDT at cffa3900, which results in a check vs. cffa3000 which is the relevant page boundary. The generic is_ram check correctly identifies that as RAM because it's in the usable resource area. The old e820 based is_ram check does not take overlapping resource areas into account. That's why it works. CC: Sachin Sant <sachinp@in.ibm.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> LKML-Reference: <20100301135551.GA9998@localhost> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-03-01 10:18:32 -08:00
Linus Torvalds	b1bf936840	Merge branch 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.34' of git://git.kernel.dk/linux-2.6-block: (38 commits) block: don't access jiffies when initialising io_context cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds block: fix for "Consolidate phys_segment and hw_segment limits" cfq-iosched: quantum check tweak blktrace: perform cleanup after setup error blkdev: fix merge_bvec_fn return value checks cfq-iosched: requests "in flight" vs "in driver" clarification cciss: Fix problem with scatter gather elements in the scsi half of the driver cciss: eliminate unnecessary pointer use in cciss scsi code cciss: do not use void pointer for scsi hba data cciss: factor out scatter gather chain block mapping code cciss: fix scatter gather chain block dma direction kludge cciss: simplify scatter gather code cciss: factor out scatter gather chain block allocation and freeing cciss: detect bad alignment of scsi commands at build time cciss: clarify command list padding calculation cfq-iosched: rethink seeky detection for SSDs cfq-iosched: rework seeky detection block: remove padding from io_context on 64bit builds block: Consolidate phys_segment and hw_segment limits ...	2010-03-01 09:00:29 -08:00
Linus Torvalds	0f45339794	Merge branches 'futexes-for-linus', 'irq-core-for-linus' and 'bkl-drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Protect pid lookup in compat code with RCU * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix documentation of default chip disable() * 'bkl-drivers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: nvram: Drop the BKL from nvram_open()	2010-03-01 08:51:52 -08:00
Linus Torvalds	e56425b135	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix-timers.c: Don't export local functions clocksource: start CMT at clocksource resume clocksource: add suspend callback clocksource: add argument to resume callback ntp: Cleanup xtime references in ntp.c ntp: Make time_esterror and time_maxerror static	2010-03-01 08:48:25 -08:00
Paul E. McKenney	90a6501f94	sched, rcu: Fix rcu_dereference() for RCU-lockdep Make rcu_dereference() of runqueue data structures be rcu_dereference_sched(). Located-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20100228163218.GD6846@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-01 09:29:58 +01:00
Ingo Molnar	e2f4699ac1	Merge branch 'linus' into core/rcu Merge reason: Backmerge latest upstream to queue up dependent fix in the scheduler. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-03-01 09:28:58 +01:00
David S. Miller	4b17764737	sparc: Support show_unhandled_signals. Just faults right now, will add other traps later. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-03-01 00:02:23 -08:00
David S. Miller	47871889c6	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/firmware/iscsi_ibft.c	2010-02-28 19:23:06 -08:00
James Morris	b4ccebdd37	Merge branch 'next' into for-linus	2010-03-01 09:36:31 +11:00
Frederic Weisbecker	1e259e0a99	hw-breakpoints: Remove stub unthrottle callback We support event unthrottling in breakpoint events. It means that if we have more than sysctl_perf_event_sample_rate/HZ, perf will throttle, ignoring subsequent events until the next tick. So if ptrace exceeds this max rate, it will omit events, which breaks the ptrace determinism that is supposed to report every triggered breakpoints. This is likely to happen if we set sysctl_perf_event_sample_rate to 1. This patch removes support for unthrottling in breakpoint events to break throttling and restore ptrace determinism. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: 2.6.33.x <stable@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org>	2010-02-28 20:51:15 +01:00
Linus Torvalds	6f5621cb16	Merge branch 'x86-ptrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-ptrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, ptrace: Remove set_stopped_child_used_math() in [x]fpregs_set x86, ptrace: Simplify xstateregs_get() ptrace: Fix ptrace_regset() comments and diagnose errors specifically parisc: Disable CONFIG_HAVE_ARCH_TRACEHOOK ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET x86, ptrace: regset extensions to support xstate	2010-02-28 10:59:44 -08:00
Dmitry Monakhov	9a8c28c831	blktrace: perform cleanup after setup error Currently even if BLKTRACESETUP ioctl has failed user must call BLKTRACETEARDOWN to be shure what all staff was cleaned, which is contr-intuitive. Let's setup ioctl make necessery cleanup by it self. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-28 19:47:19 +01:00
Frederic Weisbecker	ae1f30384b	tracing: Include irqflags headers from trace clock trace_clock.c includes spinlock.h, which ends up including asm/system.h, which in turn includes linux/irqflags.h in x86. So the definition of raw_local_irq_save is luckily covered there, but this is not the case in parisc: tip/kernel/trace/trace_clock.c:86: error: implicit declaration of function 'raw_local_irq_save' tip/kernel/trace/trace_clock.c:112: error: implicit declaration of function 'raw_local_irq_restore' We need to include linux/irqflags.h directly from trace_clock.c to avoid such build error. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Robert Richter <robert.richter@amd.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-28 19:45:01 +01:00
Linus Torvalds	46bbffad54	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, mm: Unify kernel_physical_mapping_init() API x86, mm: Allow highmem user page tables to be disabled at boot time x86: Do not reserve brk for DMI if it's not going to be used x86: Convert tlbstate_lock to raw_spinlock x86: Use the generic page_is_ram() x86: Remove BIOS data range from e820 Move page_is_ram() declaration to mm.h Generic page_is_ram: use __weak resources: introduce generic page_is_ram()	2010-02-28 10:38:45 -08:00
Linus Torvalds	f66ffdedbf	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits) sched: Fix SCHED_MC regression caused by change in sched cpu_power sched: Don't use possibly stale sched_class kthread, sched: Remove reference to kthread_create_on_cpu sched: cpuacct: Use bigger percpu counter batch values for stats counters percpu_counter: Make __percpu_counter_add an inline function on UP sched: Remove member rt_se from struct rt_rq sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu] sched: Remove unused update_shares_locked() sched: Use for_each_bit sched: Queue a deboosted task to the head of the RT prio queue sched: Implement head queueing for sched_rt sched: Extend enqueue_task to allow head queueing sched: Remove USER_SCHED sched: Fix the place where group powers are updated sched: Assume *balance is valid sched: Remove load_balance_newidle() sched: Unify load_balance{,_newidle}() sched: Add a lock break for PREEMPT=y sched: Remove from fwd decls sched: Remove rq_iterator from move_one_task ... Fix up trivial conflicts in kernel/sched.c	2010-02-28 10:31:01 -08:00
Linus Torvalds	2531216f23	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix race between ttwu() and task_rq_lock() sched: Fix SMT scheduler regression in find_busiest_queue() sched: Fix sched_mv_power_savings for !SMT kernel/sched.c: Suppress unused var warning	2010-02-28 10:23:41 -08:00
Linus Torvalds	6556a67435	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (172 commits) perf_event, amd: Fix spinlock initialization perf_event: Fix preempt warning in perf_clock() perf tools: Flush maps on COMM events perf_events, x86: Split PMU definitions into separate files perf annotate: Handle samples not at objdump output addr boundaries perf_events, x86: Remove superflous MSR writes perf_events: Simplify code by removing cpu argument to hw_perf_group_sched_in() perf_events, x86: AMD event scheduling perf_events: Add new start/stop PMU callbacks perf_events: Report the MMAP pgoff value in bytes perf annotate: Defer allocating sym_priv->hist array perf symbols: Improve debugging information about symtab origins perf top: Use a macro instead of a constant variable perf symbols: Check the right return variable perf/scripts: Tag syscall_name helper as not yet available perf/scripts: Add perf-trace-python Documentation perf/scripts: Remove unnecessary PyTuple resizes perf/scripts: Add syscall tracing scripts perf/scripts: Add Python scripting engine perf/scripts: Remove check-perf-trace from listed scripts ... Fix trivial conflict in tools/perf/util/probe-event.c	2010-02-28 10:20:25 -08:00
Linus Torvalds	e0d272429a	Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits) ftrace: Add function names to dangling } in function graph tracer tracing: Simplify memory recycle of trace_define_field tracing: Remove unnecessary variable in print_graph_return tracing: Fix typo of info text in trace_kprobe.c tracing: Fix typo in prof_sysexit_enable() tracing: Remove CONFIG_TRACE_POWER from kernel config tracing: Fix ftrace_event_call alignment for use with gcc 4.5 ftrace: Remove memory barriers from NMI code when not needed tracing/kprobes: Add short documentation for HAVE_REGS_AND_STACK_ACCESS_API s390: Add pt_regs register and stack access API tracing/kprobes: Make Kconfig dependencies generic tracing: Unify arch_syscall_addr() implementations tracing: Add notrace to TRACE_EVENT implementation functions ftrace: Allow to remove a single function from function graph filter tracing: Add correct/incorrect to sort keys for branch annotation output tracing: Simplify test for function_graph tracing start point tracing: Drop the tr check from the graph tracing path tracing: Add stack dump to trace_printk if stacktrace option is set tracing: Use appropriate perl constructs in recordmcount.pl tracing: optimize recordmcount.pl for offsets-handling ...	2010-02-28 10:17:55 -08:00
Linus Torvalds	642c4c75a7	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits) rcu: Fix accelerated GPs for last non-dynticked CPU rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot rcu: Fix accelerated grace periods for last non-dynticked CPU rcu: Export rcu_scheduler_active rcu: Make rcu_read_lock_sched_held() take boot time into account rcu: Make lockdep_rcu_dereference() message less alarmist sched, cgroups: Fix module export rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information rcu: Fix rcutorture mod_timer argument to delay one jiffy rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection rcu: Convert to raw_spinlocks rcu: Stop overflowing signed integers rcu: Use canonical URL for Mathieu's dissertation rcu: Accelerate grace period if last non-dynticked CPU rcu: Fix citation of Mathieu's dissertation rcu: Documentation update for CONFIG_PROVE_RCU security: Apply lockdep-based checking to rcu_dereference() uses idr: Apply lockdep-based diagnostics to rcu_dereference() uses radix-tree: Disable RCU lockdep checking in radix tree vfs: Abstract rcu_dereference_check for files-fdtable use ...	2010-02-28 10:13:16 -08:00
Linus Torvalds	f91b22c35f	Merge branches 'core-ipi-for-linus', 'core-locking-for-linus', 'tracing-fixes-for-linus', 'x86-debug-for-linus', 'x86-doc-for-linus', 'x86-gpu-for-linus' and 'x86-rlimit-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: plist: Fix grammar mistake, and c-style mistake * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: kprobes: Add mcount to the kprobes blacklist * 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86_64: Print modules like i386 does * 'x86-doc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Put 'nopat' in kernel-parameters * 'x86-gpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86-64: Allow fbdev primary video code * 'x86-rlimit-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Use helpers for rlimits	2010-02-28 10:04:02 -08:00
Paul E. McKenney	622ea685f1	rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU Make the holdoff only happen when the full number of attempts have been made. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267311188-16603-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-28 09:17:42 +01:00
Tejun Heo	44ee63587d	percpu: Add __percpu sparse annotations to hw_breakpoint Add __percpu sparse annotations to hw_breakpoint. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. In kernel/hw_breakpoint.c, per_cpu(nr_task_bp_pinned, cpu)'s will trigger spurious noderef related warnings from sparse. Changing it to &per_cpu(nr_task_bp_pinned[0], cpu) will work around the problem but deemed to ugly by the maintainer. Leave it alone until better solution can be found. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: K.Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <4B7B4B7A.9050902@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-02-27 16:23:39 +01:00
Frederic Weisbecker	018cbffe68	Merge commit 'v2.6.33' into perf/core Merge reason: __percpu annotations need the corresponding sparse address space definition upstream. Conflicts: tools/perf/util/probe-event.c (trivial)	2010-02-27 16:18:46 +01:00
Ingo Molnar	480917427b	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core	2010-02-27 10:41:16 +01:00
Ingo Molnar	6fb83029db	Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core	2010-02-27 10:06:10 +01:00
Paul E. McKenney	71da81324c	rcu: Fix accelerated GPs for last non-dynticked CPU This patch disables irqs across the call to rcu_needs_cpu(). It also enforces a hold-off period so that the idle loop doesn't softirq itself to death when there are lots of RCU callbacks in flight on the last non-dynticked CPU. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267231138-27856-3-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-27 09:53:53 +01:00
Paul E. McKenney	a47cd880b5	rcu: Fix accelerated grace periods for last non-dynticked CPU It is invalid to invoke __rcu_process_callbacks() with irqs disabled, so do it indirectly via raise_softirq(). This requires a state-machine implementation to cycle through the grace-period machinery the required number of times. Located-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267231138-27856-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-27 09:53:52 +01:00
Linus Torvalds	06a79b82b2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Fix preallocating of memory PM / Hibernate: Remove swsusp.c finally PM / Hibernate: Remove trailing space in message PM: Allow SCSI devices to suspend/resume asynchronously PM: Allow USB devices to suspend/resume asynchronously USB: implement non-tree resume ordering constraints for PCI host controllers PM: Allow PCI devices to suspend/resume asynchronously PM / Hibernate: Swap, remove useless check from swsusp_read() PM / Hibernate: Really deprecate deprecated user ioctls PM: Allow device drivers to use dpm_wait() PM: Start asynchronous resume threads upfront PM: Add facility for advanced testing of async suspend/resume PM: Add a switch for disabling/enabling asynchronous suspend/resume PM: Asynchronous suspend and resume of devices PM: Add parent information to timing messages PM: Document device power attributes in sysfs PM / Runtime: Add sysfs switch for disabling device run-time PM	2010-02-26 17:22:53 -08:00
Linus Torvalds	37d4008484	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (31 commits) crypto: aes_generic - Fix checkpatch errors crypto: fcrypt - Fix checkpatch errors crypto: ecb - Fix checkpatch errors crypto: des_generic - Fix checkpatch errors crypto: deflate - Fix checkpatch errors crypto: crypto_null - Fix checkpatch errors crypto: cipher - Fix checkpatch errors crypto: crc32 - Fix checkpatch errors crypto: compress - Fix checkpatch errors crypto: cast6 - Fix checkpatch errors crypto: cast5 - Fix checkpatch errors crypto: camellia - Fix checkpatch errors crypto: authenc - Fix checkpatch errors crypto: api - Fix checkpatch errors crypto: anubis - Fix checkpatch errors crypto: algapi - Fix checkpatch errors crypto: blowfish - Fix checkpatch errors crypto: aead - Fix checkpatch errors crypto: ablkcipher - Fix checkpatch errors crypto: pcrypt - call the complete function on error ...	2010-02-26 16:50:02 -08:00
Steven Rostedt	f1c7f517a5	ftrace: Add function names to dangling } in function graph tracer The function graph tracer is currently the most invasive tracer in the ftrace family. It can easily overflow the buffer even with 10megs per CPU. This means that events can often be lost. On start up, or after events are lost, if the function return is recorded but the function enter was lost, all we get to see is the exiting '}'. Here is how a typical trace output starts: [tracing] cat trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # \| \| \| \| \| \| \| 0) + 91.897 us \| } 0) ! 567.961 us \| } 0) <========== \| 0) ! 579.083 us \| _raw_spin_lock_irqsave(); 0) 4.694 us \| _raw_spin_unlock_irqrestore(); 0) ! 594.862 us \| } 0) ! 603.361 us \| } 0) ! 613.574 us \| } 0) ! 623.554 us \| } 0) 3.653 us \| fget_light(); 0) \| sock_poll() { There are a series of '}' with no matching "func() {". There's no information to what functions these ending brackets belong to. This patch adds a stack on the per cpu structure used in outputting the function graph tracer to keep track of what function was outputted. Then on a function exit event, it checks the depth to see if the function exit has a matching entry event. If it does, then it only prints the '}', otherwise it adds the function name after the '}'. This allows function exit events to show what function they belong to at trace output startup, when the entry was lost due to ring buffer overflow, or even after a new task is scheduled in. Here is what the above trace will look like after this patch: [tracing] cat trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # \| \| \| \| \| \| \| 0) + 91.897 us \| } (irq_exit) 0) ! 567.961 us \| } (smp_apic_timer_interrupt) 0) <========== \| 0) ! 579.083 us \| _raw_spin_lock_irqsave(); 0) 4.694 us \| _raw_spin_unlock_irqrestore(); 0) ! 594.862 us \| } (add_wait_queue) 0) ! 603.361 us \| } (__pollwait) 0) ! 613.574 us \| } (tcp_poll) 0) ! 623.554 us \| } (sock_poll) 0) 3.653 us \| fget_light(); 0) \| sock_poll() { Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-26 19:25:53 -05:00
Rafael J. Wysocki	a9c9b4429d	PM / Hibernate: Fix preallocating of memory The hibernate memory preallocation code allocates memory to push some user space data out of physical RAM, so that the hibernation image is not too large. It allocates more memory than necessary for creating the image, so it has to release some pages to make room for allocations made while suspending devices and disabling nonboot CPUs, or the system will hang due to the lack of free pages to allocate from. Unfortunately, the function used for freeing these pages, free_unnecessary_pages(), contains a bug that prevents it from doing the job on all systems without highmem. Fix this problem, which is a regression from the 2.6.30 kernel, by using the right condition for the termination of the loop in free_unnecessary_pages(). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-and-tested-by: Alan Jenkins <sourcejedi.lkml@googlemail.com> Cc: stable@kernel.org	2010-02-26 20:39:13 +01:00
Jiri Slaby	f8bb0db818	PM / Hibernate: Remove swsusp.c finally Its contents and entry in Makefile were already removed in `8e60c6a134` (Shift remaining code from swsusp.c to hibernate.c) but somehow it remained in-place (rjw: which most likely was my mistake). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:13 +01:00
Frans Pop	07c3bb5797	PM / Hibernate: Remove trailing space in message Remove a trailing space from a message in swsusp_save(). Signed-off-by: Frans Pop <elendil@planet.nl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:13 +01:00
Jiri Slaby	09c09bc618	PM / Hibernate: Swap, remove useless check from swsusp_read() It will never reach here if the sws_resume_bdev is erratic. swsusp_read() is called only from software_resume(), but after swsusp_check() which would catch the error state. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:11 +01:00
Jiri Slaby	b694e52ebd	PM / Hibernate: Really deprecate deprecated user ioctls They were deprecated and removed from exported headers more than 2 years ago. Inform users about their removal in the future now. (Switch cases needed to be reorderded for an easy fall through.) And add an entry to feature-removal-schedule. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:11 +01:00
Rafael J. Wysocki	5a2eb8585f	PM: Add facility for advanced testing of async suspend/resume Add configuration switch CONFIG_PM_ADVANCED_DEBUG for compiling in extra PM debugging/testing code allowing one to access some PM-related attributes of devices from the user space via sysfs. If CONFIG_PM_ADVANCED_DEBUG is set, add sysfs attribute power/async for every device allowing the user space to access the device's power.async_suspend flag and modify it, if desired. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:10 +01:00
Rafael J. Wysocki	0e06b4a891	PM: Add a switch for disabling/enabling asynchronous suspend/resume Add sysfs attribute /sys/power/pm_async allowing the user space to disable/enable asynchronous suspend/resume of devices. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-02-26 20:39:10 +01:00
Linus Torvalds	68c6b85984	Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (48 commits) x86/PCI: Prevent mmconfig memory corruption ACPI: Use GPE reference counting to support shared GPEs x86/PCI: use host bridge _CRS info by default on 2008 and newer machines PCI: augment bus resource table with a list PCI: add pci_bus_for_each_resource(), remove direct bus->resource[] refs PCI: read bridge windows before filling in subtractive decode resources PCI: split up pci_read_bridge_bases() PCIe PME: use pci_pcie_cap() PCI PM: Run-time callbacks for PCI bus type PCIe PME: use pci_is_pcie() PCI / ACPI / PM: Platform support for PCI PME wake-up ACPI / ACPICA: Multiple system notify handlers per device ACPI / PM: Add more run-time wake-up fields ACPI: Use GPE reference counting to support shared GPEs PCI PM: Make it possible to force using INTx for PCIe PME signaling PCI PM: PCIe PME root port service driver PCI PM: Add function for checking PME status of devices PCI: mark is_pcie obsolete PCI: set PCI_PREF_RANGE_TYPE_64 in pci_bridge_check_ranges PCI: pciehp: second try to get big range for pcie devices ...	2010-02-26 10:35:27 -08:00
Peter Zijlstra	24691ea964	perf_event: Fix preempt warning in perf_clock() A recent commit introduced a preemption warning for perf_clock(), use raw_smp_processor_id() to avoid this, it really doesn't matter which cpu we use here. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1267198583.22519.684.camel@laptop> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 17:25:00 +01:00
Suresh Siddha	dd5feea14a	sched: Fix SCHED_MC regression caused by change in sched cpu_power On platforms like dual socket quad-core platform, the scheduler load balancer is not detecting the load imbalances in certain scenarios. This is leading to scenarios like where one socket is completely busy (with all the 4 cores running with 4 tasks) and leaving another socket completely idle. This causes performance issues as those 4 tasks share the memory controller, last-level cache bandwidth etc. Also we won't be taking advantage of turbo-mode as much as we would like, etc. Some of the comparisons in the scheduler load balancing code are comparing the "weighted cpu load that is scaled wrt sched_group's cpu_power" with the "weighted average load per task that is not scaled wrt sched_group's cpu_power". While this has probably been broken for a longer time (for multi socket numa nodes etc), the problem got aggrevated via this recent change: \| \| commit `f93e65c186` \| Author: Peter Zijlstra <a.p.zijlstra@chello.nl> \| Date: Tue Sep 1 10:34:32 2009 +0200 \| \| sched: Restore __cpu_power to a straight sum of power \| Also with this change, the sched group cpu power alone no longer reflects the group capacity that is needed to implement MC, MT performance (default) and power-savings (user-selectable) policies. We need to use the computed group capacity (sgs.group_capacity, that is computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to find out if the group with the max load is above its capacity and how much load to move etc. Reported-by: Ma Ling <ling.ma@intel.com> Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> [ -v2: build fix ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> # [2.6.32.x, 2.6.33.x] LKML-Reference: <1266970432.11588.22.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 15:45:13 +01:00
Peter Zijlstra	6e37738a2f	perf_events: Simplify code by removing cpu argument to hw_perf_group_sched_in() Since the cpu argument to hw_perf_group_sched_in() is always smp_processor_id(), simplify the code a little by removing this argument and using the current cpu where needed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1265890918.5396.3.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 10:56:53 +01:00
Stephane Eranian	38331f62c2	perf_events, x86: AMD event scheduling This patch adds correct AMD NorthBridge event scheduling. NB events are events measuring L3 cache, Hypertransport traffic. They are identified by an event code >= 0xe0. They measure events on the Northbride which is shared by all cores on a package. NB events are counted on a shared set of counters. When a NB event is programmed in a counter, the data actually comes from a shared counter. Thus, access to those counters needs to be synchronized. We implement the synchronization such that no two cores can be measuring NB events using the same counters. Thus, we maintain a per-NB allocation table. The available slot is propagated using the event_constraint structure. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4b703957.0702d00a.6bf2.7b7d@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 10:56:53 +01:00
Stephane Eranian	d76a0812ac	perf_events: Add new start/stop PMU callbacks In certain situations, the kernel may need to stop and start the same event rapidly. The current PMU callbacks do not distinguish between stop and release (i.e., stop + free the resource). Thus, a counter may be released, then it will be immediately re-acquired. Event scheduling will again take place with no guarantee to assign the same counter. On some processors, this may event yield to failure to assign the event back due to competion between cores. This patch is adding a new pair of callback to stop and restart a counter without actually release the underlying counter resource. On stop, the counter is stopped, its values saved and that's it. On start, the value is reloaded and counter is restarted (on x86, actual restart is delayed until perf_enable()). Signed-off-by: Stephane Eranian <eranian@google.com> [ added fallback to ->enable/->disable for all other PMUs fixed x86_pmu_start() to call x86_pmu.enable() merged __x86_pmu_disable into x86_pmu_stop() ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4b703875.0a04d00a.7896.ffffb824@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 10:56:53 +01:00
Peter Zijlstra	3a0304e90a	perf_events: Report the MMAP pgoff value in bytes DaveM reported that currently perf interprets the pgoff value reported by the MMAP events as a byte range, but the kernel reports it as a page offset. Since its broken (and unusable) anyway, change the kernel behaviour (ABI) to report bytes indeed, avoiding the need for userspace to deal with PAGE_SIZE things. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 10:56:52 +01:00
Ingo Molnar	281b3714e9	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core	2010-02-26 09:20:17 +01:00
Ingo Molnar	64b9fb5704	Merge commit 'v2.6.33' into tracing/core Conflicts: scripts/recordmcount.pl Merge reason: Merge up to v2.6.33. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 09:18:32 +01:00
Yinghai Lu	fb90ef93df	early_res: Add free_early_partial() To free partial areas in pcpu_setup... Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> LKML-Reference: <4B85E245.5030001@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 08:25:35 +01:00
Paul E. McKenney	f5f6540964	rcu: Export rcu_scheduler_active Kernel modules using rcu_read_lock_sched_held() must now have access to rcu_scheduler_active, so it must be exported. This should fix the fix for the boot-time RCU-lockdep splat. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20100226030230.GA7743@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 08:20:46 +01:00
Paul E. McKenney	d9f1bb6ad7	rcu: Make rcu_read_lock_sched_held() take boot time into account Before the scheduler starts, all tasks are non-preemptible by definition. So, during that time, rcu_read_lock_sched_held() needs to always return "true". This patch makes that be so. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267135607-7056-2-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 08:20:46 +01:00
Paul E. McKenney	056ba4a9be	rcu: Make lockdep_rcu_dereference() message less alarmist Change from "unsafe" to "suspicious", given that there will be false alarms. Suggested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1267135607-7056-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-26 08:20:46 +01:00
Masami Hiramatsu	b2be84df99	kprobes: Jump optimization sysctl interface Add /proc/sys/debug/kprobes-optimization sysctl which enables and disables kprobes jump optimization on the fly for debugging. Changes in v7: - Remove ctl_name = CTL_UNNUMBERED for upstream compatibility. Changes in v6: - Update comments and coding style. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Anders Kaseorg <andersk@ksplice.com> Cc: Tim Abbott <tabbott@ksplice.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> LKML-Reference: <20100225133415.6725.8274.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 17:49:25 +01:00
Masami Hiramatsu	afd66255b9	kprobes: Introduce kprobes jump optimization Introduce kprobes jump optimization arch-independent parts. Kprobes uses breakpoint instruction for interrupting execution flow, on some architectures, it can be replaced by a jump instruction and interruption emulation code. This gains kprobs' performance drastically. To enable this feature, set CONFIG_OPTPROBES=y (default y if the arch supports OPTPROBE). Changes in v9: - Fix a bug to optimize probe when enabling. - Check nearby probes can be optimize/unoptimize when disarming/arming kprobes, instead of registering/unregistering. This will help kprobe-tracer because most of probes on it are usually disabled. Changes in v6: - Cleanup coding style for readability. - Add comments around get/put_online_cpus(). Changes in v5: - Use get_online_cpus()/put_online_cpus() for avoiding text_mutex deadlock. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Anders Kaseorg <andersk@ksplice.com> Cc: Tim Abbott <tabbott@ksplice.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> LKML-Reference: <20100225133407.6725.81992.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 17:49:24 +01:00
Masami Hiramatsu	4610ee1d36	kprobes: Introduce generic insn_slot framework Make insn_slot framework support various size slots. Current insn_slot just supports one-size instruction buffer slot. However, kprobes jump optimization needs larger size buffers. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Anders Kaseorg <andersk@ksplice.com> Cc: Tim Abbott <tabbott@ksplice.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> LKML-Reference: <20100225133358.6725.82430.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Anders Kaseorg <andersk@ksplice.com> Cc: Tim Abbott <tabbott@ksplice.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>	2010-02-25 17:49:24 +01:00
Wenji Huang	7b60997f73	tracing: Simplify memory recycle of trace_define_field Discard freeing field->type since it is not necessary. Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Wenji Huang <wenji.huang@oracle.com> LKML-Reference: <1266997226-6833-5-git-send-email-wenji.huang@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:42:55 -05:00
Wenji Huang	c85f3a91f8	tracing: Remove unnecessary variable in print_graph_return The "cpu" variable is declared at the start of the function and also within a branch, with the exact same initialization. Remove the local variable of the same name in the branch. Signed-off-by: Wenji Huang <wenji.huang@oracle.com> LKML-Reference: <1266997226-6833-3-git-send-email-wenji.huang@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:41:24 -05:00
Wenji Huang	a5efd92511	tracing: Fix typo of info text in trace_kprobe.c Signed-off-by: Wenji Huang <wenji.huang@oracle.com> LKML-Reference: <1266997226-6833-2-git-send-email-wenji.huang@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:36:29 -05:00
Wenji Huang	6574658b3b	tracing: Fix typo in prof_sysexit_enable() Signed-off-by: Wenji Huang <wenji.huang@oracle.com> LKML-Reference: <1266997226-6833-1-git-send-email-wenji.huang@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:35:55 -05:00
Li Zefan	1ab83a8941	tracing: Remove CONFIG_TRACE_POWER from kernel config The power tracer has been converted to power trace events. Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B84D50E.4070806@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 10:31:45 -05:00
Jeff Mahoney	86c38a31aa	tracing: Fix ftrace_event_call alignment for use with gcc 4.5 GCC 4.5 introduces behavior that forces the alignment of structures to use the largest possible value. The default value is 32 bytes, so if some structures are defined with a 4-byte alignment and others aren't declared with an alignment constraint at all - it will align at 32-bytes. For things like the ftrace events, this results in a non-standard array. When initializing the ftrace subsystem, we traverse the _ftrace_events section and call the initialization callback for each event. When the structures are misaligned, we could be treating another part of the structure (or the zeroed out space between them) as a function pointer. This patch forces the alignment for all the ftrace_event_call structures to 4 bytes. Without this patch, the kernel fails to boot very early when built with gcc 4.5. It's trivial to check the alignment of the members of the array, so it might be worthwhile to add something to the build system to do that automatically. Unfortunately, that only covers this case. I've asked one of the gcc developers about adding a warning when this condition is seen. Cc: stable@kernel.org Signed-off-by: Jeff Mahoney <jeffm@suse.com> LKML-Reference: <4B85770B.6010901@suse.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-25 09:38:11 -05:00
Don Zickus	47195d5763	nmi_watchdog: Clean up various small details Mostly copy/paste whitespace damage with a couple of nitpicks by the checkpatch script. Fix the struct definition as requested by Ingo too. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: aris@redhat.com LKML-Reference: <1266880143-24943-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> -- arch/x86/kernel/apic/hw_nmi.c \| 14 +++++------ arch/x86/kernel/traps.c \| 6 ++-- include/linux/nmi.h \| 2 - kernel/nmi_watchdog.c \| 51 ++++++++++++++++++++---------------------- 4 files changed, 36 insertions(+), 37 deletions(-)	2010-02-25 12:40:50 +01:00
Ingo Molnar	c50cc75271	sched, cgroups: Fix module export I have exported it in `d11c563` - but cgroups.c did not have module.h included ... Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 12:02:13 +01:00
Paul E. McKenney	1ed509a225	rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information When RCU detects a grace-period stall, it currently just prints out the PID of any tasks doing the stalling. This patch adds RCU_CPU_STALL_VERBOSE, which enables the more-verbose reporting from sched_show_task(). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-21-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:35:02 +01:00
Paul E. McKenney	6155fec92e	rcu: Fix rcutorture mod_timer argument to delay one jiffy The current "mod_timer(&t, 1)" potentially makes the timer fire immediately, change this to wait one jiffy. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-20-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:35:00 +01:00
Paul E. McKenney	3acd9eb31c	rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection Under TREE_PREEMPT_RCU, print_other_cpu_stall() invokes rcu_print_task_stall() with the root rcu_node structure's ->lock held, and rcu_print_task_stall() acquires that same lock for self-deadlock. Fix this by removing the lock acquisition from rcu_print_task_stall(), and making all callers acquire the lock instead. Tested-by: John Kacur <jkacur@redhat.com> Tested-by: Thomas Gleixner <tglx@linutronix.de> Located-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-19-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:59 +01:00
Paul E. McKenney	1304afb225	rcu: Convert to raw_spinlocks The spinlocks in rcutree need to be real spinlocks in preempt-rt. Convert them to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-18-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:58 +01:00
Paul E. McKenney	20133cfce7	rcu: Stop overflowing signed integers The C standard does not specify the result of an operation that overflows a signed integer, so such operations need to be avoided. This patch changes the type of several fields from "long" to "unsigned long" and adjusts operations as needed. ULONG_CMP_GE() and ULONG_CMP_LT() macros are introduced to do the modular comparisons that are appropriate given that overflow is an expected event. Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-17-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:57 +01:00
Paul E. McKenney	8bd93a2c5d	rcu: Accelerate grace period if last non-dynticked CPU Currently, rcu_needs_cpu() simply checks whether the current CPU has an outstanding RCU callback, which means that the last CPU to go into dyntick-idle mode might wait a few ticks for the relevant grace periods to complete. However, if all the other CPUs are in dyntick-idle mode, and if this CPU is in a quiescent state (which it is for RCU-bh and RCU-sched any time that we are considering going into dyntick-idle mode), then the grace period is instantly complete. This patch therefore repeatedly invokes the RCU grace-period machinery in order to force any needed grace periods to complete quickly. It does so a limited number of times in order to prevent starvation by an RCU callback function that might pass itself to call_rcu(). However, if any CPU other than the current one is not in dyntick-idle mode, fall back to simply checking (with fix to bug noted by Lai Jiangshan). Also, take advantage of last grace-period forcing, the opportunity to do so noted by Steve Rostedt. And apply simplified #ifdef condition suggested by Frederic Weisbecker. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-15-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:55 +01:00
Paul E. McKenney	497f0ab39c	sched: Better name for for_each_domain_rd As suggested by Peter Ziljstra, make better choice of name for for_each_domain_rd(), containing "rcu_dereference", given that it is but a wrapper for rcu_dereference_check(). The name rcu_dereference_check_sched_domain() does that and provides a separate per-subsystem name space. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-7-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:47 +01:00
Paul E. McKenney	d11c563dd2	sched: Use lockdep-based checking on rcu_dereference() Update the rcu_dereference() usages to take advantage of the new lockdep-based checking. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com> [ -v2: fix allmodconfig missing symbol export build failure on x86 ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 10:34:26 +01:00
Paul E. McKenney	0632eb3d75	rcu: Integrate rcu_dereference_check() message into lockdep Make rcu_dereference_check() print the list of held locks in addition to the stack dump to ease debugging. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-3-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 09:41:01 +01:00
Paul E. McKenney	632ee20013	rcu: Introduce lockdep-based checking to RCU read-side primitives Inspection is proving insufficient to catch all RCU misuses, which is understandable given that rcu_dereference() might be protected by any of four different flavors of RCU (RCU, RCU-bh, RCU-sched, and SRCU), and might also/instead be protected by any of a number of locking primitives. It is therefore time to enlist the aid of lockdep. This set of patches is inspired by earlier work by Peter Zijlstra and Thomas Gleixner, and takes the following approach: o Set up separate lockdep classes for RCU, RCU-bh, and RCU-sched. o Set up separate lockdep classes for each instance of SRCU. o Create primitives that check for being in an RCU read-side critical section. These return exact answers if lockdep is fully enabled, but if unsure, report being in an RCU read-side critical section. (We want to avoid false positives!) The primitives are: For RCU: rcu_read_lock_held(void) For RCU-bh: rcu_read_lock_bh_held(void) For RCU-sched: rcu_read_lock_sched_held(void) For SRCU: srcu_read_lock_held(struct srcu_struct *sp) o Add rcu_dereference_check(), which takes a second argument in which one places a boolean expression based on the above primitives and/or lockdep_is_held(). o A new kernel configuration parameter, CONFIG_PROVE_RCU, enables rcu_dereference_check(). This depends on CONFIG_PROVE_LOCKING, and should be quite helpful during the transition period while CONFIG_PROVE_RCU-unaware patches are in flight. The existing rcu_dereference() primitive does no checking, but upcoming patches will change that. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 09:40:59 +01:00
Ingo Molnar	996de8c6fe	Merge commit 'v2.6.33' into core/rcu Merge reason: Update from -rc4 to -final. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-25 09:40:26 +01:00
Suresh Siddha	c6a0dd7ec6	ptrace: Fix ptrace_regset() comments and diagnose errors specifically Return -EINVAL for the bad size and for unrecognized NT_* type in ptrace_regset() instead of -EIO. Also update the comments for this ptrace interface with more clarifications. Requested-by: Roland McGrath <roland@redhat.com> Requested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <20100222225240.397523600@sbs-t61.sc.intel.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-23 13:45:26 -08:00
Tetsuo Handa	701188374b	kernel/sys.c: fix missing rcu protection for sys_getpriority() find_task_by_vpid() is not safe without rcu_read_lock(). 2.6.33-rc7 got RCU protection for sys_setpriority() but missed it for sys_getpriority(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-22 19:50:34 -08:00
Rafael J. Wysocki	6cbf82148f	PCI PM: Run-time callbacks for PCI bus type Introduce run-time PM callbacks for the PCI bus type. Make the new callbacks work in analogy with the existing system sleep PM callbacks, so that the drivers already converted to struct dev_pm_ops can use their suspend and resume routines for run-time PM without modifications. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-02-22 16:21:19 -08:00
Yinghai Lu	5eeec0ec93	resource: add release_child_resources Useful for freeing a portion of the resource tree, e.g. when trying to reallocate resources more efficiently. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-02-22 16:17:00 -08:00
Dominik Brodowski	3b7a17fcda	resource/PCI: mark struct resource as const Now that we return the new resource start position, there is no need to update "struct resource" inside the align function. Therefore, mark the struct resource as const. Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-02-22 16:16:57 -08:00
Dominik Brodowski	b26b2d494b	resource/PCI: align functions now return start of resource As suggested by Linus, align functions should return the start of a resource, not void. An update of "res->start" is no longer necessary. Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2010-02-22 16:16:56 -08:00
Linus Torvalds	bee415ce42	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf probe: Init struct probe_point and set counter correctly hw-breakpoint: Keep track of dr7 local enable bits hw-breakpoints: Accept breakpoints on NULL address perf_events: Fix FORK events	2010-02-22 08:55:32 -08:00
Alexey Dobriyan	b54452b07a	const: struct nla_policy Make remaining netlink policies as const. Fixup coding style where needed. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-18 14:30:18 -08:00
Yinghai Lu	b5eb78f76d	sparseirq: Use radix_tree instead of ptrs array Use radix_tree irq_desc_tree instead of irq_desc_ptrs. -v2: according to Eric and cyrill to use radix_tree_lookup_slot and radix_tree_replace_slot Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-32-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-17 17:27:20 -08:00
Yinghai Lu	99558f0bbe	sparseirq: Change irq_desc_ptrs to static Add replace_irq_desc() instead of poking at the array directly. -v2: remove unneeded boundary check in replace_irq_desc Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-31-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-17 17:27:03 -08:00
Yinghai Lu	febcb0c59a	irq: Remove unnecessary bootmem code mem_init is moved early already. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-29-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-17 17:23:59 -08:00
Thomas Gleixner	b7e56edba4	Merge branch 'linus' into x86/mm x86/mm is on 32-rc4 and missing the spinlock namespace changes which are needed for further commits into this topic. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-17 18:28:05 +01:00
Heiko Carstens	f850c30c8b	tracing/kprobes: Make Kconfig dependencies generic KPROBES_EVENT actually depends on the regs and stack access API (`b1cf540f`) and not on x86. So introduce a new config option which architectures can select if they have the API implemented and switch x86. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <20100210162517.GB6933@osiris.boeblingen.de.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-02-17 13:13:08 +01:00
Mike Frysinger	e7b8e675d9	tracing: Unify arch_syscall_addr() implementations Most implementations of arch_syscall_addr() are the same, so create a default version in common code and move the one piece that differs (the syscall table) to asm/syscall.h. New arch ports don't have to waste time copying & pasting this simple function. The s390/sparc versions need to be different, so document why. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1264498803-17278-1-git-send-email-vapier@gentoo.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-02-17 13:07:21 +01:00
Thomas Gleixner	83ab0aa0d5	sched: Don't use possibly stale sched_class setscheduler() saves task->sched_class outside of the rq->lock held region for a check after the setscheduler changes have become effective. That might result in checking a stale value. rtmutex_setprio() has the same problem, though it is protected by p->pi_lock against setscheduler(), but for correctness sake (and to avoid bad examples) it needs to be fixed as well. Retrieve task->sched_class inside of the rq->lock held region. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org	2010-02-17 11:58:18 +01:00
Don Zickus	96ca4028ac	nmi_watchdog: Properly configure for software events Paul Mackerras brought up a good point that when fallbacking to software events, I may have been lucky in my configuration. Modified the code to explicit provide a new configuration for software events. Suggested-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: gorcunov@gmail.com Cc: aris@redhat.com Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1266357745-26671-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-17 08:16:02 +01:00
Don Zickus	6081b6cd97	nmi_watchdog: support for oprofile Re-arrange the code so that when someone disables nmi_watchdog with: echo 0 > /proc/sys/kernel/nmi_watchdog it releases the hardware reservation on the PMUs. This allows the oprofile module to grab those PMUs and do its thing. Otherwise oprofile fails to load because the hardware is reserved by the perf_events subsystem. Tested using: oprofile --vm-linux --start and watched it failed when nmi_watchdog is enabled and succeed when: oprofile --deinit && echo 0 > /proc/sys/kernel/nmi_watchdog is run. Note: this has the side quirk of having the nmi_watchdog latch onto the software events instead of hardware events if oprofile has already reserved the hardware first. User beware! :-) Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: gorcunov@gmail.com Cc: aris@redhat.com Cc: eranian@google.com LKML-Reference: <1266357892-30504-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-17 08:16:02 +01:00
Yinghai Lu	580e0ad21d	core: Move early_res from arch/x86 to kernel/ This makes the range reservation feature available to other architectures. -v2: add get_max_mapped, max_pfn_mapped only defined in x86... to fix PPC compiling -v3: according to hpa, add CONFIG_HAVE_EARLY_RES -v4: fix typo about EARLY_RES in config Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4B7B5723.4070009@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-16 21:43:39 -08:00
Tejun Heo	43cf38eb5c	percpu: add __percpu sparse annotations to core kernel subsystems Add __percpu sparse annotations to core subsystems. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-mm@kvack.org Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Biederman <ebiederm@xmission.com>	2010-02-17 11:17:38 +09:00
Anton Vorontsov	5a5e0f4c70	kfifo: Don't use integer as NULL pointer This patch fixes following sparse warnings: include/linux/kfifo.h:127:25: warning: Using plain integer as NULL pointer kernel/kfifo.c:83:21: warning: Using plain integer as NULL pointer Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Acked-by: Stefani Seibold <stefani@seibold.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-02-16 15:11:08 -08:00
Anton Vorontsov	1a02d59aba	kfifo: Make kfifo_initialized work after kfifo_free After kfifo rework it's no longer possible to reliably know if kfifo is usable, since after kfifo_free(), kfifo_initialized() would still return true. The correct behaviour is needed for at least FHCI USB driver. This patch fixes the issue by resetting the kfifo to zero values (the same approach is used in kfifo_alloc() if allocation failed). Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Acked-by: Stefani Seibold <stefani@seibold.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-02-16 15:11:06 -08:00
Thomas Gleixner	6e40f5bbbc	Merge branch 'sched/urgent' into sched/core Conflicts: kernel/sched.c Necessary due to the urgent fixes which conflict with the code move from sched.c to sched_fair.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-16 16:48:56 +01:00
Peter Zijlstra	0970d2992d	sched: Fix race between ttwu() and task_rq_lock() Thomas found that due to ttwu() changing a task's cpu without holding the rq->lock, task_rq_lock() might end up locking the wrong rq. Avoid this by serializing against TASK_WAKING. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1266241712.15770.420.camel@laptop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-16 15:13:59 +01:00
Suresh Siddha	9000f05c6d	sched: Fix SMT scheduler regression in find_busiest_queue() Fix a SMT scheduler performance regression that is leading to a scenario where SMT threads in one core are completely idle while both the SMT threads in another core (on the same socket) are busy. This is caused by this commit (with the problematic code highlighted) commit `bdb94aa5db` Author: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Tue Sep 1 10:34:38 2009 +0200 sched: Try to deal with low capacity @@ -4203,15 +4223,18 @@ find_busiest_queue() ... for_each_cpu(i, sched_group_cpus(group)) { + unsigned long power = power_of(i); ... - wl = weighted_cpuload(i); + wl = weighted_cpuload(i) * SCHED_LOAD_SCALE; + wl /= power; - if (rq->nr_running == 1 && wl > imbalance) + if (capacity && rq->nr_running == 1 && wl > imbalance) continue; On a SMT system, power of the HT logical cpu will be 589 and the scheduler load imbalance (for scenarios like the one mentioned above) can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling the weighted load with the power will result in "wl > imbalance" and ultimately resulting in find_busiest_queue() return NULL, causing load_balance() to think that the load is well balanced. But infact one of the tasks can be moved to the idle core for optimal performance. We don't need to use the weighted load (wl) scaled by the cpu power to compare with imabalance. In that condition, we already know there is only a single task "rq->nr_running == 1" and the comparison between imbalance, wl is to make sure that we select the correct priority thread which matches imbalance. So we really need to compare the imabalnce with the original weighted load of the cpu and not the scaled load. But in other conditions where we want the most hammered(busiest) cpu, we can use scaled load to ensure that we consider the cpu power in addition to the actual load on that cpu, so that we can move the load away from the guy that is getting most hammered with respect to the actual capacity, as compared with the rest of the cpu's in that busiest group. Fix it. Reported-by: Ma Ling <ling.ma@intel.com> Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com> Cc: stable@kernel.org [2.6.32.x] Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-16 15:13:59 +01:00
Linus Torvalds	7d0bab9dfe	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer, softirq: Fix hrtimer->softirq trampoline	2010-02-15 19:52:12 -08:00
Linus Torvalds	627a9a194d	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/kprobes: Fix probe parsing tracing: Fix circular dead lock in stack trace	2010-02-15 19:47:59 -08:00
Linus Torvalds	3d8b4bdef7	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf top: Fix help text alignment perf: Fix hypervisor sample reporting perf: Make bp_len type to u64 generic across the arch	2010-02-15 19:47:48 -08:00
Uwe Kleine-König	dfff0615d2	tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-15 15:38:10 +01:00
Masami Hiramatsu	8b833c506c	kprobes: Add mcount to the kprobes blacklist Since mcount function can be called from everywhere, it should be blacklisted. Moreover, the "mcount" symbol is a special symbol name. So, it is better to put it in the generic blacklist. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100205062433.3745.36726.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-15 05:45:49 +01:00
Peter Zijlstra	6f93d0a7c8	perf_events: Fix FORK events Commit `22e19085` ("Honour event state for aux stream data") introduced a bug where we would drop FORK events. The thing is that we deliver FORK events to the child process' event, which at that time will be PERF_EVENT_STATE_INACTIVE because the child won't be scheduled in (we're in the middle of fork). Solve this twice, change the event state filter to exclude only disabled (STATE_OFF) or worse, and deliver FORK events to the current (parent). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Anton Blanchard <anton@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> LKML-Reference: <1266142324.5273.411.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-14 18:10:39 +01:00
Heiko Carstens	a9bb18f36c	tracing/kprobes: Fix probe parsing Trying to add a probe like: echo p:myprobe 0x10000 > /sys/kernel/debug/tracing/kprobe_events will fail since the wrong pointer is passed to strict_strtoul when trying to convert the address to an unsigned long. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100210162346.GA6933@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-14 09:43:58 +01:00
Don Zickus	cf454aecb3	nmi_watchdog: Fallback to software events when no hardware pmu detected Not all arches have a PMU or have perf_event support for their PMU. The nmi_watchdog will fail in those cases. Fallback to using software events to generate nmi_watchdog traffic with local apic interrupts. Tested on a Pentium4 and it worked as expected, excepting for detecting cpu lockups. The problem with using software events as a cpu lock up detector is the nmi_watchdog uses the logic that if local apic interrupts stop incrementing then the cpu is probably locked up. But with software events we use the local apic to trigger the nmi_watchdog callback to see if local apic interrupts are still firing, which obviously they are otherwise we wouldn't have been triggered. The algorithm to detect cpu lock ups is the same as the old nmi_watchdog. Perhaps we need to find a better way to detect lock ups? Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: aris@redhat.com LKML-Reference: <1266013161-31197-3-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-14 09:19:44 +01:00
Don Zickus	504d7cf10e	nmi_watchdog: Compile and portability fixes The original patch was x86_64 centric. Changed the code to make it less so. ested by building and running on a powerpc. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: aris@redhat.com LKML-Reference: <1266013161-31197-2-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-14 09:19:43 +01:00
Suresh Siddha	2225a122ae	ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET Generic support for PTRACE_GETREGSET/PTRACE_SETREGSET commands which export the regsets supported by each architecture using the correponding NT_* types. These NT_* types are already part of the userland ABI, used in representing the architecture specific register sets as different NOTES in an ELF core file. 'addr' parameter for the ptrace system call encode the REGSET type (using the corresppnding NT_* type) and the 'data' parameter points to the struct iovec having the user buffer and the length of that buffer. struct iovec iov = { buf, len}; ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, &iov); On successful completion, iov.len will be updated by the kernel specifying how much the kernel has written/read to/from the user's iov.buf. x86 extended state registers are primarily exported using this interface. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <20100211195614.886724710@sbs-t61.sc.intel.com> Acked-by: Hongjiu Lu <hjl.tools@gmail.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-11 15:08:33 -08:00
Li Zefan	c7c6b1fe9f	ftrace: Allow to remove a single function from function graph filter I don't see why we can only clear all functions from the filter. After patching: # echo sys_open > set_graph_function # echo sys_close >> set_graph_function # cat set_graph_function sys_open sys_close # echo '!sys_close' >> set_graph_function # cat set_graph_function sys_open Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B726388.2000408@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-11 14:32:38 -05:00
Jean Delvare	5c42dc7070	devres/irq: Fix devm_irq_match comment Fix the reference (in comment). Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-11 16:01:02 +01:00
Yinghai Lu	e9a0064ad0	x86: Change range end to start+size So make interface more consistent with early_res. Later we can share some code with early_res. Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-10-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-10 17:47:17 -08:00
Yinghai Lu	27811d8cab	x86: Move range related operation to one file We have almost the same code for mtrr cleanup and amd_bus checkup, and this code will also be used in replacing bootmem with early_res, so try to move them together and reuse it from different parts. Also rename update_range to subtract_range as that is what the function is actually doing. -v2: update comments as Christoph requested Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-4-git-send-email-yinghai@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-10 17:47:17 -08:00
H. Peter Anvin	84abd88a70	Merge remote branch 'linus/master' into x86/bootmem	2010-02-10 16:55:28 -08:00
Brandon Phiilps	ced5b697a7	x86: Avoid race condition in pci_enable_msix() Keep chip_data in create_irq_nr and destroy_irq. When two drivers are setting up MSI-X at the same time via pci_enable_msix() there is a race. See this dmesg excerpt: [ 85.170610] ixgbe 0000:02:00.1: irq 97 for MSI/MSI-X [ 85.170611] alloc irq_desc for 99 on node -1 [ 85.170613] igb 0000:08:00.1: irq 98 for MSI/MSI-X [ 85.170614] alloc kstat_irqs on node -1 [ 85.170616] alloc irq_2_iommu on node -1 [ 85.170617] alloc irq_desc for 100 on node -1 [ 85.170619] alloc kstat_irqs on node -1 [ 85.170621] alloc irq_2_iommu on node -1 [ 85.170625] ixgbe 0000:02:00.1: irq 99 for MSI/MSI-X [ 85.170626] alloc irq_desc for 101 on node -1 [ 85.170628] igb 0000:08:00.1: irq 100 for MSI/MSI-X [ 85.170630] alloc kstat_irqs on node -1 [ 85.170631] alloc irq_2_iommu on node -1 [ 85.170635] alloc irq_desc for 102 on node -1 [ 85.170636] alloc kstat_irqs on node -1 [ 85.170639] alloc irq_2_iommu on node -1 [ 85.170646] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088 As you can see igb and ixgbe are both alternating on create_irq_nr() via pci_enable_msix() in their probe function. ixgbe: While looping through irq_desc_ptrs[] via create_irq_nr() ixgbe choses irq_desc_ptrs[102] and exits the loop, drops vector_lock and calls dynamic_irq_init. Then it sets irq_desc_ptrs[102]->chip_data = NULL via dynamic_irq_init(). igb: Grabs the vector_lock now and starts looping over irq_desc_ptrs[] via create_irq_nr(). It gets to irq_desc_ptrs[102] and does this: cfg_new = irq_desc_ptrs[102]->chip_data; if (cfg_new->vector != 0) continue; This hits the NULL deref. Another possible race exists via pci_disable_msix() in a driver or in the number of error paths that call free_msi_irqs(): destroy_irq() dynamic_irq_cleanup() which sets desc->chip_data = NULL ...race window... desc->chip_data = cfg; Remove the save and restore code for cfg in create_irq_nr() and destroy_irq() and take the desc->lock when checking the irq_cfg. Reported-and-analyzed-by: Brandon Philips <bphilips@suse.de> Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <1265793639-15071-3-git-send-email-yinghai@kernel.org> Signed-off-by: Brandon Phililps <bphilips@suse.de> Cc: stable@kernel.org Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-10 14:27:28 -08:00
Steven Rostedt	ede55c9d78	tracing: Add correct/incorrect to sort keys for branch annotation output The branch annotation is a bit difficult to see the worst offenders because it only sorts by percentage: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 0 163 100 qdisc_restart sch_generic.c 179 0 163 100 pfifo_fast_dequeue sch_generic.c 447 0 4 100 pskb_trim_rcsum skbuff.h 1689 0 4 100 llc_rcv llc_input.c 170 0 18 100 psmouse_interrupt psmouse-base.c 304 0 3 100 atkbd_interrupt atkbd.c 389 0 5 100 usb_alloc_dev usb.c 437 0 11 100 vsscanf vsprintf.c 1897 0 2 100 IS_ERR err.h 34 0 23 100 __rmqueue_fallback page_alloc.c 865 0 4 100 probe_wakeup_sched_switch trace_sched_wakeup.c 142 0 3 100 move_masked_irq migration.c 11 Adding the incorrect and correct values as sort keys makes this file a bit more informative: correct incorrect % Function File Line ------- --------- - -------- ---- ---- 0 366541 100 audit_syscall_entry auditsc.c 1637 0 366538 100 audit_syscall_exit auditsc.c 1685 0 115839 100 sched_info_switch sched_stats.h 269 0 74567 100 sched_info_queued sched_stats.h 222 0 66578 100 sched_info_dequeued sched_stats.h 177 0 15113 100 trace_workqueue_insertion workqueue.h 38 0 15107 100 trace_workqueue_execution workqueue.h 45 0 3622 100 syscall_trace_leave ptrace.c 1772 0 2750 100 sched_move_task sched.c 10100 0 2750 100 sched_move_task sched.c 10110 0 1815 100 pre_schedule_rt sched_rt.c 1462 0 837 100 audit_alloc auditsc.c 879 0 814 100 tcp_mss_split_point tcp_output.c 1302 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-09 21:35:05 -05:00
Jason Wang	c93d89f3db	Export the symbol of getboottime and mmonotonic_to_bootbased Export getboottime and monotonic_to_bootbased in order to let them could be used by following patch. Cc: stable@kernel.org Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2010-02-09 19:20:15 +02:00
Anton Blanchard	301ba0457f	kthread, sched: Remove reference to kthread_create_on_cpu kthread_create_on_cpu doesn't exist so update a comment in kthread.c to reflect this. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100209040740.GB3702@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-09 11:47:39 +01:00
Anton Blanchard	b80109e256	Remove reference to kthread_create_on_cpu kthread_create_on_cpu doesn't exist so update a comment in kthread.c to reflect this. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-09 11:28:48 +01:00
Al Viro	cccc6bba3f	Lose the first argument of audit_inode_child() it's always equal to ->d_name.name of the second argument Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-02-08 14:38:36 -05:00
Anton Blanchard	fa535a77bd	sched: cpuacct: Use bigger percpu counter batch values for stats counters When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can call cpuacct_update_stats with values much larger than percpu_counter_batch. This means the call to percpu_counter_add will always add to the global count which is protected by a spinlock and we end up with a global spinlock in the scheduler. Based on an idea by KOSAKI Motohiro, this patch scales the batch value by cputime_one_jiffy such that we have the same batch limit as we would if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot but that initialisation happened too early on PowerPC (before time_init) and it was never updated at runtime as a result of a hotplug cpu add/remove. This patch instead scales percpu_counter_batch by cputime_one_jiffy at runtime, which keeps the batch correct even after cpu hotplug operations. We cap it at INT_MAX in case of overflow. For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING, cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at least on x86 and PowerPC. So there is no need to add an #ifdef. On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel: CONFIG_CGROUP_CPUACCT disabled: 16906698 ctx switches/sec CONFIG_CGROUP_CPUACCT enabled: 61720 ctx switches/sec CONFIG_CGROUP_CPUACCT + patch: 16663217 ctx switches/sec Tested with: wget http://ozlabs.org/~anton/junkcode/context_switch.c make context_switch for i in `seq 0 63`; do taskset -c $i ./context_switch & done vmstat 1 Signed-off-by: Anton Blanchard <anton@samba.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:57:37 +01:00
Ingo Molnar	6d3e0907b8	Merge branch 'sched/urgent' into sched/core Merge reason: Merge dependent fix, update to latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:55:46 +01:00
Andrew Morton	50200df462	kernel/sched.c: Suppress unused var warning On UP: kernel/sched.c: In function 'wake_up_new_task': kernel/sched.c:2631: warning: unused variable 'cpu' Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:53:19 +01:00
Don Zickus	84e478c6f1	nmi_watchdog: Config option to enable new nmi_watchdog These are the bits that enable the new nmi_watchdog and safely isolate the old nmi_watchdog. Only one or the other can run, not both at the same time. Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: gorcunov@gmail.com Cc: aris@redhat.com Cc: peterz@infradead.org LKML-Reference: <1265424425-31562-4-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:29:03 +01:00
Don Zickus	1fb9d6ad27	nmi_watchdog: Add new, generic implementation, using perf events This is a new generic nmi_watchdog implementation using the perf events infrastructure as suggested by Ingo. The implementation is simple, just create an in-kernel perf event and register an overflow handler to check for cpu lockups. I created a generic implementation that lives in kernel/ and the hardware specific part that for now lives in arch/x86. This approach has a number of advantages: - It simplifies the x86 PMU implementation in the long run, in that it removes the hardcoded low-level PMU implementation that was the NMI watchdog before. - It allows new NMI watchdog features to be added in a central place. - It allows other architectures to enable the NMI watchdog, as long as they have perf events (that provide NMIs) implemented. - It also allows for more graceful co-existence of existing perf events apps and the NMI watchdog - before these changes the relationship was exclusive. (The NMI watchdog will 'spend' a perf event when enabled. In later iterations we might be able to piggyback from an existing NMI event without having to allocate a hardware event for the NMI watchdog - turning this into a no-hardware-cost feature.) As for compatibility, we'll keep the old NMI watchdog code as well until the new one can 100% replace it on all CPUs, old and new alike. That might take some time as the NMI watchdog has been ported to many CPU models. I have done light testing to make sure the framework works correctly and it does. v2: Set the correct timeout values based on the old nmi watchdog Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: gorcunov@gmail.com Cc: aris@redhat.com Cc: peterz@infradead.org LKML-Reference: <1265424425-31562-3-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-08 08:29:02 +01:00
H Hartley Sweeten	6622e670b2	posix-timers.c: Don't export local functions Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-05 14:54:10 +01:00
Magnus Damm	c54a42b19f	clocksource: add suspend callback Add a clocksource suspend callback. This callback can be used by the clocksource driver to shutdown and perform any kind of late suspend activities even though the clocksource driver itself is a non-sysdev driver. One example where this is useful is to fix the sh_cmt.c platform driver that today suspends using the platform bus and shuts down the clocksource too early. With this callback in place the sh_cmt driver will suspend using the clocksource and clockevent hooks and leave the platform device pm callbacks unused. Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-05 14:54:10 +01:00
Magnus Damm	17622339af	clocksource: add argument to resume callback Pass the clocksource as an argument to the clocksource resume callback. Needed so we can point out which CMT channel the sh_cmt.c driver shall resume. Signed-off-by: Magnus Damm <damm@opensource.se> Cc: john stultz <johnstul@us.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-05 14:54:10 +01:00
Daniel Mack	1537a3638c	tree-wide: fix 'lenght' typo in comments and code Some misspelled occurences of 'octet' and some comments were also fixed as I was on it. Signed-off-by: Daniel Mack <daniel@caiaq.de> Cc: Jiri Kosina <trivial@kernel.org> Cc: Joe Perches <joe@perches.com> Cc: Junio C Hamano <gitster@pobox.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:45 +01:00
Baruch Siach	9ce8e498ee	devres: typo fix s/dev/devm/ Signed-off-by: Baruch Siach <baruch@tkos.co.il> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:43 +01:00
Edward Z. Yang	350f82586b	Remove redundant trailing semicolons from macros Signed-off-by: Edward Z. Yang <ezyang@ksplice.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:43 +01:00
Thadeu Lima de Souza Cascardo	af66585270	fix comment typo boo -> boot in ksysfs.c Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:37 +01:00
Adam Buchbinder	c9404c9c39	Fix misspelling of "should" and "shouldn't" in comments. Some comments misspell "should" or "shouldn't"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-05 12:22:30 +01:00
Masami Hiramatsu	5ecaafdbf4	kprobes: Add mcount to the kprobes blacklist Since mcount function can be called from everywhere, it should be blacklisted. Moreover, the "mcount" symbol is a special symbol name. So, it is better to put it in the generic blacklist. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100205062433.3745.36726.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-05 08:13:57 +01:00
Linus Torvalds	aa16cd8d12	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Handle futex value corruption gracefully futex: Handle user space corruption gracefully futex_lock_pi() key refcnt fix softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume	2010-02-04 16:07:41 -08:00
Adam Buchbinder	2a61aa4016	Fix misspellings of "invocation" in comments. Some comments misspell "invocation"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-04 11:55:45 +01:00
Adam Buchbinder	c41b20e721	Fix misspellings of "truly" in comments. Some comments misspell "truly"; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-04 11:55:45 +01:00
Peter Zijlstra	9717e6cd3d	perf_events: Optimize perf_event_task_tick() Pretty much all of the calls do perf_disable/perf_enable cycles, pull that out to cut back on hardware programming. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:59:49 +01:00
Yong Zhang	2357725695	sched: Remove member rt_se from struct rt_rq It's a duplicate of tg->rt_se[cpu] and the only usage is sched_rt_rq_dequeue() and sched_rt_rq_enqueue(). After the first patch to those two function. rt_se can be removed. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <2674af741001282258q38781619u653ca4a7dd267347@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:57:33 +01:00
Yong Zhang	74b7eb5885	sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu] This is the first step to remove rt_rq member rt_se because it have the same meaning with tg->rt_se[cpu]. And the latter style is also used by the fair scheduling class. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <2674af741001282257r28c97a92o9f90cf16fe8d3d84@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:57:32 +01:00
Masami Hiramatsu	f24bb999d2	ftrace: Remove record freezing Remove record freezing. Because kprobes never puts probe on ftrace's mcount call anymore, it doesn't need ftrace to check whether kprobes on it. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: przemyslaw@pawelczyk.it Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100202214925.4694.73469.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:36:19 +01:00
Masami Hiramatsu	4554dbcb85	kprobes: Check probe address is reserved Check whether the address of new probe is already reserved by ftrace or alternatives (on x86) when registering new probe. If reserved, it returns an error and not register the probe. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: przemyslaw@pawelczyk.it Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <20100202214918.4694.94179.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:36:19 +01:00
Masami Hiramatsu	2cfa19780d	ftrace/alternatives: Introducing _text_reserved functions Introducing _text_reserved functions for checking the text address range is partially reserved or not. This patch provides checking routines for x86 smp alternatives and dynamic ftrace. Since both functions modify fixed pieces of kernel text, they should reserve and protect those from other dynamic text modifier, like kprobes. This will also be extended when introducing other subsystems which modify fixed pieces of kernel text. Dynamic text modifiers should avoid those. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: przemyslaw@pawelczyk.it Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <20100202214911.4694.16587.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:36:19 +01:00
Masami Hiramatsu	615d0ebbc7	kprobes: Disable booster when CONFIG_PREEMPT=y Disable kprobe booster when CONFIG_PREEMPT=y at this time, because it can't ensure that all kernel threads preempted on kprobe's boosted slot run out from the slot even using freeze_processes(). The booster on preemptive kernel will be resumed if synchronize_tasks() or something like that is introduced. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100202214904.4694.24330.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-04 09:36:18 +01:00
Kees Cook	d78ca3cd73	syslog: use defined constants instead of raw numbers Right now the syslog "type" action are just raw numbers which makes the source difficult to follow. This patch replaces the raw numbers with defined constants for some level of sanity. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-02-04 14:20:41 +11:00
Kees Cook	002345925e	syslog: distinguish between /proc/kmsg and syscalls This allows the LSM to distinguish between syslog functions originating from /proc/kmsg access and direct syscalls. By default, the commoncaps will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg file descriptor. For example the kernel syslog reader can now drop privileges after opening /proc/kmsg, instead of staying privileged with CAP_SYS_ADMIN. MAC systems that implement security_syslog have unchanged behavior. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>	2010-02-04 14:20:12 +11:00
Mahesh Salgaonkar	cd757645fb	perf: Make bp_len type to u64 generic across the arch Change 'bp_len' type to __u64 to make it work across archs as the s390 architecture watch point length can be upto 2^64. reference: http://lkml.org/lkml/2010/1/25/212 This is an ABI change that is not backward compatible with the previous hardware breakpoint info layout integrated in this development cycle, a rebuilt of perf tools is necessary for versions based on 2.6.33-rc1 - 2.6.33-rc6 to work with a kernel based on this patch. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "K. Prasad" <prasad@linux.vnet.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin <schwidefsky@de.ibm.com> LKML-Reference: <20100130045518.GA20776@in.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-02-04 01:07:12 +01:00
Peter Zijlstra	b9c3032277	hrtimer, softirq: Fix hrtimer->softirq trampoline hrtimers callbacks are always done from hardirq context, either the jiffy tick interrupt or the hrtimer device interrupt. [ there is currently one exception that can still call a hrtimer callback from softirq, but even in that case this will still work correctly. ] Reported-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Yury Polyanskiy <ypolyans@princeton.edu> Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: David S. Miller <davem@davemloft.net> LKML-Reference: <1265120401.24455.306.camel@laptop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-02-03 18:17:40 +01:00
Thomas Gleixner	59647b6ac3	futex: Handle futex value corruption gracefully The WARN_ON in lookup_pi_state which complains about a mismatch between pi_state->owner->pid and the pid which we retrieved from the user space futex is completely bogus. The code just emits the warning and then continues despite the fact that it detected an inconsistent state of the futex. A conveniant way for user space to spam the syslog. Replace the WARN_ON by a consistency check. If the values do not match return -EINVAL and let user space deal with the mess it created. This also fixes the missing task_pid_vnr() when we compare the pi_state->owner pid with the futex value. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org>	2010-02-03 15:13:22 +01:00
Thomas Gleixner	51246bfd18	futex: Handle user space corruption gracefully If the owner of a PI futex dies we fix up the pi_state and set pi_state->owner to NULL. When a malicious or just sloppy programmed user space application sets the futex value to 0 e.g. by calling pthread_mutex_init(), then the futex can be acquired again. A new waiter manages to enqueue itself on the pi_state w/o damage, but on unlock the kernel dereferences pi_state->owner and oopses. Prevent this by checking pi_state->owner in the unlock path. If pi_state->owner is not current we know that user space manipulated the futex value. Ignore the mess and return -EINVAL. This catches the above case and also the case where a task hijacks the futex by setting the tid value and then tries to unlock it. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org>	2010-02-03 15:13:22 +01:00
Mikael Pettersson	5ecb01cfdf	futex_lock_pi() key refcnt fix This fixes a futex key reference count bug in futex_lock_pi(), where a key's reference count is incremented twice but decremented only once, causing the backing object to not be released. If the futex is created in a temporary file in an ext3 file system, this bug causes the file's inode to become an "undead" orphan, which causes an oops from a BUG_ON() in ext3_put_super() when the file system is unmounted. glibc's test suite is known to trigger this, see <http://bugzilla.kernel.org/show_bug.cgi?id=14256>. The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's `38d47c1b70` "[PATCH] futex: rely on get_user_pages() for shared futexes". That commit made get_futex_key() also increment the reference count of the futex key, and updated its callers to decrement the key's reference count before returning. Unfortunately the normal exit path in futex_lock_pi() wasn't corrected: the reference count is incremented by get_futex_key() and queue_lock(), but the normal exit path only decrements once, via unqueue_me_pi(). The fix is to put_futex_key() after unqueue_me_pi(), since 2.6.31 this is easily done by 'goto out_put_key' rather than 'goto out'. Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org>	2010-02-03 15:13:22 +01:00
Linus Torvalds	c80d292f13	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: kernel/cred.c: use kmem_cache_free	2010-02-02 18:12:22 -08:00
Li Zefan	4528fd0595	cgroups: fix to return errno in a failure path In cgroup_create(), if alloc_css_id() returns failure, the errno is not propagated to userspace, so mkdir will fail silently. To trigger this bug, we mount blkio (or memory subsystem), and create more then 65534 cgroups. (The number of cgroups is limited to 65535 if a subsystem has use_id == 1) # mount -t cgroup -o blkio xxx /mnt # for ((i = 0; i < 65534; i++)); do mkdir /mnt/$i; done # mkdir /mnt/65534 (should return ENOSPC) # Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Paul Menage <menage@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-02 18:11:22 -08:00
Randy Dunlap	bc173f7092	kfifo: fix kernel-doc notation Fix kfifo kernel-doc warnings: Warning(kernel/kfifo.c:361): No description found for parameter 'total' Warning(kernel/kfifo.c:402): bad line: @ @lenout: pointer to output variable with copied data Warning(kernel/kfifo.c:412): No description found for parameter 'lenout' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-02-02 18:11:21 -08:00
Julia Lawall	b8a1d37c5f	kernel/cred.c: use kmem_cache_free Free memory allocated using kmem_cache_zalloc using kmem_cache_free rather than kfree. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x,E,c; @@ x = $kmem_cache_alloc\\|kmem_cache_zalloc\\|kmem_cache_alloc_node$(c,...) ... when != x = E when != &x ?-kfree(x) +kmem_cache_free(c,x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Steve Dickson <steved@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org>	2010-02-03 10:21:57 +11:00
Lai Jiangshan	4f48f8b7fd	tracing: Fix circular dead lock in stack trace When we cat <debugfs>/tracing/stack_trace, we may cause circular lock: sys_read() t_start() arch_spin_lock(&max_stack_lock); t_show() seq_printf(), vsnprintf() .... /* they are all trace-able, when they are traced, max_stack_lock may be required again. */ The following script can trigger this circular dead lock very easy: #!/bin/bash echo 1 > /proc/sys/kernel/stack_tracer_enabled mount -t debugfs xxx /mnt > /dev/null 2>&1 ( # make check_stack() zealous to require max_stack_lock for ((; ;)) { echo 1 > /mnt/tracing/stack_max_size } ) & for ((; ;)) { cat /mnt/tracing/stack_trace > /dev/null } To fix this bug, we increase the percpu trace_active before require the lock. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B67D4F9.9080905@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-02-02 10:20:18 -05:00
Peter Zijlstra	4a461c85b6	sched: Remove unused update_shares_locked() Commit `f492e12ef0` ("sched: Remove load_balance_newidle()") removed the only user of this function, so remove it too. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1265019219.24455.128.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-02 06:58:27 +01:00
Akinobu Mita	90fdbdb484	sched: Use for_each_bit No change in functionality. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1264938810-4173-1-git-send-email-akinobu.mita@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-02 06:58:27 +01:00
Tejun Heo	ab386128f2	Merge branch 'master' into percpu	2010-02-02 14:38:15 +09:00
Andrew Morton	e527300715	Generic page_is_ram: use __weak Use __weak instead of __attribute__((weak)). Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-01 16:58:17 -08:00
Wu Fengguang	61ef2489db	resources: introduce generic page_is_ram() It's based on walk_system_ram_range(), for archs that don't have their own page_is_ram(). The static verions in MIPS and SCORE are also made global. v4: prefer plain 1 instead of PAGE_IS_RAM (H. Peter Anvin) v3: add comment (KAMEZAWA Hiroyuki) "AFAIK, this "System RAM" information has been used for kdump to grab valid memory area and seems good for the kernel itself." v2: add PAGE_IS_RAM macro (Américo Wang) Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Américo Wang <xiyou.wangcong@gmail.com> Cc: linux-mips@linux-mips.org Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> LKML-Reference: <20100122081619.GA6431@localhost> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2010-02-01 16:58:17 -08:00
Linus Torvalds	e20da89130	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Fix check_usage_backwards() error message	2010-02-01 10:45:26 -08:00
Linus Torvalds	834db333ed	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint API hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails. perf: Ignore perf.data.old perf report: Fix segmentation fault when running with '-g none'	2010-02-01 10:45:00 -08:00
Linus Torvalds	8ea85c2817	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Correct printk whitespace in warning from cpu down task check sched: Fix incorrect sanity check sched: Fix fork vs hotplug vs cpuset namespaces	2010-02-01 10:44:36 -08:00
Linus Torvalds	bdd8466783	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Prevent potential kgdb dead lock	2010-02-01 10:44:06 -08:00
Jason Wessel	d6ad3e286d	softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, sched_clock() gets the time from hardware such as the TSC on x86. In this configuration kgdb will report a softlock warning message on resuming or detaching from a debug session. Sequence of events in the problem case: 1) "cpu sched clock" and "hardware time" are at 100 sec prior to a call to kgdb_handle_exception() 2) Debugger waits in kgdb_handle_exception() for 80 sec and on exit the following is called ... touch_softlockup_watchdog() --> __raw_get_cpu_var(touch_timestamp) = 0; 3) "cpu sched clock" = 100s (it was not updated, because the interrupt was disabled in kgdb) but the "hardware time" = 180 sec 4) The first timer interrupt after resuming from kgdb_handle_exception updates the watchdog from the "cpu sched clock" update_process_times() { ... run_local_timers() --> softlockup_tick() --> check (touch_timestamp == 0) (it is "YES" here, we have set "touch_timestamp = 0" at kgdb) --> __touch_softlockup_watchdog() *(A)--> reset "touch_timestamp" to "get_timestamp()" (Here, the "touch_timestamp" will still be set to 100s.) ... scheduler_tick() (B)--> sched_clock_tick() (update "cpu sched clock" to "hardware time" = 180s) ... } 5) The Second timer interrupt handler appears to have a large jump and trips the softlockup warning. update_process_times() { ... run_local_timers() --> softlockup_tick() --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s --> printk "soft lockup error messages" ... } note: **(A) reset "touch_timestamp" to "get_timestamp(this_cpu)" Why is "touch_timestamp" 100 sec, instead of 180 sec? When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, the call trace of get_timestamp() is: get_timestamp(this_cpu) -->cpu_clock(this_cpu) -->sched_clock_cpu(this_cpu) -->__update_sched_clock(sched_clock_data, now) The __update_sched_clock() function uses the GTOD tick value to create a window to normalize the "now" values. So if "now" value is too big for sched_clock_data, it will be ignored. The fix is to invoke sched_clock_tick() to update "cpu sched clock" in order to recover from this state. This is done by introducing the function touch_softlockup_watchdog_sync(). This allows kgdb to request that the sched clock is updated when the watchdog thread runs the first time after a resume from kgdb. [yong.zhang0@gmail.com: Use per cpu instead of an array] Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: peterz@infradead.org LKML-Reference: <1264631124-4837-2-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-02-01 08:22:32 +01:00
Jason Wessel	5352ae638e	perf, hw_breakpoint, kgdb: Do not take mutex for kernel debugger This patch fixes the regression in functionality where the kernel debugger and the perf API do not nicely share hw breakpoint reservations. The kernel debugger cannot use any mutex_lock() calls because it can start the kernel running from an invalid context. A mutex free version of the reservation API needed to get created for the kernel debugger to safely update hw breakpoint reservations. The possibility for a breakpoint reservation to be concurrently processed at the time that kgdb interrupts the system is improbable. Should this corner case occur the end user is warned, and the kernel debugger will prohibit updating the hardware breakpoint reservations. Any time the kernel debugger reserves a hardware breakpoint it will be a system wide reservation. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: torvalds@linux-foundation.org LKML-Reference: <1264719883-7285-3-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-30 08:42:21 +01:00
Jason Wessel	cc0967490c	x86, hw_breakpoints, kgdb: Fix kgdb to use hw_breakpoint API In the 2.6.33 kernel, the hw_breakpoint API is now used for the performance event counters. The hw_breakpoint_handler() now consumes the hw breakpoints that were previously set by kgdb arch specific code. In order for kgdb to work in conjunction with this core API change, kgdb must use some of the low level functions of the hw_breakpoint API to install, uninstall, and deal with hw breakpoint reservations. The kgdb core required a change to call kgdb_disable_hw_debug anytime a slave cpu enters kgdb_wait() in order to keep all the hw breakpoints in sync as well as to prevent hitting a hw breakpoint while kgdb is active. During the architecture specific initialization of kgdb, it will pre-allocate 4 disabled (struct perf event **) structures. Kgdb will use these to manage the capabilities for the 4 hw breakpoint registers, per cpu. Right now the hw_breakpoint API does not have a way to ask how many breakpoints are available, on each CPU so it is possible that the install of a breakpoint might fail when kgdb restores the system to the run state. The intent of this patch is to first get the basic functionality of hw breakpoints working and leave it to the person debugging the kernel to understand what hw breakpoints are in use and what restrictions have been imposed as a result. Breakpoint constraints will be dealt with in a future patch. While atomic, the x86 specific kgdb code will call arch_uninstall_hw_breakpoint() and arch_install_hw_breakpoint() to manage the cpu specific hw breakpoints. The net result of these changes allow kgdb to use the same pool of hw_breakpoints that are used by the perf event API, but neither knows about future reservations for the available hw breakpoint slots. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: torvalds@linux-foundation.org LKML-Reference: <1264719883-7285-2-git-send-email-jason.wessel@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-30 08:42:20 +01:00
Ingo Molnar	ae7f6711d6	Merge branch 'perf/urgent' into perf/core Merge reason: We want to queue up a dependent patch. Also update to later -rc's. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-29 10:36:22 +01:00
John Stultz	7e1b584774	ntp: Cleanup xtime references in ntp.c ntp.c doesn't need to access timekeeping internals directly, so change xtime references to use the get_seconds() timekeeping interface. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: richard@rsk.demon.co.uk LKML-Reference: <1264738844-21935-1-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-01-29 10:15:19 +01:00
john stultz	1f5b8f8a20	ntp: Make time_esterror and time_maxerror static Make time_esterror and time_maxerror static as no one uses them outside of ntp.c Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: richard@rsk.demon.co.uk LKML-Reference: <1264719761.3437.47.camel@localhost.localdomain> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-01-29 10:15:19 +01:00
Peter Zijlstra	75c9f3284a	perf_events: Fix sample_period transfer on inherit One problem with frequency driven counters is that we cannot predict the rate at which they trigger, therefore we have to start them at period=1, this causes a ramp up effect. However, if we fail to propagate the stable state on fork each new child will have to ramp up again. This can lead to significant artifacts in sample data. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: eranian@google.com Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1264752266.4283.2121.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-29 09:15:26 +01:00
Xiao Guangrong	1e12a4a7a3	tracing/kprobe: Cleanup unused return value of tracing functions The return values of the kprobe's tracing functions are meaningless, lets remove these. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4B60E9A3.2040505@cn.fujitsu.com> [fweisbec@gmail: whitespace fixes, drop useless void returns in end of functions] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-01-29 02:14:40 +01:00
Xiao Guangrong	430ad5a600	perf: Factorize trace events raw sample buffer operations Introduce ftrace_perf_buf_prepare() and ftrace_perf_buf_submit() to gather the common code that operates on raw events sampling buffer. This cleans up redundant code between regular trace events, syscall events and kprobe events. Changelog v1->v2: - Rename function name as per Masami and Frederic's suggestion - Add __kprobes for ftrace_perf_buf_prepare() and make ftrace_perf_buf_submit() inline as per Masami's suggestion - Export ftrace_perf_buf_prepare since modules will use it Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4B60E92D.9000808@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-01-29 02:02:57 +01:00
Lai Jiangshan	ea2c68a08f	tracing: Simplify test for function_graph tracing start point In the function graph tracer, a calling function is to be traced only when it is enabled through the set_graph_function file, or when it is nested in an enabled function. Current code uses TSK_TRACE_FL_GRAPH to test whether it is nested or not. Looking at the code, we can get this: (trace->depth > 0) <==> (TSK_TRACE_FL_GRAPH is set) trace->depth is more explicit to tell that it is nested. So we use trace->depth directly and simplify the code. No functionality is changed. TSK_TRACE_FL_GRAPH is not removed yet, it is left for future usage. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B4DB0B6.7040607@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-01-29 01:05:12 +01:00
Mahesh Salgaonkar	b23ff0e933	hw_breakpoints: Release the bp slot if arch_validate_hwbkpt_settings() fails. On a given architecture, when hardware breakpoint registration fails due to un-supported access type (read/write/execute), we lose the bp slot since register_perf_hw_breakpoint() does not release the bp slot on failure. Hence, any subsequent hardware breakpoint registration starts failing with 'no space left on device' error. This patch introduces error handling in register_perf_hw_breakpoint() function and releases bp slot on error. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: K. Prasad <prasad@linux.vnet.ibm.com> Cc: Maneesh Soni <maneesh@in.ibm.com> LKML-Reference: <20100121125516.GA32521@in.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-01-28 14:15:51 +01:00
Frans Pop	9d3cfc4c1d	sched: Correct printk whitespace in warning from cpu down task check Due to an incorrect line break the output currently contains tabs. Also remove trailing space. The actual output that logcheck sent me looked like this: Task events/1 (pid = 10) is on cpu 1^I^I^I^I(state = 1, flags = 84208040) After this patch it becomes: Task events/1 (pid = 10) is on cpu 1 (state = 1, flags = 84208040) Signed-off-by: Frans Pop <elendilplanet.nl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <201001251456.34996.elendil@planet.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-28 06:59:55 +01:00
Peter Zijlstra	11854247e2	sched: Fix incorrect sanity check We moved to migrate on wakeup, which means that sleeping tasks could still be present on offline cpus. Amend the check to only test running tasks. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-28 06:59:51 +01:00
Peter Zijlstra	abd5071394	perf: Reimplement frequency driven sampling There was a bug in the old period code that caused intel_pmu_enable_all() or native_write_msr_safe() to show up quite high in the profiles. In staring at that code it made my head hurt, so I rewrote it in a hopefully simpler fashion. Its now fully symetric between tick and overflow driven adjustments and uses less data to boot. The only complication is that it basically wants to do a u128 division. The code approximates that in a rather simple truncate until it fits fashion, taking care to balance the terms while truncating. This version does not generate that sampling artefact. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-27 08:39:33 +01:00
Oleg Nesterov	48d5067417	lockdep: Fix check_usage_backwards() error message Lockdep has found the real bug, but the output doesn't look right to me: > ========================================================= > [ INFO: possible irq lock inversion dependency detected ] > 2.6.33-rc5 #77 > --------------------------------------------------------- > emacs/1609 just changed the state of lock: > (&(&tty->ctrl_lock)->rlock){+.....}, at: [<ffffffff8127c648>] tty_fasync+0xe8/0x190 > but this lock took another, HARDIRQ-unsafe lock in the past: > (&(&sighand->siglock)->rlock){-.....} "HARDIRQ-unsafe" and "this lock took another" looks wrong, afaics. > ... key at: [<ffffffff81c054a4>] __key.46539+0x0/0x8 > ... acquired at: > [<ffffffff81089af6>] __lock_acquire+0x1056/0x15a0 > [<ffffffff8108a0df>] lock_acquire+0x9f/0x120 > [<ffffffff81423012>] _raw_spin_lock_irqsave+0x52/0x90 > [<ffffffff8127c1be>] __proc_set_tty+0x3e/0x150 > [<ffffffff8127e01d>] tty_open+0x51d/0x5e0 The stack-trace shows that this lock (ctrl_lock) was taken under ->siglock (which is hopefully irq-safe). This is a clear typo in check_usage_backwards() where we tell the print a fancy routine we're forwards. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100126181641.GA10460@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-27 08:34:02 +01:00
Mike Frysinger	0368897034	tracing/documentation: Cover new frame pointer semantics Update the graph tracer examples to cover the new frame pointer semantics (in terms of passing it along). Move the HAVE_FUNCTION_GRAPH_FP_TEST docs out of the Kconfig, into the right place, and expand on the details. Signed-off-by: Mike Frysinger <vapier@gentoo.org> LKML-Reference: <1264165967-18938-1-git-send-email-vapier@gentoo.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-26 17:00:39 -05:00
Steven Rostedt	3c05d74827	ring-buffer: Check for end of page in iterator If the iterator comes to an empty page for some reason, or if the page is emptied by a consuming read. The iterator code currently does not check if the iterator is pass the contents, and may return a false entry. This patch adds a check to the ring buffer iterator to test if the current page has been completely read and sets the iterator to the next page if necessary. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-26 16:14:08 -05:00
Steven Rostedt	492a74f421	ring-buffer: Check if ring buffer iterator has stale data Usually reads of the ring buffer is performed by a single task. There are two types of reads from the ring buffer. One is a consuming read which will consume the entry that was read and the next read will be the entry that follows. The other is an iterator that will let the user read the contents of the ring buffer without modifying it. When an iterator is allocated, writes to the ring buffer are disabled to protect the iterator. The problem exists when consuming reads happen while an iterator is allocated. Specifically, the kind of read that swaps out an entire page (used by splice) and replaces it with a new read. If the iterator is on the page that is swapped out, then the next read may read from this swapped out page and return garbage. This patch adds a check when reading the iterator to make sure that the iterator contents are still valid. If a consuming read has taken place, the iterator is reset. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-26 16:09:30 -05:00
Thomas Gleixner	7b7422a566	clocksource: Prevent potential kgdb dead lock commit `0f8e8ef7` (clocksource: Simplify clocksource watchdog resume logic) introduced a potential kgdb dead lock. When the kernel is stopped by kgdb inside code which holds watchdog_lock then kgdb dead locks in clocksource_resume_watchdog(). clocksource_resume_watchdog() is called from kbdg via clocksource_touch_watchdog() to avoid that the clock source watchdog marks TSC unstable after the kernel has been stopped. Solve this by replacing spin_lock with a spin_trylock and just return in case the lock is held. Not resetting the watchdog might result in TSC becoming marked unstable, but that's an acceptable penalty for using kgdb. The timekeeping is anyway easily screwed up by kgdb when the system uses either jiffies or a clock source which wraps in short intervals (e.g. pm_timer wraps about every 4.6s), so we really do not have to worry about that occasional TSC marked unstable side effect. The second caller of clocksource_resume_watchdog() is clocksource_resume(). The trylock is safe here as well because the system is UP at this point, interrupts are disabled and nothing else can hold watchdog_lock(). Reported-by: Jason Wessel <jason.wessel@windriver.com> LKML-Reference: <1264480000-6997-4-git-send-email-jason.wessel@windriver.com> Cc: kgdb-bugreport@lists.sourceforge.net Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-01-26 14:53:16 +01:00
Steven Rostedt	74bf4076f2	tracing: Prevent kernel oops with corrupted buffer If the contents of the ftrace ring buffer gets corrupted and the trace file is read, it could create a kernel oops (usualy just killing the user task thread). This is caused by the checking of the pid in the buffer. If the pid is negative, it still references the cmdline cache array, which could point to an invalid address. The simple fix is to test for negative PIDs. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-25 15:11:53 -05:00
Linus Torvalds	f6760aa024	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clockevent: Don't remove broadcast device when cpu is dead	2010-01-24 10:38:07 -08:00
Linus Torvalds	b8be634e01	Merge git://git.infradead.org/~dwmw2/mtd-2.6.33 * git://git.infradead.org/~dwmw2/mtd-2.6.33: mtd: tests: fix read, speed and stress tests on NOR flash mtd: Really add ARM pismo support kmsg_dump: Dump on crash_kexec as well	2010-01-24 10:31:34 -08:00
Thomas Gleixner	60db48cacb	sched: Queue a deboosted task to the head of the RT prio queue rtmutex_set_prio() is used to implement priority inheritance for futexes. When a task is deboosted it gets enqueued at the tail of its RT priority list. This is violating the POSIX scheduling semantics: rt priority list X contains two runnable tasks A and B task A runs with priority X and holds mutex M task C preempts A and is blocked on mutex M -> task A is boosted to priority of task C (Y) task A unlocks the mutex M and deboosts itself -> A is dequeued from rt priority list Y -> A is enqueued to the tail of rt priority list X task C schedules away task B runs This is wrong as task A did not schedule away and therefor violates the POSIX scheduling semantics. Enqueue the task to the head of the priority list instead. Reported-by: Mathias Weber <mathias.weber.mw1@roche.com> Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Carsten Emde <cbe@osadl.org> Tested-by: Mathias Weber <mathias.weber.mw1@roche.com> LKML-Reference: <20100120171629.809074113@linutronix.de>	2010-01-22 18:09:59 +01:00
Thomas Gleixner	37dad3fce9	sched: Implement head queueing for sched_rt The ability of enqueueing a task to the head of a SCHED_FIFO priority list is required to fix some violations of POSIX scheduling policy. Implement the functionality in sched_rt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Carsten Emde <cbe@osadl.org> Tested-by: Mathias Weber <mathias.weber.mw1@roche.com> LKML-Reference: <20100120171629.772169931@linutronix.de>	2010-01-22 18:09:59 +01:00
Thomas Gleixner	ea87bb7853	sched: Extend enqueue_task to allow head queueing The ability of enqueueing a task to the head of a SCHED_FIFO priority list is required to fix some violations of POSIX scheduling policy. Extend the related functions with a "head" argument. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Carsten Emde <cbe@osadl.org> Tested-by: Mathias Weber <mathias.weber.mw1@roche.com> LKML-Reference: <20100120171629.734886007@linutronix.de>	2010-01-22 18:09:59 +01:00
Peter Zijlstra	fabf318e5e	sched: Fix fork vs hotplug vs cpuset namespaces There are a number of issues: 1) TASK_WAKING vs cgroup_clone (cpusets) copy_process(): sched_fork() child->state = TASK_WAKING; /* waiting for wake_up_new_task() */ if (current->nsproxy != p->nsproxy) ns_cgroup_clone() cgroup_clone() mutex_lock(inode->i_mutex) mutex_lock(cgroup_mutex) cgroup_attach_task() ss->can_attach() ss->attach() [ -> cpuset_attach() ] cpuset_attach_task() set_cpus_allowed_ptr(); while (child->state == TASK_WAKING) cpu_relax(); will deadlock the system. 2) cgroup_clone (cpusets) vs copy_process So even if the above would work we still have: copy_process(): if (current->nsproxy != p->nsproxy) ns_cgroup_clone() cgroup_clone() mutex_lock(inode->i_mutex) mutex_lock(cgroup_mutex) cgroup_attach_task() ss->can_attach() ss->attach() [ -> cpuset_attach() ] cpuset_attach_task() set_cpus_allowed_ptr(); ... p->cpus_allowed = current->cpus_allowed over-writing the modified cpus_allowed. 3) fork() vs hotplug if we unplug the child's cpu after the sanity check when the child gets attached to the task_list but before wake_up_new_task() shit will meet with fan. Solve all these issues by moving fork cpu selection into wake_up_new_task(). Reported-by: Serge E. Hallyn <serue@us.ibm.com> Tested-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1264106190.4283.1314.camel@laptop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-01-21 23:25:31 +01:00
Linus Torvalds	e80b135985	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: x86: Add support for the ANY bit perf: Change the is_software_event() definition perf: Honour event state for aux stream data perf: Fix perf_event_do_pending() fallback callsite perf kmem: Print usage help for unknown commands perf kmem: Increase "Hit" column length hw-breakpoints, perf: Fix broken mmiotrace due to dr6 by reference change perf timechart: Use tid not pid for COMM change	2010-01-21 08:50:04 -08:00
Peter Zijlstra	22e190851f	perf: Honour event state for aux stream data Anton reported that perf record kept receiving events even after calling ioctl(PERF_EVENT_IOC_DISABLE). It turns out that FORK,COMM and MMAP events didn't respect the disabled state and kept flowing in. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Anton Blanchard <anton@samba.org> LKML-Reference: <1263459187.4244.265.camel@laptop> CC: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:40 +01:00
Peter Zijlstra	fe432200ab	perf: Fix perf_event_do_pending() fallback callsite Paul questioned the context in which we should call perf_event_do_pending(). After looking at that I found that it should be called from IRQ context these days, however the fallback call-site is placed in softirq context. Ammend this by placing the callback in the IRQ timer path. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1263374859.4244.192.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:39 +01:00
Dhaval Giani	7c9414385e	sched: Remove USER_SCHED Remove the USER_SCHED feature. It has been scheduled to be removed in 2.6.34 as per http://marc.info/?l=linux-kernel&m=125728479022976&w=2 Signed-off-by: Dhaval Giani <dhaval.giani@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1263990378.24844.3.camel@localhost> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:18 +01:00
Gautham R Shenoy	871e35bc97	sched: Fix the place where group powers are updated We want to update the sched_group_powers when balance_cpu == this_cpu. Currently the group powers are updated only if the balance_cpu is the first CPU in the local group. But balance_cpu = this_cpu could also be the first idle cpu in the group. Hence fix the place where the group powers are updated. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1264017764.5717.127.camel@jschopp-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:17 +01:00
Peter Zijlstra	8f190fb3f7	sched: Assume *balance is valid Since all load_balance() callers will have !NULL balance parameters we can now assume so and remove a few checks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:15 +01:00
Peter Zijlstra	f492e12ef0	sched: Remove load_balance_newidle() The two functions: load_balance{,_newidle}() are very similar, with the following differences: - rq->lock usage - sb->balance_interval updates - *balance check So remove the load_balance_newidle() call with load_balance(.idle = CPU_NEWLY_IDLE), explicitly unlock the rq->lock before calling (would be done by double_lock_balance() anyway), and ignore the other differences for now. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:14 +01:00
Peter Zijlstra	1af3ed3ddf	sched: Unify load_balance{,_newidle}() load_balance() and load_balance_newidle() look remarkably similar, one key point they differ in is the condition on when to active balance. So split out that logic into a separate function. One side effect is that previously load_balance_newidle() used to fail and return -1 under these conditions, whereas now it doesn't. I've not yet fully figured out the whole -1 return case for either load_balance{,_newidle}(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:13 +01:00
Peter Zijlstra	baa8c1102f	sched: Add a lock break for PREEMPT=y Since load-balancing can hold rq->locks for quite a long while, allow breaking out early when there is lock contention. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:13 +01:00
Peter Zijlstra	230059de77	sched: Remove from fwd decls Move code around to get rid of fwd declarations. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:12 +01:00
Peter Zijlstra	897c395f4c	sched: Remove rq_iterator from move_one_task Again, since we only iterate the fair class, remove the abstraction. Since this is the last user of the rq_iterator, remove all that too. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:11 +01:00
Peter Zijlstra	ee00e66fff	sched: Remove rq_iterator usage from load_balance_fair Since we only ever iterate the fair class, do away with this abstraction. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:10 +01:00
Peter Zijlstra	3d45fd804a	sched: Remove the sched_class load_balance methods Take out the sched_class methods for load-balancing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:09 +01:00
Peter Zijlstra	1e3c88bdeb	sched: Move load balance code into sched_fair.c Straight fwd code movement. Since non of the load-balance abstractions are used anymore, do away with them and simplify the code some. In preparation move the code around. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:40:08 +01:00
Yong Zhang	6d558c3ac9	sched: Reassign prev and switch_count when reacquire_kernel_lock() fail Assume A->B schedule is processing, if B have acquired BKL before and it need reschedule this time. Then on B's context, it will go to need_resched_nonpreemptible for reschedule. But at this time, prev and switch_count are related to A. It's wrong and will lead to incorrect scheduler statistics. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <2674af741001102238w7b0ddcadref00d345e2181d11@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:39:04 +01:00
Mike Galbraith	50b926e439	sched: Fix vmark regression on big machines SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't enabled, leading to many cache misses on large machines as we traverse looking for an idle shared cache to wake to. Change the enabler of select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the sibling domain level. Reported-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1262612696.15495.15.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-21 13:39:03 +01:00
Xiaotian Feng	ea9d8e3f45	clockevent: Don't remove broadcast device when cpu is dead Marc reported that the BUG_ON in clockevents_notify() triggers on his system. This happens because the kernel tries to remove an active clock event device (used for broadcasting) from the device list. The handling of devices which can be used as per cpu device and as a global broadcast device is suboptimal. The simplest solution for now (and for stable) is to check whether the device is used as global broadcast device, but this needs to be revisited. [ tglx: restored the cpuweight check and massaged the changelog ] Reported-by: Marc Dionne <marc.c.dionne@gmail.com> Tested-by: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> LKML-Reference: <1262834564-13033-1-git-send-email-dfeng@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2010-01-18 14:44:50 +01:00
Milton Miller	e03bcb6862	generic-ipi: Optimize accesses by using DEFINE_PER_CPU_SHARED_ALIGNED for IPI data The smp ipi data is passed around and given write access by other cpus and should be separated from per-cpu data consumed by this cpu. Looking for hot lines, I saw call_function_data shared with tick_cpu_sched. Signed-off-by: Milton Miller <miltonm@bga.com> Acked-by: Anton Blanchard <anton@samba.org> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: : Nick Piggin <npiggin@suse.de> LKML-Reference: <20100118020051.GR12666@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-18 09:02:59 +01:00
Ingo Molnar	f426a7e029	Merge branch 'perf/scheduling' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-01-18 08:56:41 +01:00
James Morris	2457552d1e	Merge branch 'master' into next	2010-01-18 09:56:22 +11:00
Frederic Weisbecker	329c0e012b	perf: Better order flexible and pinned scheduling When a task gets scheduled in. We don't touch the cpu bound events so the priority order becomes: cpu pinned, cpu flexible, task pinned, task flexible. So schedule out cpu flexibles when a new task context gets in and correctly order the groups to schedule in: task pinned, cpu flexible, task flexible. Cpu pinned groups don't need to be touched at this time. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-17 13:11:05 +01:00
Frederic Weisbecker	7defb0f879	perf: Don't schedule out/in pinned events on task tick We don't need to schedule in/out pinned events on task tick, now that pinned and flexible groups can be scheduled separately. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-17 13:09:51 +01:00
Frederic Weisbecker	5b0311e1f2	perf: Allow pinned and flexible groups to be scheduled separately Tune the scheduling helpers so that we can choose to schedule either pinned and/or flexible groups from a context. And while at it, refactor a bit the naming of these helpers to make these more consistent and flexible. There is no (intended) change in scheduling behaviour in this patch. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-17 13:08:57 +01:00
Frederic Weisbecker	42cce92f4d	perf: Make __perf_event_sched_out static __perf_event_sched_out doesn't need to be globally available, make it static. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-17 13:08:01 +01:00
Masami Hiramatsu	231e36f4d2	tracing/kprobe: Update kprobe tracing self test for new syntax Update kprobe tracing self test for new syntax (it supports deleting individual probes, and drops $argN support) and behavior change (new probes are disabled in default). This selftest includes the following checks: - Adding function-entry probe and return probe with arguments. - Enabling these probes. - Deleting it individually. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100114051211.7814.29436.stgit@localhost6.localdomain6> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-17 08:15:35 +01:00
H Hartley Sweeten	6d686f4564	sched: Don't expose local functions kernel/sched: don't expose local functions The get_rr_interval_* functions are all class methods of struct sched_class. They are not exported so make them static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <201001132021.53253.hartleys@visionengravers.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-17 08:09:45 +01:00
Frederic Weisbecker	24a53652e3	tracing: Drop the tr check from the graph tracing path Each time we save a function entry from the function graph tracer, we check if the trace array is set, which is wasteful because it is set anyway before we start the tracer. All we need is to ensure we have good read and write orderings. When we set the trace array, we just need to guarantee it to be visible before starting tracing. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> LKML-Reference: <1263453795-7496-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-17 08:06:25 +01:00
Linus Torvalds	2a8249daf6	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futexes: Remove rw parameter from get_futex_key()	2010-01-16 12:31:30 -08:00
Linus Torvalds	6ccc347b69	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing/filters: Add comment for match callbacks tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching lib: Introduce strnstr() tracing/filters: Fix MATCH_END_ONLY filter matching tracing/filters: Fix MATCH_FRONT_ONLY filter matching ftrace: Fix MATCH_END_ONLY function filter tracing/x86: Derive arch from bits argument in recordmcount.pl ring-buffer: Add rb_list_head() wrapper around new reader page next field ring-buffer: Wrap a list.next reference with rb_list_head()	2010-01-16 12:27:25 -08:00
David John	af2422c42c	smp_call_function_any(): pass the node value to cpumask_of_node() The change in acpi_cpufreq to use smp_call_function_any causes a warning when it is called since the function erroneously passes the cpu id to cpumask_of_node rather than the node that the cpu is on. Fix this. cpumask_of_node(3): node > nr_node_ids(1) Pid: 1, comm: swapper Not tainted 2.6.33-rc3-00097-g2c1f189 #223 Call Trace: [<ffffffff81028bb3>] cpumask_of_node+0x23/0x58 [<ffffffff81061f51>] smp_call_function_any+0x65/0xfa [<ffffffff810160d1>] ? do_drv_read+0x0/0x2f [<ffffffff81015fba>] get_cur_val+0xb0/0x102 [<ffffffff81016080>] get_cur_freq_on_cpu+0x74/0xc5 [<ffffffff810168a7>] acpi_cpufreq_cpu_init+0x417/0x515 [<ffffffff81562ce9>] ? __down_write+0xb/0xd [<ffffffff8148055e>] cpufreq_add_dev+0x278/0x922 Signed-off-by: David John <davidjon@xenontk.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:39 -08:00
Andi Kleen	5dab600e6a	kfifo: document everywhere that size has to be power of two On my first try using them I missed that the fifos need to be power of two, resulting in a runtime bug. Document that requirement everywhere (and fix one grammar bug) Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:38 -08:00
Andi Kleen	a5b9e2c106	kfifo: add kfifo_out_peek In some upcoming code it's useful to peek into a FIFO without permanentely removing data. This patch implements a new kfifo_out_peek() to do this. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:38 -08:00
Andi Kleen	64ce1037c5	kfifo: sanitize _user error handling Right now for kfifo__user it's not easily possible to distingush between a user copy failing and the FIFO not containing enough data. The problem is that both conditions are multiplexed into the same return code. Avoid this by moving the "copy length" into a separate output parameter and only return 0/-EFAULT in the main return value. I didn't fully adapt the weird "record" variants, those seem to be unused anyways and were rather messy (should they be just removed?) I would appreciate some double checking if I did all the conversions correctly. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:38 -08:00
Andi Kleen	8ecc295153	kfifo: use void * pointers for user buffers The pointers to user buffers are currently unsigned char , which requires a lot of casting in the caller for any non-char typed buffers. Use void instead. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Stefani Seibold <stefani@seibold.net> Cc: Roland Dreier <rdreier@cisco.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Andy Walls <awalls@radix.net> Cc: Vikram Dhillon <dhillonv10@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-16 12:15:38 -08:00
Frederic Weisbecker	d6f962b57b	perf: Export software-only event group characteristic as a flag Before scheduling an event group, we first check if a group can go on. We first check if the group is made of software only events first, in which case it is enough to know if the group can be scheduled in. For that purpose, we iterate through the whole group, which is wasteful as we could do this check when we add/delete an event to a group. So we create a group_flags field in perf event that can host characteristics from a group of events, starting with a first PERF_GROUP_SOFTWARE flag that reduces the check on the fast path. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-16 12:30:40 +01:00
Frederic Weisbecker	e286417378	perf: Round robin flexible groups of events using list_rotate_left() This is more proper that doing it through a list_for_each_entry() that breaks after the first entry. v2: Don't rotate pinned groups as its not needed to time share them. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-16 12:30:28 +01:00
Frederic Weisbecker	889ff01506	perf/core: Split context's event group list into pinned and non-pinned lists Split-up struct perf_event_context::group_list into pinned_groups and flexible_groups (non-pinned). This first appears to be useless as it duplicates various loops around the group list handlings. But it scales better in the fast-path in perf_sched_in(). We don't anymore iterate twice through the entire list to separate pinned and non-pinned scheduling. Instead we interate through two distinct lists. The another desired effect is that it makes easier to define distinct scheduling rules on both. Changes in v2: - Respectively rename pinned_grp_list and volatile_grp_list into pinned_groups and flexible_groups as per Ingo suggestion. - Various cleanups Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@infradead.org>	2010-01-16 12:27:42 +01:00
Paul E. McKenney	017c426138	rcu: Fix sparse warnings Rename local variable "i" in rcu_init() to avoid conflict with RCU_INIT_FLAVOR(), restrict the scope of RCU_TREE_NONCORE, and make __synchronize_srcu() static. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12635142581560-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-16 10:25:22 +01:00
Li Zefan	d1303dd1d6	tracing/filters: Add comment for match callbacks We should be clear on 2 things: - the length parameter of a match callback includes tailing '\0'. - the string to be searched might not be NULL-terminated. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8770.7000608@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:14 -05:00
Li Zefan	16da27a8bc	tracing/filters: Fix MATCH_FULL filter matching for PTR_STRING MATCH_FULL matching for PTR_STRING is not working correctly: # echo 'func == vt' > events/bkl/lock_kernel/filter # echo 1 > events/bkl/lock_kernel/enable ... # cat trace Xorg-1484 [000] 1973.392586: lock_kernel: ... func=vt_ioctl() gpm-1402 [001] 1974.027740: lock_kernel: ... func=vt_ioctl() We should pass to regex.match(..., len) the length (including '\0') of the source string instead of the length of the pattern string. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8763.5070707@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:12 -05:00
Li Zefan	b2af211f28	tracing/filters: Fix MATCH_MIDDLE_ONLY filter matching The @str might not be NULL-terminated if it's of type DYN_STRING or STATIC_STRING, so we should use strnstr() instead of strstr(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8753.2000102@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:11 -05:00
Li Zefan	a3291c14ec	tracing/filters: Fix MATCH_END_ONLY filter matching For '*foo' pattern, we should allow any string ending with 'foo', but event filtering incorrectly disallows strings like bar_foo_foo: Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8735.6070604@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:07 -05:00
Li Zefan	285caad415	tracing/filters: Fix MATCH_FRONT_ONLY filter matching MATCH_FRONT_ONLY actually is a full matching: # ./perf record -R -f -a -e lock:lock_acquire \ --filter 'name ~rcu_*' sleep 1 # ./perf trace (no output) We should pass the length of the pattern string to strncmp(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E8721.5090301@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:05 -05:00
Li Zefan	751e9983ee	ftrace: Fix MATCH_END_ONLY function filter For 'foo' pattern, we should allow any string ending with 'foo', but ftrace filter incorrectly disallows strings like bar_foo_foo: # echo 'io' > set_ftrace_filter # cat set_ftrace_filter \| grep 'req_bio_endio' # cat available_filter_functions \| grep 'req_bio_endio' req_bio_endio Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4B4E870E.6060607@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-14 22:38:03 -05:00
Jamie Iles	8381f65d09	sched/perf: Make sure irqs are disabled for perf_event_task_sched_in() perf_event_task_sched_in() expects interrupts to be disabled, but on architectures with __ARCH_WANT_INTERRUPTS_ON_CTXSW defined, this isn't true. If this is defined, disable irqs around the call in finish_task_switch(). Signed-off-by: Jamie Iles <jamie.iles@picochip.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> LKML-Reference: <1262964453-27370-1-git-send-email-jamie.iles@picochip.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 10:43:08 +01:00
Masami Hiramatsu	14640106f2	tracing/kprobe: Drop function argument access syntax Drop function argument access syntax, because the function arguments depend on not only architecture but also compile-options and function API. And now, we have perf-probe for finding register/memory assigned to each argument. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Neuling <mikey@neuling.org> Cc: linuxppc-dev@ozlabs.org LKML-Reference: <20100105224648.19431.52309.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 10:09:12 +01:00
Ingo Molnar	61405fea92	Merge branch 'perf/urgent' into perf/core Merge reason: queue up dependent patch, update to -rc4 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 10:08:50 +01:00
KOSAKI Motohiro	7485d0d375	futexes: Remove rw parameter from get_futex_key() Currently, futexes have two problem: A) The current futex code doesn't handle private file mappings properly. get_futex_key() uses PageAnon() to distinguish file and anon, which can cause the following bad scenario: 1) thread-A call futex(private-mapping, FUTEX_WAIT), it sleeps on file mapping object. 2) thread-B writes a variable and it makes it cow. 3) thread-B calls futex(private-mapping, FUTEX_WAKE), it wakes up blocked thread on the anonymous page. (but it's nothing) B) Current futex code doesn't handle zero page properly. Read mode get_user_pages() can return zero page, but current futex code doesn't handle it at all. Then, zero page makes infinite loop internally. The solution is to use write mode get_user_page() always for page lookup. It prevents the lookup of both file page of private mappings and zero page. Performance concerns: Probaly very little, because glibc always initialize variables for futex before to call futex(). It means glibc users never see the overhead of this patch. Compatibility concerns: This patch has few compatibility issues. After this patch, FUTEX_WAIT require writable access to futex variables (read-only mappings makes EFAULT). But practically it's not a problem, glibc always initalizes variables for futexes explicitly - nobody uses read-only mappings. Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Ulrich Drepper <drepper@gmail.com> LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:17:36 +01:00
Paul E. McKenney	b6407e8639	rcu: Give different levels of the rcu_node hierarchy distinct lockdep names Previously, each level of the rcu_node hierarchy had the same rather unimaginative name: "&rcu_node_class[i]". This makes lockdep diagnostics involving these lockdep classes less helpful than would be nice. This patch fixes this by giving each level of the rcu_node hierarchy a distinct name: "rcu_node_level_0", "rcu_node_level_1", and so on. This version of the patch includes improved diagnostics suggested by Josh Triplett and Peter Zijlstra. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626498421830-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:07 +01:00
Paul E. McKenney	cba8244a0f	rcu: Add debug check for too many rcu_read_unlock() TREE_PREEMPT_RCU maintains an rcu_read_lock_nesting counter in the task structure, which happens to be a signed int. So this patch adds a check for this counter being negative at the end of __rcu_read_unlock(). This check is under CONFIG_PROVE_LOCKING, so can be thought of as being part of lockdep. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626498423064-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:06 +01:00
Paul E. McKenney	bf66f18e79	rcu: Add force_quiescent_state() testing to rcutorture Add force_quiescent_state() testing to rcutorture, with a separate thread that repeatedly invokes force_quiescent_state() in bursts. This can greatly increase the probability of encountering certain types of race conditions. Suggested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1262646551116-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:05 +01:00
Paul E. McKenney	46a1e34eda	rcu: Make force_quiescent_state() start grace period if needed Grace periods cannot be started while force_quiescent_state() is active. This is OK in that the affected CPUs will try again later, but it does induce needless grace-period delays. This patch causes rcu_start_gp() to record a failed attempt to start a grace period. When force_quiescent_state() prepares to return, it then starts the grace period if there was such a failed attempt. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501854-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:05 +01:00
Paul E. McKenney	45f014c52e	rcu: Remove redundant grace-period check The rcu_process_dyntick() function checks twice for the end of the current grace period. However, it holds the current rcu_node structure's ->lock field throughout, and doesn't get to the second call to rcu_gp_in_progress() unless there is at least one CPU corresponding to this rcu_node structure that has not yet checked in for the current grace period, which would prevent the current grace period from ending. So the current grace period cannot have ended, and the second check is redundant, so remove it. Also, given that this function is used even with !CONFIG_NO_HZ, its name is quite misleading. Change from rcu_process_dyntick() to force_qs_rnp(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1262646550562-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:04 +01:00
Paul E. McKenney	ee47eb9f4d	rcu: Remove leg of force_quiescent_state() switch statement The comparisons of rsp->gpnum nad rsp->completed in rcu_process_dyntick() and force_quiescent_state() can be replaced by the much more clear rcu_gp_in_progress() predicate function. After doing this, it becomes clear that the RCU_SAVE_COMPLETED leg of the force_quiescent_state() function's switch statement is almost completely a no-op. A small change to the RCU_SAVE_DYNTICK leg renders it a complete no-op, after which it can be removed. Doing so also eliminates the forcenow local variable from force_quiescent_state(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501781-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:04 +01:00
Paul E. McKenney	0f10dc8266	rcu: Eliminate rcu_process_dyntick() return value Because a new grace period cannot start while we are executing within the force_quiescent_state() function's switch statement, if any test within that switch statement or within any function called from that switch statement shows that the current grace period has ended, we can safely re-do that test any time before we leave the switch statement. This means that we no longer need a return value from rcu_process_dyntick(), as we can simply invoke rcu_gp_in_progress() to check whether the old grace period has finished -- there is no longer any need to worry about whether or not a new grace period has been started. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501857-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:03 +01:00
Paul E. McKenney	eb1ba45f1e	rcu: Eliminate second argument of rcu_process_dyntick() At this point, the second argument to all calls to rcu_process_dyntick() is a function of the same field of the structure passed in as the first argument, namely, rsp->gpnum-1. So propagate rsp->gpnum-1 to all uses of the second argument within rcu_process_dyntick() and then eliminate the second argument. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465503786-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:03 +01:00
Paul E. McKenney	39c0bbfc07	rcu: Eliminate local variable lastcomp from force_quiescent_state() Because rsp->fqs_active is set to 1 across force_quiescent_state()'s switch statement, rcu_start_gp() will refrain from starting a new grace period during this time. Therefore, rsp->gpnum is constant, and can be propagated to all uses of lastcomp, eliminating this local variable. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465502985-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:03 +01:00
Paul E. McKenney	f3a8b5c6aa	rcu: Eliminate local variable signaled from force_quiescent_state() Because the root rcu_node lock is held across entry to the switch statement in force_quiescent_state(), it is no longer necessary to snapshot rsp->signaled to a local variable. Eliminate both the snapshotting and the local variable. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1262646550602-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:02 +01:00
Paul E. McKenney	07079d5357	rcu: Prohibit starting new grace periods while forcing quiescent states Reduce the number and variety of race conditions by prohibiting the start of a new grace period while force_quiescent_state() is active. A new fqs_active flag in the rcu_state structure is used to trace whether or not force_quiescent_state() is active, and this new flag is tested by rcu_start_gp(). If the CPU that closed out the last grace period needs another grace period, this new grace period may be delayed up to one scheduling-clock tick, but it will eventually get started. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <126264655052-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:02 +01:00
Paul E. McKenney	559569acf9	rcu: Adjust force_quiescent_state() locking, step 2 This patch releases rnp->lock after the end of force_quiescent_state()'s switch statement. This is a second step towards prohibiting starting grace periods while force_quiescent_state() is executing, which will reduce the number and complexity of races that force_quiescent_state() is involved in. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501994-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:01 +01:00
Paul E. McKenney	f96e9232e0	rcu: Adjust force_quiescent_state() locking, step 1 This causes rnp->lock to be held on entry to force_quiescent_state()'s switch statement. This is a first step towards prohibiting starting grace periods while force_quiescent_state() is executing, which will reduce the number and complexity of races that force_quiescent_state() is involved in. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12626465501455-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-01-13 09:06:01 +01:00
Andi Kleen	b45c6e76bc	kernel/signal.c: fix kernel information leak with print-fatal-signals=1 When print-fatal-signals is enabled it's possible to dump any memory reachable by the kernel to the log by simply jumping to that address from user space. Or crash the system if there's some hardware with read side effects. The fatal signals handler will dump 16 bytes at the execution address, which is fully controlled by ring 3. In addition when something jumps to a unmapped address there will be up to 16 additional useless page faults, which might be potentially slow (and at least is not very efficient) Fortunately this option is off by default and only there on i386. But fix it by checking for kernel addresses and also stopping when there's a page fault. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-11 09:34:05 -08:00
Dave Anderson	bd4f490a07	cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput() The LTP cgroup test suite generates a "kernel BUG at kernel/cgroup.c:790!" here in cgroup_diput(): /* * if we're getting rid of the cgroup, refcount should ensure * that there are no pidlists left. */ BUG_ON(!list_empty(&cgrp->pidlists)); The cgroup pidlist rework in 2.6.32 generates the BUG_ON, which is caused when pidlist_array_load() calls cgroup_pidlist_find(): (1) if a matching cgroup_pidlist is found, it down_write's the mutex of the pre-existing cgroup_pidlist, and increments its use_count. (2) if no matching cgroup_pidlist is found, then a new one is allocated, it down_write's its mutex, and the use_count is set to 0. (3) the matching, or new, cgroup_pidlist gets returned back to pidlist_array_load(), which increments its use_count -- regardless whether new or pre-existing -- and up_write's the mutex. So if a matching list is ever encountered by cgroup_pidlist_find() during the life of a cgroup directory, it results in an inflated use_count value, preventing it from ever getting released by cgroup_release_pid_array(). Then if the directory is subsequently removed, cgroup_diput() hits the BUG_ON() when it finds that the directory's cgroup is still populated with a pidlist. The patch simply removes the use_count increment when a matching pidlist is found by cgroup_pidlist_find(), because it gets bumped by the calling pidlist_array_load() function while still protected by the list's mutex. Signed-off-by: Dave Anderson <anderson@redhat.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Ben Blum <bblum@andrew.cmu.edu> Cc: Paul Menage <menage@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-11 09:34:05 -08:00
Masami Hiramatsu	8767ba2796	kmod: fix resource leak in call_usermodehelper_pipe() Fix resource (write-pipe file) leak in call_usermodehelper_pipe(). When call_usermodehelper_exec() fails, write-pipe file is opened and call_usermodehelper_pipe() just returns an error. Since it is hard for caller to determine whether the error occured when opening the pipe or executing the helper, the caller cannot close the pipe by themselves. I've found this resoruce leak when testing coredump. You can check how the resource leaks as below; $ echo "\|nocommand" > /proc/sys/kernel/core_pattern $ ulimit -c unlimited $ while [ 1 ]; do ./segv; done &> /dev/null & $ cat /proc/meminfo (<- repeat it) where segv.c is; //----- int main () { char p = 0; p = 1; } //----- This patch closes write-pipe file if call_usermodehelper_exec() failed. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-11 09:34:04 -08:00
Steven Rostedt	0e1ff5d72a	ring-buffer: Add rb_list_head() wrapper around new reader page next field If the very unlikely case happens where the writer moves the head by one between where the head page is read and where the new reader page is assigned _and_ the writer then writes and wraps the entire ring buffer so that the head page is back to what was originally read as the head page, the page to be swapped will have a corrupted next pointer. Simple solution is to wrap the assignment of the next pointer with a rb_list_head(). Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 20:40:44 -05:00
David Sharp	5ded3dc6a3	ring-buffer: Wrap a list.next reference with rb_list_head() This reference at the end of rb_get_reader_page() was causing off-by-one writes to the prev pointer of the page after the reader page when that page is the head page, and therefore the reader page has the RB_PAGE_HEAD flag in its list.next pointer. This eventually results in a GPF in a subsequent call to rb_set_head_page() (usually from rb_get_reader_page()) when that prev pointer is dereferenced. The dereferenced register would characteristically have an address that appears shifted left by one byte (eg, ffxxxxxxxxxxxxyy instead of ffffxxxxxxxxxxxx) due to being written at an address one byte too high. Signed-off-by: David Sharp <dhsharp@google.com> LKML-Reference: <1262826727-9090-1-git-send-email-dhsharp@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 20:38:25 -05:00
Steven Rostedt	d931369b74	tracing: Add stack dump to trace_printk if stacktrace option is set If the ftrace stacktrace option is set, then add the stack dumps to trace_printk. Requested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 18:09:57 -05:00
Lai Jiangshan	7e53bd42d1	tracing: Consolidate protection of reader access to the ring buffer At the beginning, access to the ring buffer was fully serialized by trace_types_lock. Patch `d7350c3f45` gives more freedom to readers, and patch `b04cc6b1f6` adds code to protect trace_pipe and cpu#/trace_pipe. But actually it is not enough, ring buffer readers are not always read-only, they may consume data. This patch makes accesses to trace, trace_pipe, trace_pipe_raw cpu#/trace, cpu#/trace_pipe and cpu#/trace_pipe_raw serialized. And removes tracing_reader_cpumask which is used to protect trace_pipe. Details: Ring buffer serializes readers, but it is low level protection. The validity of the events (which returns by ring_buffer_peek() ..etc) are not protected by ring buffer. The content of events may become garbage if we allow another process to consume these events concurrently: A) the page of the consumed events may become a normal page (not reader page) in ring buffer, and this page will be rewritten by the events producer. B) The page of the consumed events may become a page for splice_read, and this page will be returned to system. This patch adds trace_access_lock() and trace_access_unlock() primitives. These primitives allow multi process access to different cpu ring buffers concurrently. These primitives don't distinguish read-only and read-consume access. Multi read-only access is also serialized. And we don't use these primitives when we open files, we only use them when we read files. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B447D52.1050602@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:51:34 -05:00
Lai Jiangshan	0fa0edaf32	tracing: Remove show_format and related macros from TRACE_EVENT The previous patches added the use of print_fmt string and changes the trace_define_field() function to also create the fields and format output for the event format files. text data bss dec hex filename 5857201 1355780 9336808 16549789 fc879d vmlinux 5884589 1351684 9337896 16574169 fce6d9 vmlinux-orig The above shows the size of the vmlinux after this patch set compared to the vmlinux-orig which is before the patch set. This saves us 27k on text, 1k on bss and adds just 4k of data. The total savings of 24k in size. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D4D.40604@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:08:46 -05:00
Lai Jiangshan	5a65e95622	tracing: Use defined fields and print_fmt to print formats The calls ftrace_format_##call() and ftrace_define_fields_##call() are almost duplicate in functionality. With the addition of the print_fmt in previous patches, these two functions can be merged into one. The trace_define_field() defines the fields and links them into the struct ftrace_event_call. The previous patches introduced the print_fmt field and this can now be used with the trace_define_field() to create the event format file fields and print_fmt field. The struct ftrace_event_call->fields are used to print the fields The struct ftrace_event_call->print_fmt is used to print the "print fmt: XXXXXXXXXXX" line. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D49.5000006@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:08:20 -05:00
Steven Rostedt	c7ef3a9004	tracing: Have syscall tracing call its own init function In the clean up of having all events call one specific function, the syscall event init was changed to call this helper function. With the new print_fmt updates, the syscalls need to do special initializations. This patch converts the syscall events to call its own init function again. Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:02:32 -05:00
Lai Jiangshan	a342a0280b	tracing/kprobes: Init print_fmt for kprobe events This is part of a patch set that removes the show_format method in the ftrace event macros. Add the print_fmt initialization to the kprobe events. The print_fmt is still not used, but will be in the follow up patches. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D45.3080100@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 12:01:35 -05:00
Lai Jiangshan	50307a45f8	tracing/syscalls: Init print_fmt for syscall events This is part of a patch set that removes the show_format method in the ftrace event macros. Add the print_fmt initialization to the syscall events. The print_fmt is still not used, but will be in the follow up patches. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D41.609@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 11:58:32 -05:00
Lai Jiangshan	509e760cd9	tracing: Add print_fmt field This is part of a patch set that removes the show_format method in the ftrace event macros. The print_fmt field is added to hold the string that shows the print_fmt in the event format files. This patch only adds the field but it is currently not used. Later patches will use this field to enable us to remove the show_format field and function. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D3E.2000704@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 11:41:54 -05:00
Lai Jiangshan	809826a389	tracing: Have __dynamic_array() define a field This is part of a patch set that removes the show_format method in the ftrace event macros. This patch set requires that all fields are added to the ftrace_event_call->fields. This patch changes __dynamic_array() to call trace_define_field() to include fields that use __dynamic_array(). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D36.8090100@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-01-06 11:30:02 -05:00
Ben Hutchings	10b465aaf9	modules: Skip empty sections when exporting section notes Commit `35dead4` "modules: don't export section names of empty sections via sysfs" changed the set of sections that have attributes, but did not change the iteration over these attributes in add_notes_attrs(). This can lead to add_notes_attrs() creating attributes with the wrong names or with null name pointers. Introduce a sect_empty() function and use it in both add_sect_attrs() and add_notes_attrs(). Reported-by: Martin Michlmayr <tbm@cyrius.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Tested-by: Martin Michlmayr <tbm@cyrius.com> Cc: stable@kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-01-06 01:11:29 -08:00
Steffen Klassert	16295bec63	padata: Generic parallelization/serialization interface This patch introduces an interface to process data objects in parallel. The parallelized objects return after serialization in the same order as they were before the parallelization. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-01-06 19:47:10 +11:00
Christoph Lameter	79615760f3	local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c ringbuffer*.c are the last users of local.h. Remove the include from modules.h and add it to ringbuffer files. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-01-05 15:34:50 +09:00
Christoph Lameter	e1783a240f	module: Use this_cpu_xx to dynamically allocate counters Use cpu ops to deal with the per cpu data instead of a local_t. Reduces memory requirements, cache footprint and decreases cycle counts. The this_cpu_xx operations are also used for !SMP mode. Otherwise we could not drop the use of __module_ref_addr() which would make per cpu data handling complicated. this_cpu_xx operations have their own fallback for !SMP. V8-V9: - Leave include asm/module.h since ringbuffer.c depends on it. Nothing else does though. Another patch will deal with that. - Remove spurious free. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Tejun Heo <tj@kernel.org>	2010-01-05 15:34:50 +09:00
Tejun Heo	32032df6c2	Merge branch 'master' into percpu Conflicts: arch/powerpc/platforms/pseries/hvCall.S include/linux/percpu.h	2010-01-05 09:17:33 +09:00
Linus Torvalds	952363c90c	Merge branch 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix NULL deref in inheritance code perf: Pass appropriate frame pointer to dump_trace()	2009-12-31 11:56:24 -08:00
Linus Torvalds	9d6e323c68	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf kmem: Fix statistics typo kprobes: Fix distinct type warning perf: Rename perf_event_hw_event in design document perf tools: Add missing header files to LIB_H Makefile variable perf record: We should fork only if a program was specified to run perf diff: Fix usage array, it must end with a NULL entry	2009-12-31 11:52:24 -08:00
Linus Torvalds	b21c070403	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix sign fields in ftrace_define_fields_##call() tracing/syscalls: Fix typo in SYSCALL_DEFINE0 tracing/kprobe: Show sign of fields in trace_kprobe format files ksym_tracer: Remove trace_stat ksym_tracer: Fix race when incrementing count ksym_tracer: Fix to allow writing newline to ksym_trace_filter ksym_tracer: Fix to make the tracer work tracing: Kconfig spelling fixes and cleanups tracing: Fix setting tracer specific options Documentation: Update ftrace-design.txt Documentation: Update tracepoint-analysis.txt Documentation: Update mmiotrace.txt	2009-12-31 11:52:01 -08:00
KOSAKI Motohiro	0f4bd46ec2	kmsg_dump: Dump on crash_kexec as well crash_kexec gets called before kmsg_dump(KMSG_DUMP_OOPS) if panic_on_oops is set, so the kernel log buffer is not stored for this case. This patch adds a KMSG_DUMP_KEXEC dump type which gets called when crash_kexec() is invoked. To avoid getting double dumps, the old KMSG_DUMP_PANIC is moved below crash_kexec(). The mtdoops driver is modified to handle KMSG_DUMP_KEXEC in the same way as a panic. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Simon Kagstrom <simon.kagstrom@netinsight.net> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-12-31 19:45:04 +00:00
Peter Zijlstra	05cbaa2853	perf: Fix NULL deref in inheritance code Liming found a NULL deref when a task has a perf context but no counters when it forks. This can occur in two cases, a race during construction where the fork hits after installing the context but before the first counter gets inserted, or more reproducably, a fork after the last counter is closed (which leaves the context around). Reported-by: Wang Liming <liming.wang@windriver.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> CC: <stable@kernel.org> LKML-Reference: <1262185684.7135.222.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-31 13:11:31 +01:00
Lai Jiangshan	fb7ae981cb	tracing: Fix sign fields in ftrace_define_fields_##call() Add is_signed_type() call to trace_define_field() in ftrace macros. The code previously just passed in 0 (false), disregarding whether or not the field was actually a signed type. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D3A.6020007@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-30 10:27:06 -05:00
Lai Jiangshan	79b4082108	tracing/kprobe: Show sign of fields in trace_kprobe format files The format files of trace_kprobe do not show the sign of the fields. The other format files show the field signed type of the fields and this patch makes the trace_kprobe formats consistent with the others. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4B273D27.5040009@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-30 10:27:03 -05:00
Li Zefan	53ab668064	ksym_tracer: Remove trace_stat trace_stat is problematic. Don't use it, use seqfile instead. This fixes a race that reading the stat file is not protected by any lock, which can lead to use after free. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B3AF203.40200@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-30 07:50:50 +01:00
Li Zefan	e6d9491bf8	ksym_tracer: Fix race when incrementing count We are under rcu read section but not holding the write lock, so count++ is not atomic. Use atomic64_t instead. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B3AF1EC.9010608@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-30 07:50:49 +01:00
Li Zefan	3d13ec2efd	ksym_tracer: Fix to allow writing newline to ksym_trace_filter It used to work, but now doesn't: # echo > ksym_filter bash: echo: write error: Invalid argument It's caused by `d954fbf0ff` ("tracing: Fix wrong usage of strstrip in trace_ksyms"). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B3AF1D7.5040400@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-30 07:50:49 +01:00
Li Zefan	88f7a890d7	ksym_tracer: Fix to make the tracer work ksym tracer doesn't work: # echo tasklist_lock:rw- > ksym_trace_filter -bash: echo: write error: No such device It's because we pass to perf_event_create_kernel_counter() a cpu number which is not present. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B3AF19E.1010201@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-30 07:50:47 +01:00
Simon Kagstrom	d894837f23	sched: might_sleep(): Make file parameter const char * Fixes a warning when building with g++: warning: deprecated conversion from string constant to 'char*' And the file parameter use is constant, so mark it as such. Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net> Cc: peterz@infradead.org LKML-Reference: <20091223110818.442d848e@marrow.netinsight.se> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 10:50:13 +01:00
Randy Dunlap	40892367bc	tracing: Kconfig spelling fixes and cleanups Fix filename reference (ftrace-implementation.txt -> ftrace-design.txt). Fix spelling, punctuation, grammar. Fix help text indentation and line lengths to reduce need for horizontal scrolling or larger window sizes. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091221120117.3fb49cdc.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 10:37:54 +01:00
Li Zefan	07b139c8c8	perf events: Remove CONFIG_EVENT_PROFILE Quoted from Ingo: \| This reminds me - i think we should eliminate CONFIG_EVENT_PROFILE - \| it's an unnecessary Kconfig complication. If both PERF_EVENTS and \| EVENT_TRACING is enabled we should expose generic tracepoints. \| \| Nor is it limited to event 'profiling', so it has become a misnomer as \| well. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B2F1557.2050705@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 10:33:06 +01:00
Heiko Carstens	c2ef6661ce	kprobes: Fix distinct type warning Every time I see this: kernel/kprobes.c: In function 'register_kretprobe': kernel/kprobes.c:1038: warning: comparison of distinct pointer types lacks a cast I'm wondering if something changed in common code and we need to do something for s390. Apparently that's not the case. Let's get rid of this annoying warning. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> LKML-Reference: <20091221120224.GA4471@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 10:25:31 +01:00
Peter Zijlstra	49f474331e	perf events: Remove arg from perf sched hooks Since we only ever schedule the local cpu, there is no need to pass the cpu number to the perf sched hooks. This micro-optimizes things a bit. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-28 09:21:33 +01:00
Linus Torvalds	0b5e2588d8	Merge branch 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6 * 'sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6: SYSCTL: Add a mutex to the page_alloc zone order sysctl SYSCTL: Print binary sysctl warnings (nearly) only once	2009-12-24 13:01:29 -08:00
Andi Kleen	4440095c82	SYSCTL: Print binary sysctl warnings (nearly) only once When printing legacy sysctls print the warning message for each of them only once. This way there is a guarantee the syslog won't be flooded for any sane program. The original attempt at this made the tables non const and stored the flag inline. Linus suggested using a separate hash table for this, this is based on a code snippet from him. The hash implies this is not exact and can sometimes not print a new sysctl due to a hash collision, but in practice this should not be a problem I used a FNV32 hash over the binary string with a 32byte bitmap. This gives relatively little collisions when all the predefined binary sysctls are hashed: size 256 bucket length number 0: [25] 1: [67] 2: [88] 3: [47] 4: [22] 5: [6] 6: [1] The worst case is a single collision of 6 hash values. Signed-off-by: Andi Kleen <ak@linux.intel.com>	2009-12-23 21:00:20 +01:00
Linus Torvalds	6432ed648a	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Revert `738d2be`, simplify set_task_cpu()	2009-12-23 09:12:57 -08:00
Peter Zijlstra	0c69774e6c	sched: Revert `738d2be`, simplify set_task_cpu() Effectively reverts `738d2be430`. As demonstrated by Eric, we really need to call __set_task_cpu() early in the fork() path to properly initialize the various task state -- specifically the cgroup state through set_task_rq(). [ we could probably fix this by explicitly calling __set_task_cpu() from sched_fork(), but lets try that for the next cycle and simply revert to the old behaviour for now. ] Reported-by: Eric Paris <eparis@redhat.com> Tested-by: Eric Paris <eparis@redhat.com>, Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: efault@gmx.de LKML-Reference: <1261492999.4937.36.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-23 10:04:10 +01:00
Linus Torvalds	fe35d4a028	Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: jfs: Fix 32bit build warning Remove obsolete comment in fs.h Sanitize f_flags helpers Fix f_flags/f_mode in case of lookup_instantiate_filp() from open(pathname, 3) anonfd: Allow making anon files read-only fs/compat_ioctl.c: fix build error when !BLOCK pohmelfs needs I_LOCK alloc_file(): simplify handling of mnt_clone_write() errors	2009-12-22 14:20:48 -08:00
Stefani Seibold	86d4880313	kfifo: add record handling functions Add kfifo_in_rec() - puts some record data into the FIFO Add kfifo_out_rec() - gets some record data from the FIFO Add kfifo_from_user_rec() - puts some data from user space into the FIFO Add kfifo_to_user_rec() - gets data from the FIFO and write it to user space Add kfifo_peek_rec() - gets the size of the next FIFO record field Add kfifo_skip_rec() - skip the next fifo out record Add kfifo_avail_rec() - determinate the number of bytes available in a record FIFO Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	a121f24acc	kfifo: add kfifo_skip, kfifo_from_user and kfifo_to_user Add kfifo_reset_out() for save lockless discard the fifo output Add kfifo_skip() to skip a number of output bytes Add kfifo_from_user() to copy user space data into the fifo Add kfifo_to_user() to copy fifo data to user space Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	7acd72eb85	kfifo: rename kfifo_put... into kfifo_in... and kfifo_get... into kfifo_out... rename kfifo_put... into kfifo_in... to prevent miss use of old non in kernel-tree drivers ditto for kfifo_get... -> kfifo_out... Improve the prototypes of kfifo_in and kfifo_out to make the kerneldoc annotations more readable. Add mini "howto porting to the new API" in kfifo.h Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	e64c026dd0	kfifo: cleanup namespace change name of __kfifo_* functions to kfifo_*, because the prefix __kfifo should be reserved for internal functions only. Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	c1e13f2567	kfifo: move out spinlock Move the pointer to the spinlock out of struct kfifo. Most users in tree do not actually use a spinlock, so the few exceptions now have to call kfifo_{get,put}_locked, which takes an extra argument to a spinlock. Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:56 -08:00
Stefani Seibold	4546548789	kfifo: move struct kfifo in place This is a new generic kernel FIFO implementation. The current kernel fifo API is not very widely used, because it has to many constrains. Only 17 files in the current 2.6.31-rc5 used it. FIFO's are like list's a very basic thing and a kfifo API which handles the most use case would save a lot of development time and memory resources. I think this are the reasons why kfifo is not in use: - The API is to simple, important functions are missing - A fifo can be only allocated dynamically - There is a requirement of a spinlock whether you need it or not - There is no support for data records inside a fifo So I decided to extend the kfifo in a more generic way without blowing up the API to much. The new API has the following benefits: - Generic usage: For kernel internal use and/or device driver. - Provide an API for the most use case. - Slim API: The whole API provides 25 functions. - Linux style habit. - DECLARE_KFIFO, DEFINE_KFIFO and INIT_KFIFO Macros - Direct copy_to_user from the fifo and copy_from_user into the fifo. - The kfifo itself is an in place member of the using data structure, this save an indirection access and does not waste the kernel allocator. - Lockless access: if only one reader and one writer is active on the fifo, which is the common use case, no additional locking is necessary. - Remove spinlock - give the user the freedom of choice what kind of locking to use if one is required. - Ability to handle records. Three type of records are supported: - Variable length records between 0-255 bytes, with a record size field of 1 bytes. - Variable length records between 0-65535 bytes, with a record size field of 2 bytes. - Fixed size records, which no record size field. - Preserve memory resource. - Performance! - Easy to use! This patch: Since most users want to have the kfifo as part of another object, reorganize the code to allow including struct kfifo in another data structure. This requires changing the kfifo_alloc and kfifo_init prototypes so that we pass an existing kfifo pointer into them. This patch changes the implementation and all existing users. [akpm@linux-foundation.org: fix warning] Signed-off-by: Stefani Seibold <stefani@seibold.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:17:55 -08:00
Linus Torvalds	83f57a11d8	Revert "time: Remove xtime_cache" This reverts commit `7bc7d63745`, as requested by John Stultz. Quoting John: "Petr Titěra reported an issue where he saw odd atime regressions with 2.6.33 where there were a full second worth of nanoseconds in the nanoseconds field. He also reviewed the time code and narrowed down the problem: unhandled overflow of the nanosecond field caused by rounding up the sub-nanosecond accumulated time. Details: * At the end of update_wall_time(), we currently round up the sub-nanosecond portion of accumulated time when storing it into xtime. This was added to avoid time inconsistencies caused when the sub-nanosecond portion was truncated when storing into xtime. Unfortunately we don't handle the possible second overflow caused by that rounding. * Previously the xtime_cache code hid this overflow by normalizing the xtime value when storing into the xtime_cache. * We could try to handle the second overflow after the rounding up, but since this affects the timekeeping's internal state, this would further complicate the next accumulation cycle, causing small errors in ntp steering. As much as I'd like to get rid of it, the xtime_cache code is known to work. * The correct fix is really to include the sub-nanosecond portion in the timekeeping accessor function, so we don't need to round up at during accumulation. This would greatly simplify the accumulation code. Unfortunately, we can't do this safely until the last three non-GENERIC_TIME arches (sparc32, arm, cris) are converted (those patches are in -mm) and we kill off the spots where arches set xtime directly. This is all 2.6.34 material, so I think reverting the xtime_cache change is the best approach for now. Many thanks to Petr for both reporting and finding the issue!" Reported-by: Petr Titěra <P.Titera@century.cz> Requested-by: john stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-22 14:10:37 -08:00
Al Viro	5300990c03	Sanitize f_flags helpers * pull ACC_MODE to fs.h; we have several copies all over the place * nightmarish expression calculating f_mode by f_flags deserves a helper too (OPEN_FMODE(flags)) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-12-22 12:27:34 -05:00
Roland Dreier	628ff7c1d8	anonfd: Allow making anon files read-only It seems a couple places such as arch/ia64/kernel/perfmon.c and drivers/infiniband/core/uverbs_main.c could use anon_inode_getfile() instead of a private pseudo-fs + alloc_file(), if only there were a way to get a read-only file. So provide this by having anon_inode_getfile() create a read-only file if we pass O_RDONLY in flags. Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2009-12-22 12:27:34 -05:00
Steven Rostedt	c757bea93b	tracing: Fix setting tracer specific options The function __set_tracer_option() takes as its last parameter a "neg" value. If set it should negate the value of the option. The trace_options_write() passed the value written to the file which is what the new value needs to be set as. But since this is not the negative, it never sets the value. Reported-by: Peter Zijlstra <peterz@infradead.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-21 22:35:16 -05:00
Dominik Brodowski	0e2c8b8f55	resources: fix call to alignf() in allocate_resource() The second parameter to alignf() in allocate_resource() must reflect what new resource is attempted to be allocated, else functions like pcibios_align_resource() (at least on x86) or pcmcia_align() can't work correctly. Commit `1e5ad96790` broke this by setting the "new" resource until we're about to return success. To keep the resource untouched when allocate_resource() fails, a "tmp" resource is introduced. Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-21 10:42:29 -08:00
Peter Zijlstra	70f1120527	sched: Fix hotplug hang The hot-unplug kstopmachine usage does a wakeup after deactivating the cpu, hence we cannot use cpu_active() here but must rely on the good olde online. Reported-by: Sachin Sant <sachinp@in.ibm.com> Reported-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Jens Axboe <jens.axboe@oracle.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> LKML-Reference: <1261326987.4314.24.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-20 23:31:23 +01:00
Peter Zijlstra	3df0fc5b2e	sched: Restore printk sanity Revert the braindead pr_* crap. (Commit `663997d` "sched: Use pr_fmt() and pr_<level>()") It's dumb and causes stupid "sched: " strings all over the place. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Joe Perches <joe@perches.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1261315437.4314.6.camel@laptop> [ i dont mind the pr_*() patterns that much - but Peter dislikes them with a vengence. ] [ - v2: remove spurious diffstat from changelog :-/ ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-20 19:05:02 +01:00
Linus Torvalds	eca9dfcd00	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf session: Make events_stats u64 to avoid overflow on 32-bit arches hw-breakpoints: Fix hardware breakpoints -> perf events dependency perf events: Dont report side-band events on each cpu for per-task-per-cpu events perf events, x86/stacktrace: Fix performance/softlockup by providing a special frame pointer-only stack walker perf events, x86/stacktrace: Make stack walking optional perf events: Remove unused perf_counter.h header file perf probe: Check new event name kprobe-tracer: Check new event/group name perf probe: Check whether debugfs path is correct perf probe: Fix libdwarf include path for Debian	2009-12-19 09:48:42 -08:00
Linus Torvalds	aac3d39693	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits) sched: Fix broken assertion sched: Assert task state bits at build time sched: Update task_state_arraypwith new states sched: Add missing state chars to TASK_STATE_TO_CHAR_STR sched: Move TASK_STATE_TO_CHAR_STR near the TASK_state bits sched: Teach might_sleep() about preemptible RCU sched: Make warning less noisy sched: Simplify set_task_cpu() sched: Remove the cfs_rq dependency from set_task_cpu() sched: Add pre and post wakeup hooks sched: Move kthread_bind() back to kthread.c sched: Fix select_task_rq() vs hotplug issues sched: Fix sched_exec() balancing sched: Ensure set_task_cpu() is never called on blocked tasks sched: Use TASK_WAKING for fork wakups sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE sched: Fix task_hot() test order sched: Fix set_cpu_active() in cpu_down() sched: Mark boot-cpu active before smp_init() sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK ...	2009-12-19 09:47:49 -08:00
Linus Torvalds	10e5453ffa	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sys: Fix missing rcu protection for __task_cred() access signals: Fix more rcu assumptions signal: Fix racy access to __task_cred in kill_pid_info_as_uid()	2009-12-19 09:47:34 -08:00
Linus Torvalds	3cd312c3e8	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers: Remove duplicate setting of new_base in __mod_timer() clockevents: Prevent clockevent_devices list corruption on cpu hotplug	2009-12-19 09:47:18 -08:00
Al Viro	b4c30aad39	fix more leaks in audit_tree.c tag_chunk() Several leaks in audit_tree didn't get caught by commit `318b6d3d7d`, including the leak on normal exit in case of multiple rules refering to the same chunk. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-19 09:27:43 -08:00
Al Viro	6f5d511489	fix braindamage in audit_tree.c untag_chunk() ... aka "Al had badly fscked up when writing that thing and nobody noticed until Eric had fixed leaks that used to mask the breakage". The function essentially creates a copy of old array sans one element and replaces the references to elements of original (they are on cyclic lists) with those to corresponding elements of new one. After that the old one is fair game for freeing. First of all, there's a dumb braino: when we get to list_replace_init we use indices for wrong arrays - position in new one with the old array and vice versa. Another bug is more subtle - termination condition is wrong if the element to be excluded happens to be the last one. We shouldn't go until we fill the new array, we should go until we'd finished the old one. Otherwise the element we are trying to kill will remain on the cyclic lists... That crap used to be masked by several leaks, so it was not quite trivial to hit. Eric had fixed some of those leaks a while ago and the shit had hit the fan... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-19 09:27:43 -08:00
Linus Torvalds	55db493b65	Merge branch 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'cpumask-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: cpumask: rename tsk_cpumask to tsk_cpus_allowed cpumask: don't recommend set_cpus_allowed hack in Documentation/cpu-hotplug.txt cpumask: avoid dereferencing struct cpumask cpumask: convert drivers/idle/i7300_idle.c to cpumask_var_t cpumask: use modern cpumask style in drivers/scsi/fcoe/fcoe.c cpumask: avoid deprecated function in mm/slab.c cpumask: use cpu_online in kernel/perf_event.c	2009-12-17 17:00:20 -08:00
Linus Torvalds	efc8e7f4c8	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support NOMMU: Optimise away the {dac_,}mmap_min_addr tests security/min_addr.c: make init_mmap_min_addr() static keys: PTR_ERR return of wrong pointer in keyctl_get_security()	2009-12-17 16:58:26 -08:00
Linus Torvalds	dcc7cd0112	Merge branch 'kmemleak' of git://linux-arm.org/linux-2.6 * 'kmemleak' of git://linux-arm.org/linux-2.6: kmemleak: fix kconfig for crc32 build error kmemleak: Reduce the false positives by checking for modified objects kmemleak: Show the age of an unreferenced object kmemleak: Release the object lock before calling put_object() kmemleak: Scan the _ftrace_events section in modules kmemleak: Simplify the kmemleak_scan_area() function prototype kmemleak: Do not use off-slab management with SLAB_NOLEAKTRACE	2009-12-17 16:00:19 -08:00
Randy Dunlap	6485536bcf	printk: fix new kernel-doc warnings Fix kernel-doc warnings in printk.c: Warning(kernel/printk.c:1422): No description found for parameter 'dumper' Warning(kernel/printk.c:1422): Excess function parameter 'dump' description in 'kmsg_dump_register' Warning(kernel/printk.c:1451): No description found for parameter 'dumper' Warning(kernel/printk.c:1451): Excess function parameter 'dump' description in 'kmsg_dump_unregister' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-17 15:45:32 -08:00
Oleg Nesterov	9cd80bbb07	do_wait() optimization: do not place sub-threads on task_struct->children list Thanks to Roland who pointed out de_thread() issues. Currently we add sub-threads to ->real_parent->children list. This buys nothing but slows down do_wait(). With this patch ->children contains only main threads (group leaders). The only complication is that forget_original_parent() should iterate over sub-threads by hand, and de_thread() needs another list_replace() when it changes ->group_leader. Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD tasks, we can remove this check (and we can unify do_wait_thread() and ptrace_do_wait()). This change can confuse the optimistic search in mm_update_next_owner(), but this is fixable and minor. Perhaps badness() and oom_kill_process() should be updated, but they should be fixed in any case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-17 15:45:31 -08:00
WANG Cong	3e26120cc7	kernel/sysctl.c: fix the incomplete part of sysctl_max_map_count-should-be-non-negative.patch It is a mistake that we used 'proc_dointvec', it should be 'proc_dointvec_minmax', as in the original patch. Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-17 15:45:30 -08:00
Linus Torvalds	5a865c0606	Merge branch 'for-33' of git://repo.or.cz/linux-kbuild * 'for-33' of git://repo.or.cz/linux-kbuild: (29 commits) net: fix for utsrelease.h moving to generated gen_init_cpio: fixed fwrite warning kbuild: fix make clean after mismerge kbuild: generate modules.builtin genksyms: properly consider EXPORT_UNUSED_SYMBOL{,_GPL}() score: add asm/asm-offsets.h wrapper unifdef: update to upstream revision 1.190 kbuild: specify absolute paths for cscope kbuild: create include/generated in silentoldconfig scripts/package: deb-pkg: use fakeroot if available scripts/package: add KBUILD_PKG_ROOTCMD variable scripts/package: tar-pkg: use tar --owner=root Kbuild: clean up marker net: add net_tstamp.h to headers_install kbuild: move utsrelease.h to include/generated kbuild: move autoconf.h to include/generated drop explicit include of autoconf.h kbuild: move compile.h to include/generated kbuild: drop include/asm kbuild: do not check for include/asm-$ARCH ... Fixed non-conflicting clean merge of modpost.c as per comments from Stephen Rothwell (modpost.c had grown an include of linux/autoconf.h that needed to be changed to generated/autoconf.h)	2009-12-17 07:23:42 -08:00
Peter Zijlstra	077614ee1e	sched: Fix broken assertion There's a preemption race in the set_task_cpu() debug check in that when we get preempted after setting task->state we'd still be on the rq proper, but fail the test. Check for preempted tasks, since those are always on the RQ. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20091217121830.137155561@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 13:22:46 +01:00
Peter Zijlstra	5d27c23df0	perf events: Dont report side-band events on each cpu for per-task-per-cpu events Acme noticed that his FORK/MMAP numbers were inflated by about the same factor as his cpu-count. This led to the discovery of a few more sites that need to respect the event->cpu filter. Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091217121830.215333434@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 13:21:36 +01:00
Frederic Weisbecker	61c1917f47	perf events, x86/stacktrace: Make stack walking optional The current print_context_stack helper that does the stack walking job is good for usual stacktraces as it walks through all the stack and reports even addresses that look unreliable, which is nice when we don't have frame pointers for example. But we have users like perf that only require reliable stacktraces, and those may want a more adapted stack walker, so lets make this function a callback in stacktrace_ops that users can tune for their needs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1261024834-5336-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 09:56:19 +01:00
Frederic Weisbecker	234da7bcdc	sched: Teach might_sleep() about preemptible RCU In practice, it is harmless to voluntarily sleep in a rcu_read_lock() section if we are running under preempt rcu, but it is illegal if we build a kernel running non-preemptable rcu. Currently, might_sleep() doesn't notice sleepable operations under rcu_read_lock() sections if we are running under preemptable rcu because preempt_count() is left untouched after rcu_read_lock() in this case. But we want developers who test their changes under such config to notice the "sleeping while atomic" issues. So we add rcu_read_lock_nesting to prempt_count() in might_sleep() checks. [ v2: Handle rcu-tiny ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1260991265-8451-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 09:46:44 +01:00
Masami Hiramatsu	6f3cf44047	kprobe-tracer: Check new event/group name Check new event/group name is same syntax as a C symbol. In other words, checking the name is as like as other tracepoint events. This can prevent user to create an event with useless name (e.g. foo\|bar, foo*bar). Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> LKML-Reference: <20091216222408.14459.68790.stgit@dhcp-100-2-132.bos.redhat.com> [ v2: minor cleanups ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 09:42:44 +01:00
Ingo Molnar	416eb39556	sched: Make warning less noisy Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.807938893@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-17 06:05:49 +01:00
Rusty Russell	62ac127950	cpumask: avoid dereferencing struct cpumask struct cpumask will be undefined soon with CONFIG_CPUMASK_OFFSTACK=y, to avoid them being declared on the stack. cpumask_bits() does what we want here (of course, this code is crap). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> To: Thomas Gleixner <tglx@linutronix.de>	2009-12-17 11:43:29 +10:30
Rusty Russell	f6325e30eb	cpumask: use cpu_online in kernel/perf_event.c Also, we want to check against nr_cpu_ids, not num_possible_cpus(). The latter works, but the correct bounds check is < nr_cpu_ids. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> To: Thomas Gleixner <tglx@linutronix.de>	2009-12-17 11:43:11 +10:30
Simon Horman	cf1e367ee8	timers: Remove duplicate setting of new_base in __mod_timer() new_base is set using per_cpu(tvec_bases, cpu) after selecting the desired value of cpu immediately below so this line is a unnecessary. Signed-off-by: Simon Horman <horms@verge.net.au> LKML-Reference: <20091217001542.GD25317@verge.net.au> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-12-17 01:30:49 +01:00
David Howells	6e14154676	NOMMU: Optimise away the {dac_,}mmap_min_addr tests In NOMMU mode clamp dac_mmap_min_addr to zero to cause the tests on it to be skipped by the compiler. We do this as the minimum mmap address doesn't make any sense in NOMMU mode. mmap_min_addr and round_hint_to_min() can be discarded entirely in NOMMU mode. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-12-17 09:25:19 +11:00
Andi Kleen	61cf693159	[sysctl] Fix breakage on systems with older glibc As predicted during code review, the sysctl(2) changes made systems with old glibc nearly unusable. About every command gives a: warning: process `ls' used the deprecated sysctl system call with 1.4 warning in the log. I see this on a SUSE 10.0 system with glibc 2.3.5. Don't warn for this common case. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 12:36:18 -08:00
Linus Torvalds	8aedf8a6ae	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (52 commits) perf record: Use per-task-per-cpu events for inherited events perf record: Properly synchronize child creation perf events: Allow per-task-per-cpu counters perf diff: Percent calcs should use double values perf diff: Change the default sort order to "dso,symbol" perf diff: Use perf_session__fprintf_hists just like 'perf record' perf report: Fix cut'n'paste error recently introduced perf session: Move perf report specific hits out of perf_session__fprintf_hists perf tools: Move hist entries printing routines from perf report perf report: Generalize perf_session__fprintf_hists() perf symbols: Move symbol filtering to event__preprocess_sample() perf symbols: Adopt the strlists for dso, comm perf symbols: Make symbol_conf global perf probe: Fix to show which probe point is not found perf probe: Check symbols in symtab/kallsyms perf probe: Check build-id of vmlinux perf probe: Reject second attempt of adding same-name event perf probe: Support event name for --add option perf probe: Add glob matching support on --del perf probe: Use strlist__for_each macros in probe-event.c ...	2009-12-16 12:32:47 -08:00
Linus Torvalds	da184a8064	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix return of trace_dump_stack() ksym_tracer: Fix bad cast tracing/power: Remove two exports tracing: Change event->profile_count to be int type tracing: Simplify trace_option_write() tracing: Remove useless trace option tracing: Use seq file for trace_clock tracing: Use seq file for trace_options function-graph: Allow writing the same val to set_graph_function ftrace: Call trace_parser_clear() properly ftrace: Return EINVAL when writing invalid val to set_ftrace_filter tracing: Move a printk out of ftrace_raw_reg_event_foo() tracing: Pull up calls to trace_define_common_fields() tracing: Extract duplicate ftrace_raw_init_event_foo() ftrace.h: Use common pr_info fmt string tracing: Add stack trace to irqsoff tracer tracing: Add trace_dump_stack() ring-buffer: Move resize integrity check under reader lock ring-buffer: Use sync sched protection on ring buffer resizing tracing: Fix wrong usage of strstrip in trace_ksyms	2009-12-16 12:02:25 -08:00
Linus Torvalds	74f3ae7434	Merge branch 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'module' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: modpost: fix segfault with short symbol names module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y Kbuild: clear marker out of modpost module: make MODULE_SYMBOL_PREFIX into a CONFIG option ARM: unexport symbols used to implement floating point emulation ARM: use unified discard definition in linker script x86: don't export inline function sparc64: don't export static inline pci_ functions	2009-12-16 10:47:24 -08:00
Linus Torvalds	60d9aa758c	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: (90 commits) jffs2: Fix long-standing bug with symlink garbage collection. mtd: OneNAND: Fix test of unsigned in onenand_otp_walk() mtd: cfi_cmdset_0002, fix lock imbalance Revert "mtd: move mxcnd_remove to .exit.text" mtd: m25p80: add support for Macronix MX25L4005A kmsg_dump: fix build for CONFIG_PRINTK=n mtd: nandsim: add support for 4KiB pages mtd: mtdoops: refactor as a kmsg_dumper mtd: mtdoops: make record size configurable mtd: mtdoops: limit the maximum mtd partition size mtd: mtdoops: keep track of used/unused pages in an array mtd: mtdoops: several minor cleanups core: Add kernel message dumper to call on oopses and panics mtd: add ARM pismo support mtd: pxa3xx_nand: Fix PIO data transfer mtd: nand: fix multi-chip suspend problem mtd: add support for switching old SST chips into QRY mode mtd: fix M29W800D dev_id and uaddr mtd: don't use PF_MEMALLOC mtd: Add bad block table overrides to Davinci NAND driver ... Fixed up conflicts (mostly trivial) in drivers/mtd/devices/m25p80.c drivers/mtd/maps/pcmciamtd.c drivers/mtd/nand/pxa3xx_nand.c kernel/printk.c	2009-12-16 10:23:43 -08:00
Peter Zijlstra	738d2be430	sched: Simplify set_task_cpu() Rearrange code a bit now that its a simpler function. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.269101883@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:59 +01:00
Peter Zijlstra	88ec22d3ed	sched: Remove the cfs_rq dependency from set_task_cpu() In order to remove the cfs_rq dependency from set_task_cpu() we need to ensure the task is cfs_rq invariant for all callsites. The simple approach is to substract cfs_rq->min_vruntime from se->vruntime on dequeue, and add cfs_rq->min_vruntime on enqueue. However, this has the downside of breaking FAIR_SLEEPERS since we loose the old vruntime as we only maintain the relative position. To solve this, we observe that we only migrate runnable tasks, we do this using deactivate_task(.sleep=0) and activate_task(.wakeup=0), therefore we can restrain the min_vruntime invariance to that state. The only other case is wakeup balancing, since we want to maintain the old vruntime we cannot make it relative on dequeue, but since we don't migrate inactive tasks, we can do so right before we activate it again. This is where we need the new pre-wakeup hook, we need to call this while still holding the old rq->lock. We could fold it into ->select_task_rq(), but since that has multiple callsites and would obfuscate the locking requirements, that seems like a fudge. This leaves the fork() case, simply make sure that ->task_fork() leaves the ->vruntime in a relative state. This covers all cases where set_task_cpu() gets called, and ensures it sees a relative vruntime. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.191697025@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:58 +01:00
Peter Zijlstra	efbbd05a59	sched: Add pre and post wakeup hooks As will be apparent in the next patch, we need a pre wakeup hook for sched_fair task migration, hence rename the post wakeup hook and one pre wakeup. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.114746117@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:58 +01:00
Peter Zijlstra	881232b70b	sched: Move kthread_bind() back to kthread.c Since kthread_bind() lost its dependencies on sched.c, move it back where it came from. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170518.039524041@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:57 +01:00
Peter Zijlstra	5da9a0fb67	sched: Fix select_task_rq() vs hotplug issues Since select_task_rq() is now responsible for guaranteeing ->cpus_allowed and cpu_active_mask, we need to verify this. select_task_rq_rt() can blindly return smp_processor_id()/task_cpu() without checking the valid masks, select_task_rq_fair() can do the same in the rare case that all SD_flags are disabled. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.961475466@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:57 +01:00
Peter Zijlstra	3802290628	sched: Fix sched_exec() balancing Since we access ->cpus_allowed without holding rq->lock we need a retry loop to validate the result, this comes for near free when we merge sched_migrate_task() into sched_exec() since that already does the needed check. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.884743662@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:56 +01:00
Peter Zijlstra	e2912009fb	sched: Ensure set_task_cpu() is never called on blocked tasks In order to clean up the set_task_cpu() rq dependencies we need to ensure it is never called on blocked tasks because such usage does not pair with consistent rq->lock usage. This puts the migration burden on ttwu(). Furthermore we need to close a race against changing ->cpus_allowed, since select_task_rq() runs with only preemption disabled. For sched_fork() this is safe because the child isn't in the tasklist yet, for wakeup we fix this by synchronizing set_cpus_allowed_ptr() against TASK_WAKING, which leaves sched_exec to be a problem This also closes a hole in (`6ad4c1888` sched: Fix balance vs hotplug race) where ->select_task_rq() doesn't validate the result against the sched_domain/root_domain. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.807938893@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:56 +01:00
Peter Zijlstra	06b83b5fbe	sched: Use TASK_WAKING for fork wakups For later convenience use TASK_WAKING for fresh tasks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.732561278@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:55 +01:00
Peter Zijlstra	e4f4288842	sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE We should skip !SD_LOAD_BALANCE domains. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.653578430@chello.nl> CC: stable@kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:55 +01:00
Peter Zijlstra	e6c8fba777	sched: Fix task_hot() test order Make sure not to access sched_fair fields before verifying it is indeed a sched_fair task. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> CC: stable@kernel.org LKML-Reference: <20091216170517.577998058@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:54 +01:00
Xiaotian Feng	9ee349ad6d	sched: Fix set_cpu_active() in cpu_down() Sachin found cpu hotplug test failures on powerpc, which made the kernel hang on his POWER box. The problem is that we fail to re-activate a cpu when a hot-unplug fails. Fix this by moving the de-activation into _cpu_down after doing the initial checks. Remove the synchronize_sched() calls and rely on those implied by rebuilding the sched domains using the new mask. Reported-by: Sachin Sant <sachinp@in.ibm.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Tested-by: Sachin Sant <sachinp@in.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091216170517.500272612@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 19:01:53 +01:00
Ingo Molnar	ee1156c11a	Merge branch 'linus' into sched/urgent Conflicts: kernel/sched_idletask.c Merge reason: resolve the conflicts, pick up latest changes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 18:33:49 +01:00
Peter Zijlstra	f4c4176f21	perf events: Allow per-task-per-cpu counters In order to allow for per-task-per-cpu counters, useful for scalability when profiling task hierarchies, we allow installing events with event->cpu != -1 in task contexts. __perf_event_sched_in() already skips events where ->cpu mis-matches the current cpu, fix up __perf_install_in_context() and __perf_event_enable() to also respect this filter. This does lead to vary hard to interpret enabled/running times for such counters, but I don't see a simple solution for that. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: fweisbec@gmail.com Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091216165904.831451147@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-16 18:30:11 +01:00
Amerigo Wang	06a7f71124	kexec: premit reduction of the reserved memory size Implement shrinking the reserved memory for crash kernel, if it is more than enough. For example, if you have already reserved 128M, now you just want 100M, you can do: # echo $((10010241024)) > /sys/kernel/kexec_crash_size Note, you can only do this before loading the crash kernel. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Neil Horman <nhorman@redhat.com> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:13 -08:00
André Goddard Rosa	417e315247	pid: reduce code size by using a pointer to iterate over array It decreases code size by 16 bytes on my gcc 4.4.1 on Core 2: text data bss dec hex filename 4314 2216 8 6538 198a kernel/pid.o-BEFORE 4298 2216 8 6522 197a kernel/pid.o-AFTER Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:12 -08:00
André Goddard Rosa	7be6d991bc	pid: tighten pidmap spinlock critical section by removing kfree() Avoid calling kfree() under pidmap spinlock, calling it afterwards. Normally kfree() is fast, but sometimes it can be slow, so avoid calling it under the spinlock if we can do it. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:12 -08:00
Oleg Nesterov	1be53963b0	signals: check ->group_stop_count after tracehook_get_signal() Move the call to do_signal_stop() down, after tracehook call. This makes ->group_stop_count condition visible to tracers before do_signal_stop() will participate in this group-stop. Currently the patch has no effect, tracehook_get_signal() always returns 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:09 -08:00
Oleg Nesterov	ad09750b51	signals: kill force_sig_specific() Kill force_sig_specific(), this trivial wrapper has no callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:09 -08:00
Oleg Nesterov	7486e5d9fc	signals: cosmetic, collect_signal: use SI_USER Trivial, s/0/SI_USER/ in collect_signal() for grep. This is a bit confusing, we don't know the source of this signal. But we don't care, and "info->si_code = 0" is imho worse. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:09 -08:00
Oleg Nesterov	dd34200adc	signals: send_signal: use si_fromuser() to detect from_ancestor_ns Change send_signal() to use si_fromuser(). From now SEND_SIG_NOINFO triggers the "from_ancestor_ns" check. This fixes reparent_thread()->group_send_sig_info(pdeath_signal) behaviour, before this patch send_signal() does not detect the cross-namespace case when the child of the dying parent belongs to the sub-namespace. This patch can affect the behaviour of send_sig(), kill_pgrp() and kill_pid() when the caller sends the signal to the sub-namespace with "priv == 0" but surprisingly all callers seem to use them correctly, including disassociate_ctty(on_exit). Except: drivers/staging/comedi/drivers/addi-data/*.c incorrectly use send_sig(priv => 0). But his is minor and should be fixed anyway. Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:09 -08:00
Oleg Nesterov	614c517d7c	signals: SEND_SIG_NOINFO should be considered as SI_FROMUSER() No changes in compiled code. The patch adds the new helper, si_fromuser() and changes check_kill_permission() to use this helper. The real effect of this patch is that from now we "officially" consider SEND_SIG_NOINFO signal as "from user-space" signals. This is already true if we look at the code which uses SEND_SIG_NOINFO, except __send_signal() has another opinion - see the next patch. The naming of these special SEND_SIG_XXX siginfo's is really bad imho. From __send_signal()'s pov they mean SEND_SIG_NOINFO from user SEND_SIG_PRIV from kernel SEND_SIG_FORCED no info Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:08 -08:00
Oleg Nesterov	6580807da1	ptrace: copy_process() should disable stepping If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child starts with TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from ptraced parent. This is not right, especially when the new child is not auto-attaced: in this case it is killed by SIGTRAP. Change copy_process() to call user_disable_single_step(). Tested on x86. Test-case: #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/ptrace.h> #include <sys/wait.h> #include <assert.h> int main(void) { int pid, status; if (!(pid = fork())) { assert(ptrace(PTRACE_TRACEME) == 0); kill(getpid(), SIGSTOP); if (!fork()) { /* kernel bug: this child will be killed by SIGTRAP */ printf("Hello world\n"); return 43; } wait(&status); return WEXITSTATUS(status); } for (;;) { assert(pid == wait(&status)); if (WIFEXITED(status)) break; assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0); } assert(WEXITSTATUS(status) == 43); return 0; } Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:08 -08:00
KAMEZAWA Hiroyuki	569b846df5	memcg: coalesce uncharge during unmap/truncate In massive parallel enviroment, res_counter can be a performance bottleneck. One strong techinque to reduce lock contention is reducing calls by coalescing some amount of calls into one. Considering charge/uncharge chatacteristic, - charge is done one by one via demand-paging. - uncharge is done by - in chunk at munmap, truncate, exit, execve... - one by one via vmscan/paging. It seems we have a chance to coalesce uncharges for improving scalability at unmap/truncation. This patch is a for coalescing uncharge. For avoiding scattering memcg's structure to functions under /mm, this patch adds memcg batch uncharge information to the task. A reason for per-task batching is for making use of caller's context information. We do batched uncharge (deleyed uncharge) when truncation/unmap occurs but do direct uncharge when uncharge is called by memory reclaim (vmscan.c). The degree of coalescing depends on callers - at invalidate/trucate... pagevec size - at unmap ....ZAP_BLOCK_SIZE (memory itself will be freed in this degree.) Then, we'll not coalescing too much. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 miss/faults [child with this patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 miss/faults We can see some amounts of improvement. (root cgroup doesn't affected by this patch) Another patch for "charge" will follow this and above will be improved more. Changelog(since 2009/10/02): - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes) - some clean up and commentary/description updates. - added initialize code to copy_process(). (possible bug fix) Changelog(old): - fixed !CONFIG_MEM_CGROUP case. - rebased onto the latest mmotm + softlimit fix patches. - unified patch for callers - added commetns. - make ->do_batch as bool. - removed css_get() at el. We don't need it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:07 -08:00
Alexey Dobriyan	28dfef8feb	const: constify remaining pipe_buf_operations Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:20:05 -08:00
Barry Song	f065f41f48	timecompare: fix half-Y2K38 problem in timecompare_update while calculating offset ktime will overflow from 03:14:07 UTC on Tuesday, 19 January 2038, ktime_add() in timecompare_update() will overflow a half earlier. As a result, wrong offset will be gotten, then cause some strange problems. Signed-off-by: Barry Song <21cnbao@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Patrick Ohly <patrick.ohly@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-16 07:19:57 -08:00
Peter Zijlstra	f13c12c634	perf_events: Fix perf_event_attr layout The miss-alignment of bp_addr created a 32bit hole, causing different structure packings on 32 and 64 bit machines. Fix that by moving __reserve_2 into that hole. Further, remove the useless struct and redundant __bp_reserve muck. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <1260902591.8023.781.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-15 20:12:20 +01:00
Linus Torvalds	8f0ddf91f2	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits) clockevents: Convert to raw_spinlock clockevents: Make tick_device_lock static debugobjects: Convert to raw_spinlocks perf_event: Convert to raw_spinlock hrtimers: Convert to raw_spinlocks genirq: Convert irq_desc.lock to raw_spinlock smp: Convert smplocks to raw_spinlocks rtmutes: Convert rtmutex.lock to raw_spinlock sched: Convert pi_lock to raw_spinlock sched: Convert cpupri lock to raw_spinlock sched: Convert rt_runtime_lock to raw_spinlock sched: Convert rq->lock to raw_spinlock plist: Make plist debugging raw_spinlock aware bkl: Fixup core_lock fallout locking: Cleanup the name space completely locking: Further name space cleanups alpha: Fix fallout from locking changes locking: Implement new raw_spinlock locking: Convert raw_rwlock functions to arch_rwlock locking: Convert raw_rwlock to arch_rwlock ...	2009-12-15 09:02:01 -08:00
André Goddard Rosa	e7d2860b69	tree-wide: convert open calls to remove spaces to skip_spaces() lib function Makes use of skip_spaces() defined in lib/string.c for removing leading spaces from strings all over the tree. It decreases lib.a code size by 47 bytes and reuses the function tree-wide: text data bss dec hex filename 64688 584 592 65864 10148 (TOTALS-BEFORE) 64641 584 592 65817 10119 (TOTALS-AFTER) Also, while at it, if we see (str && isspace(str)), we can be sure to remove the first condition (str) as the second one (isspace(str)) also evaluates to 0 whenever str == 0, making it redundant. In other words, "a char equals zero is never a space". Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below, and found occurrences of this pattern on 3 more files: drivers/leds/led-class.c drivers/leds/ledtrig-timer.c drivers/video/output.c @@ expression str; @@ ( // ignore skip_spaces cases while (str && isspace(str)) { $str++;\\|++str;$ } \| - str && isspace(*str) ) Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Julia Lawall <julia@diku.dk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Neil Brown <neilb@suse.de> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: David Howells <dhowells@redhat.com> Cc: <linux-ext4@vger.kernel.org> Cc: Samuel Ortiz <samuel@sortiz.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:32 -08:00
Bernhard Walle	5ada918b82	vt: introduce and use vt_kmsg_redirect() function The kernel offers with TIOCL_GETKMSGREDIRECT ioctl() the possibility to redirect the kernel messages to a specific console. However, since it's not possible to switch to the kernel message console after a panic(), it would be nice if the kernel would print the panic message on the current console. This patch series adds a new interface to access the global kmsg_redirect variable by a function to be able to use it in code where CONFIG_VT_CONSOLE is not set (kernel/panic.c). This patch: Instead of using and exporting a global value kmsg_redirect, introduce a function vt_kmsg_redirect() that both can set and return the console where messages are printed. Change all users of kmsg_redirect (the VT code itself and kernel/power.c) to the new interface. The main advantage is that vt_kmsg_redirect() can also be used when CONFIG_VT_CONSOLE is not set. Signed-off-by: Bernhard Walle <bernhard@bwalle.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:28 -08:00
H Hartley Sweeten	dfc6a736d4	kernel/sys.c: fix "warning: do-while statement is not a compound statement" noise do_each_thread/while_each_thread wrap a block of code that is in this format: for (...) do ... while If curly braces do not surround the inner loop the following warning is generated by sparse: warning: do-while statement is not a compound statement Fix the warning by adding the braces. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:26 -08:00
Xiao Guangrong	c0f68c2fab	generic-ipi: cleanup for generic_smp_call_function_interrupt() Use smp_processor_id() instead of get_cpu() and put_cpu() in generic_smp_call_function_interrupt(), It's no need to disable preempt, because we must call generic_smp_call_function_interrupt() with interrupts disabled. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:25 -08:00
Amerigo Wang	70da2340fb	'sysctl_max_map_count' should be non-negative Jan Engelhardt reported we have this problem: setting max_map_count to a value large enough results in programs dying at first try. This is on 2.6.31.6: 15:59 borg:/proc/sys/vm # echo $[1<<31-1] >max_map_count 15:59 borg:/proc/sys/vm # cat max_map_count 1073741824 15:59 borg:/proc/sys/vm # echo $[1<<31] >max_map_count 15:59 borg:/proc/sys/vm # cat max_map_count Killed This is because we have a chance to make 'max_map_count' negative. but it's meaningless. Make it only accept non-negative values. Reported-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: James Morris <jmorris@namei.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:23 -08:00
Lee Schermerhorn	06808b0827	hugetlb: derive huge pages nodes allowed from task mempolicy This patch derives a "nodes_allowed" node mask from the numa mempolicy of the task modifying the number of persistent huge pages to control the allocation, freeing and adjusting of surplus huge pages when the pool page count is modified via the new sysctl or sysfs attribute "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows: * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer is produced. This will cause the hugetlb subsystem to use node_online_map as the "nodes_allowed". This preserves the behavior before this patch. * For "preferred" mempolicy, including explicit local allocation, a nodemask with the single preferred node will be produced. "local" policy will NOT track any internode migrations of the task adjusting nr_hugepages. * For "bind" and "interleave" policy, the mempolicy's nodemask will be used. * Other than to inform the construction of the nodes_allowed node mask, the actual mempolicy mode is ignored. That is, all modes behave like interleave over the resulting nodes_allowed mask with no "fallback". See the updated documentation [next patch] for more information about the implications of this patch. Examples: Starting with: Node 0 HugePages_Total: 0 Node 1 HugePages_Total: 0 Node 2 HugePages_Total: 0 Node 3 HugePages_Total: 0 Default behavior [with or without this patch] balances persistent hugepage allocation across nodes [with sufficient contiguous memory]: sysctl vm.nr_hugepages[_mempolicy]=32 yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 8 Node 3 HugePages_Total: 8 Of course, we only have nr_hugepages_mempolicy with the patch, but with default mempolicy, nr_hugepages_mempolicy behaves the same as nr_hugepages. Applying mempolicy--e.g., with numactl [using '-m' a.k.a. '--membind' because it allows multiple nodes to be specified and it's easy to type]--we can allocate huge pages on individual nodes or sets of nodes. So, starting from the condition above, with 8 huge pages per node, add 8 more to node 2 using: numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40 This yields: Node 0 HugePages_Total: 8 Node 1 HugePages_Total: 8 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The incremental 8 huge pages were restricted to node 2 by the specified mempolicy. Similarly, we can use mempolicy to free persistent huge pages from specified nodes: numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32 yields: Node 0 HugePages_Total: 4 Node 1 HugePages_Total: 4 Node 2 HugePages_Total: 16 Node 3 HugePages_Total: 8 The 8 huge pages freed were balanced over nodes 0 and 1. [rientjes@google.com: accomodate reworked NODEMASK_ALLOC] Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:12 -08:00
Alexey Dobriyan	4b731d50ff	bsdacct: fix uid/gid misreporting commit `d8e180dcd5` "bsdacct: switch credentials for writing to the accounting file" introduced credential switching during final acct data collecting. However, uid/gid pair continued to be collected from current which became credentials of who created acct file, not who exits. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14676 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Juho K. Juopperi <jkj@kapsi.fi> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: David Howells <dhowells@redhat.com> Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-15 08:53:10 -08:00
Paul Mackerras	0f624e7e56	perf_event: Fix incorrect range check on cpu number It is quite legitimate for CPUs to be numbered sparsely, meaning that it possible for an online CPU to have a number which is greater than the total count of possible CPUs. Currently find_get_context() has a sanity check on the cpu number where it checks it against num_possible_cpus(). This test can fail for a legitimate cpu number if the cpu_possible_mask is sparsely populated. This fixes the problem by checking the CPU number against nr_cpumask_bits instead, since that is the appropriate check to ensure that the cpu number is same to pass to cpu_isset() subsequently. Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Tested-by: Michael Neuling <mikey@neuling.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> LKML-Reference: <20091215084032.GA18661@brick.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-15 13:09:55 +01:00
David Miller	b9f8fcd55b	sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK Relax stable-sched-clock architectures to not save/disable/restore hardirqs in cpu_clock(). The background is that I was trying to resolve a sparc64 perf issue when I discovered this problem. On sparc64 I implement pseudo NMIs by simply running the kernel at IRQ level 14 when local_irq_disable() is called, this allows performance counter events to still come in at IRQ level 15. This doesn't work if any code in an NMI handler does local_irq_save() or local_irq_disable() since the "disable" will kick us back to cpu IRQ level 14 thus letting NMIs back in and we recurse. The only path which that does that in the perf event IRQ handling path is the code supporting frequency based events. It uses cpu_clock(). cpu_clock() simply invokes sched_clock() with IRQs disabled. And that's a fundamental bug all on it's own, particularly for the HAVE_UNSTABLE_SCHED_CLOCK case. NMIs can thus get into the sched_clock() code interrupting the local IRQ disable code sections of it. Furthermore, for the not-HAVE_UNSTABLE_SCHED_CLOCK case, the IRQ disabling done by cpu_clock() is just pure overhead and completely unnecessary. So the core problem is that sched_clock() is not NMI safe, but we are invoking it from NMI contexts in the perf events code (via cpu_clock()). A less important issue is the overhead of IRQ disabling when it isn't necessary in cpu_clock(). CONFIG_HAVE_UNSTABLE_SCHED_CLOCK architectures are not affected by this patch. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091213.182502.215092085.davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-15 09:04:36 +01:00
Steven Rostedt	e36c54582c	tracing: Fix return of trace_dump_stack() The trace_dump_stack() returned a value for a void function. Also, added the missing stub for trace_dump_stack() when tracing is not configured. Reported-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <20091214162713.GA31060@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-15 08:36:11 +01:00
Rusty Russell	d4703aefdb	module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y powerpc applies relocations to the kcrctab. They're absolute symbols, but it's not completely unreasonable: other archs may too, but the relocation is often 0. http://lists.ozlabs.org/pipermail/linuxppc-dev/2009-November/077972.html Inspired-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Tested-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Paul Mackerras <paulus@samba.org>	2009-12-15 16:28:34 +10:30
Thomas Gleixner	b5f91da0a6	clockevents: Convert to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:34 +01:00
Thomas Gleixner	d192c47f25	clockevents: Make tick_device_lock static Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:34 +01:00
Thomas Gleixner	e625cce1b7	perf_event: Convert to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:34 +01:00
Thomas Gleixner	ecb49d1a63	hrtimers: Convert to raw_spinlocks Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:34 +01:00
Thomas Gleixner	239007b844	genirq: Convert irq_desc.lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	9f5a5621e7	smp: Convert smplocks to raw_spinlocks Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	d209d74d52	rtmutes: Convert rtmutex.lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	1d61548254	sched: Convert pi_lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	fe841226bd	sched: Convert cpupri lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	0986b11b12	sched: Convert rt_runtime_lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	05fa785cf8	sched: Convert rq->lock to raw_spinlock Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	a26724591e	plist: Make plist debugging raw_spinlock aware plists are used with spinlocks and raw_spinlocks. Change the plist debugging to handle both types. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	9c1721aa49	locking: Cleanup the name space completely Make the name space hierarchy of locking functions consistent: raw_spin* -> _raw_spin* -> __raw_spin* No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	9828ea9d75	locking: Further name space cleanups The name space hierarchy for the internal lock functions is now a bit backwards. raw_spin* functions map to _spin* which use __spin, while we would like to have _raw_spin and __raw_spin. _raw_spin is already used by lock debugging, so rename those funtions to do_raw_spin* to free up the _raw_spin* name space. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:33 +01:00
Thomas Gleixner	c2f21ce2e3	locking: Implement new raw_spinlock Now that the raw_spin name space is freed up, we can implement raw_spinlock and the related functions which are used to annotate the locks which are not converted to sleeping spinlocks in preempt-rt. A side effect is that only such locks can be used with the low level lock fsunctions which circumvent lockdep. For !rt spin_* functions are mapped to the raw_spin* implementations. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:32 +01:00
Thomas Gleixner	0199c4e68d	locking: Convert __raw_spin* functions to arch_spin* Name space cleanup. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Thomas Gleixner	edc35bd72e	locking: Rename __RAW_SPIN_LOCK_UNLOCKED to __ARCH_SPIN_LOCK_UNLOCKED Further name space cleanup. No functional change Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Thomas Gleixner	445c89514b	locking: Convert raw_spinlock to arch_spinlock The raw_spin* namespace was taken by lockdep for the architecture specific implementations. raw_spin_* would be the ideal name space for the spinlocks which are not converted to sleeping locks in preempt-rt. Linus suggested to convert the raw_ to arch_ locks and cleanup the name space instead of using an artifical name like core_spin, atomic_spin or whatever No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: linux-arch@vger.kernel.org	2009-12-14 23:55:32 +01:00
Thomas Gleixner	b7b40ade58	locking: Reorder functions in spinlock.c Separate spin_lock and rw_lock functions. Preempt-RT needs to exclude the rw_lock functions from being compiled. The reordering allows to do that with a single #ifdef. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 23:55:32 +01:00
Linus Torvalds	d0316554d3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits) m68k: rename global variable vmalloc_end to m68k_vmalloc_end percpu: add missing per_cpu_ptr_to_phys() definition for UP percpu: Fix kdump failure if booted with percpu_alloc=page percpu: make misc percpu symbols unique percpu: make percpu symbols in ia64 unique percpu: make percpu symbols in powerpc unique percpu: make percpu symbols in x86 unique percpu: make percpu symbols in xen unique percpu: make percpu symbols in cpufreq unique percpu: make percpu symbols in oprofile unique percpu: make percpu symbols in tracer unique percpu: make percpu symbols under kernel/ and mm/ unique percpu: remove some sparse warnings percpu: make alloc_percpu() handle array types vmalloc: fix use of non-existent percpu variable in put_cpu_var() this_cpu: Use this_cpu_xx in trace_functions_graph.c this_cpu: Use this_cpu_xx for ftrace this_cpu: Use this_cpu_xx in nmi handling this_cpu: Use this_cpu operations in RCU this_cpu: Use this_cpu ops for VM statistics ... Fix up trivial (famous last words) global per-cpu naming conflicts in arch/x86/kvm/svm.c mm/slab.c	2009-12-14 09:58:24 -08:00
Ingo Molnar	0087aabd6a	Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent	2009-12-14 17:12:37 +01:00
Thomas Gleixner	1a551ae715	sched: Use rcu in sched_get_rr_param() read_lock(&tasklist_lock) does not protect sys_sched_get_rr_param() against a concurrent update of the policy or scheduler parameters as do_sched_scheduler() does not take the tasklist_lock. The access to task->sched_class->get_rr_interval is protected by task_rq_lock(task). Use rcu_read_lock() to protect find_task_by_vpid() and prevent the task struct from going away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091209100706.862897167@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 17:11:35 +01:00
Thomas Gleixner	23f5d14251	sched: Use rcu in sched_get/set_affinity() tasklist_lock is held read locked to protect the find_task_by_vpid() call and to prevent the task going away. sched_setaffinity acquires a task struct ref and drops tasklist lock right away. The access to the cpus_allowed mask is protected by rq->lock. rcu_read_lock() provides the same protection here. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091209100706.789059966@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 17:11:35 +01:00
Thomas Gleixner	5fe85be081	sched: Use rcu in sys_sched_getscheduler/sys_sched_getparam() read_lock(&tasklist_lock) does not protect sys_sched_getscheduler and sys_sched_getparam() against a concurrent update of the policy or scheduler parameters as do_sched_setscheduler() does not take the tasklist_lock. The accessed integers can be retrieved w/o locking and are snapshots anyway. Using rcu_read_lock() to protect find_task_by_vpid() and prevent the task struct from going away is not changing the above situation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091209100706.753790977@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 17:11:34 +01:00
Ingo Molnar	cc0104e877	Merge branch 'linus' into tracing/urgent Conflicts: kernel/trace/trace_kprobe.c Merge reason: resolve the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-14 09:16:49 +01:00
Li Zefan	16620e0f19	ksym_tracer: Fix bad cast Fix this warning: kernel/trace/trace_ksym.c: In function 'ksym_trace_filter_read': kernel/trace/trace_ksym.c:239: warning: cast to pointer from integer of different size Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: "K.Prasad" <prasad@linux.vnet.ibm.com> LKML-Reference: <4B1DC578.9020909@cn.fujitsu.com> [remove the strstrip fix as tglx already fixed that] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:46:54 +01:00
Li Zefan	472bbe02c9	tracing/power: Remove two exports trace_power_start and trace_power_end are used in arch/x86/kernel/power.c, and this file can't be compiled as a module, so these two tracepoints don't need to be exported. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC55F.7060305@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:28 +01:00
Li Zefan	e00bf2ec60	tracing: Change event->profile_count to be int type Like total_profile_count, struct ftrace_event_call::profile_count is protected by event_mutex, so it doesn't need to be atomic_t. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <4B1DC549.5010705@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:28 +01:00
Li Zefan	8d18eaaff5	tracing: Simplify trace_option_write() - remove duplicate code inside trace_options_write() - extract duplicate code in trace_options_write() and set_tracer_option() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC532.9010802@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:28 +01:00
Li Zefan	2cbafd68b8	tracing: Remove useless trace option Since commit `4d9493c90f` ("ftrace: remove add-hoc code"), option "sched-tree" has become useless. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC50A.7040402@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:27 +01:00
Li Zefan	13f16d2091	tracing: Use seq file for trace_clock The buffer for the output is as small as 64 bytes, so it'll overflow if we add more clock type. Use seq file instead. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC4FB.5030407@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:27 +01:00
Li Zefan	fdb372ed4c	tracing: Use seq file for trace_options Code simplification for reading trace_options. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-reference: <4B1DC4EF.3090106@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:27 +01:00
Li Zefan	91baf6285b	function-graph: Allow writing the same val to set_graph_function # echo 'do_open' > set_graph_function # echo 'do_open' >> set_graph_function bash: echo: write error: Invalid argument Make it valid to write the same value to set_graph_function, which is consistent with set_ftrace_filter interface. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-reference: <4B1DC4E1.1060303@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:26 +01:00
Li Zefan	313254a940	ftrace: Call trace_parser_clear() properly I found a weird behavior: # echo 'fuse:*' > set_ftrace_filter bash: echo: write error: Invalid argument # cat set_ftrace_filter fuse_dev_fasync fuse_dev_poll fuse_copy_do We should call trace_parser_clear() no matter ftrace_process_regex() returns 0 or -errno, otherwise we will actually take the unaccepted records from ftrace_regex_release(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC4D2.3000406@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:26 +01:00
Li Zefan	311d16da57	ftrace: Return EINVAL when writing invalid val to set_ftrace_filter Currently it doesn't warn user on invald value: # echo nonexist_symbol > set_ftrace_filter or: # echo 'nonexist_symbol:mod:fuse' > set_ftrace_filter Better make it return failure. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B1DC4BF.2070003@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:25 +01:00
Li Zefan	3b8e427381	tracing: Move a printk out of ftrace_raw_reg_event_foo() Move the printk from each ftrace_raw_reg_event_foo() to its caller ftrace_event_enable_disable(). This avoids each regfunc trace event callbacks to handle a same error report that can be carried from the caller. See how much space this saves: text data bss dec hex filename 5345151 1961864 7103260 14410275 dbe223 vmlinux.o.old 5331487 1961864 7103260 14396611 dbacc3 vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <4B1DC4AC.802@cn.fujitsu.com> [start cmdline record before calling regfunc to avoid lost window of pid to comm resolution] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:37:25 +01:00
Li Zefan	614a71a26b	tracing: Pull up calls to trace_define_common_fields() Call trace_define_common_fields() in event_create_dir() only. This avoids trace events to handle it from their define_fields callbacks and shrinks the kernel code size: text data bss dec hex filename 5346802 1961864 7103260 14411926 dbe896 vmlinux.o.old 5345151 1961864 7103260 14410275 dbe223 vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> LKML-Reference: <4B1DC49C.8000107@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:34:23 +01:00
Li Zefan	87d9b4e1c5	tracing: Extract duplicate ftrace_raw_init_event_foo() Use a generic trace_event_raw_init() function for all event's raw_init callbacks (but kprobes) instead of defining the same version for each of these. This shrinks the kernel code: text data bss dec hex filename 5355293 1961928 7103260 14420481 dc0a01 vmlinux.o.old 5346802 1961864 7103260 14411926 dbe896 vmlinux.o raw_init can't be removed, because ftrace events and kprobe events use different raw_init callbacks. Though it's possible to totally remove raw_init, I choose to leave it as it is for now. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> LKML-Reference: <4B1DC48C.7080603@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-12-13 18:34:23 +01:00
Joe Perches	663997d417	sched: Use pr_fmt() and pr_<level>() - Convert printk(KERN_<level> to pr_<level> (not KERN_DEBUG) - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt - Coalesce long format strings - Add missing \n to "ERROR: !SD_LOAD_BALANCE domain has parent" Signed-off-by: Joe Perches <joe@perches.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1260655047.2637.7.camel@Joe-Laptop.home> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-13 08:13:55 +01:00
Rafael J. Wysocki	7539a3b3d1	sched: Make wakeup side and atomic variants of completion API irq safe Alan Stern noticed that all the wakeup side (and atomic) variants of the completion APIs should be irq safe, but the newly introduced completion_done() and try_wait_for_completion() aren't. The use of the irq unsafe variants in IRQ contexts can cause crashes/hangs. Fix the problem by making them use spin_lock_irqsave() and spin_lock_irqrestore(). Reported-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: pm list <linux-pm@lists.linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Chinner <david@fromorbit.com> Cc: Lachlan McIlroy <lachlan@sgi.com> LKML-Reference: <200912130007.30541.rjw@sisk.pl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-13 08:12:46 +01:00
Linus Torvalds	702a7c7609	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (21 commits) sched: Remove forced2_migrations stats sched: Fix memory leak in two error corner cases sched: Fix build warning in get_update_sysctl_factor() sched: Update normalized values on user updates via proc sched: Make tunable scaling style configurable sched: Fix missing sched tunable recalculation on cpu add/remove sched: Fix task priority bug sched: cgroup: Implement different treatment for idle shares sched: Remove unnecessary RCU exclusion sched: Discard some old bits sched: Clean up check_preempt_wakeup() sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call sched: Sanitize fork() handling sched: Clean up ttwu() rq locking sched: Remove rq->clock coupling from set_task_cpu() sched: Consolidate select_task_rq() callers sched: Remove sysctl.sched_features sched: Protect sched_rr_get_param() access to task->sched_class sched: Protect task->cpus_allowed access in sched_getaffinity() sched: Fix balance vs hotplug race ... Fixed up conflicts in kernel/sysctl.c (due to sysctl cleanup)	2009-12-12 11:34:10 -08:00
Sam Ravnborg	273b281fa2	kbuild: move utsrelease.h to include/generated Fix up all users of utsrelease.h Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Michal Marek <mmarek@suse.cz>	2009-12-12 13:08:15 +01:00
Sam Ravnborg	01fc0ac198	kbuild: move bounds.h to include/generated Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Michal Marek <mmarek@suse.cz>	2009-12-12 13:08:14 +01:00
Linus Torvalds	3070f27d6e	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: itimer: Fix the itimer trace print format hrtimer: move timer stats helper functions to hrtimer.c hrtimer: Tune hrtimer_interrupt hang logic	2009-12-11 20:49:09 -08:00
Linus Torvalds	1e57c2186f	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep: Avoid out of bounds array reference in save_trace() futex: Take mmap_sem for get_user_pages in fault_in_user_writeable lockstat: Add usage info to Documentation/lockstat.txt lockstat: Fix min, max times in /proc/lock_stats	2009-12-11 20:48:21 -08:00
Linus Torvalds	df7147b3c3	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Remove comparing of NULL to va_list in trace_array_vprintk() tracing: Fix function graph trace_pipe to properly display failed entries tracing: Add full state to trace_seq tracing: Buffer the output of seq_file in case of filled buffer tracing: Only call pipe_close if pipe_close is defined tracing: Add pipe_close interface	2009-12-11 20:47:44 -08:00
Linus Torvalds	6f696eb17b	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (57 commits) x86, perf events: Check if we have APIC enabled perf_event: Fix variable initialization in other codepaths perf kmem: Fix unused argument build warning perf symbols: perf_header__read_build_ids() offset'n'size should be u64 perf symbols: dsos__read_build_ids() should read both user and kernel buildids perf tools: Align long options which have no short forms perf kmem: Show usage if no option is specified sched: Mark sched_clock() as notrace perf sched: Add max delay time snapshot perf tools: Correct size given to memset perf_event: Fix perf_swevent_hrtimer() variable initialization perf sched: Fix for getting task's execution time tracing/kprobes: Fix field creation's bad error handling perf_event: Cleanup for cpu_clock_perf_event_update() perf_event: Allocate children's perf_event_ctxp at the right time perf_event: Clean up __perf_event_init_context() hw-breakpoints: Modify breakpoints without unregistering them perf probe: Update perf-probe document perf probe: Support --del option trace-kprobe: Support delete probe syntax ...	2009-12-11 20:47:30 -08:00
Linus Torvalds	0f4974c439	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (58 commits) tty: split the lock up a bit further tty: Move the leader test in disassociate tty: Push the bkl down a bit in the hangup code tty: Push the lock down further into the ldisc code tty: push the BKL down into the handlers a bit tty: moxa: split open lock tty: moxa: Kill the use of lock_kernel tty: moxa: Fix modem op locking tty: moxa: Kill off the throttle method tty: moxa: Locking clean up tty: moxa: rework the locking a bit tty: moxa: Use more tty_port ops tty: isicom: fix deadlock on shutdown tty: mxser: Use the new locking rules to fix setserial properly tty: mxser: use the tty_port_open method tty: isicom: sort out the board init logic tty: isicom: switch to the new tty_port_open helper tty: tty_port: Add a kref object to the tty port tty: istallion: tty port open/close methods tty: stallion: Convert to the tty_port_open/close methods ...	2009-12-11 15:34:40 -08:00
Linus Torvalds	880188b243	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: kgdb: Always process the whole breakpoint list on activate or deactivate kgdb: continue and warn on signal passing from gdb kgdb,x86: do not set kgdb_single_step on x86 kgdb: allow for cpu switch when single stepping kgdb,i386: Fix corner case access to ss with NMI watch dog exception kgdb: Replace strstr() by strchr() for single-character needles kgdbts: Read buffer overflow kgdb: Read buffer overflow kgdb,x86: remove redundant test	2009-12-11 15:19:56 -08:00
Alan Cox	5ec93d1154	tty: Move the leader test in disassociate There are two call points, both want to check that tty->signal->leader is set. Move the test into disassociate_ctty() as that will make locking changes easier in a bit Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-12-11 15:18:08 -08:00
Linus Torvalds	11bd04f6f3	Merge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (109 commits) PCI: fix coding style issue in pci_save_state() PCI: add pci_request_acs PCI: fix BUG_ON triggered by logical PCIe root port removal PCI: remove ifdefed pci_cleanup_aer_correct_error_status PCI: unconditionally clear AER uncorr status register during cleanup x86/PCI: claim SR-IOV BARs in pcibios_allocate_resource PCI: portdrv: remove redundant definitions PCI: portdrv: remove unnecessary struct pcie_port_data PCI: portdrv: minor cleanup for pcie_port_device_register PCI: portdrv: add missing irq cleanup PCI: portdrv: enable device before irq initialization PCI: portdrv: cleanup service irqs initialization PCI: portdrv: check capabilities first PCI: portdrv: move PME capability check PCI: portdrv: remove redundant pcie type calculation PCI: portdrv: cleanup pcie_device registration PCI: portdrv: remove redundant pcie_port_device_probe PCI: Always set prefetchable base/limit upper32 registers PCI: read-modify-write the pcie device control register when initiating pcie flr PCI: show dma_mask bits in /sys ... Fixed up conflicts in: arch/x86/kernel/amd_iommu_init.c drivers/pci/dmar.c drivers/pci/hotplug/acpiphp_glue.c	2009-12-11 12:18:16 -08:00
Steven Rostedt	cc51a0fca6	tracing: Add stack trace to irqsoff tracer The irqsoff and friends tracers help in finding causes of latency in the kernel. The also work with the function tracer to show what was happening when interrupts or preemption are disabled. But the function tracer has a bit of an overhead and can cause exagerated readings. Currently, when tracing with /proc/sys/kernel/ftrace_enabled = 0, where the function tracer is disabled, the information that is provided can end up being useless. For example, a 2 and a half millisecond latency only showed: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 2.6.32 # -------------------------------------------------------------------- # latency: 2463 us, #4/4, CPU#2 \| (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) # ----------------- # \| task: -4242 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: _spin_lock_irqsave # => ended at: remove_wait_queue # # # _------=> CPU# # / _-----=> irqs-off # \| / _----=> need-resched # \|\| / _---=> hardirq/softirq # \|\|\| / _--=> preempt-depth # \|\|\|\| /_--=> lock-depth # \|\|\|\|\|/ delay # cmd pid \|\|\|\|\|\| time \| caller # \ / \|\|\|\|\|\| \ \| / hackbenc-4242 2d.... 0us!: trace_hardirqs_off <-_spin_lock_irqsave hackbenc-4242 2...1. 2463us+: _spin_unlock_irqrestore <-remove_wait_queue hackbenc-4242 2...1. 2466us : trace_preempt_on <-remove_wait_queue The above lets us know that hackbench with pid 2463 grabbed a spin lock somewhere and enabled preemption at remove_wait_queue. This helps a little but where this actually happened is not informative. This patch adds the stack dump to the end of the irqsoff tracer. This provides the following output: hackbenc-4242 2d.... 0us!: trace_hardirqs_off <-_spin_lock_irqsave hackbenc-4242 2...1. 2463us+: _spin_unlock_irqrestore <-remove_wait_queue hackbenc-4242 2...1. 2466us : trace_preempt_on <-remove_wait_queue hackbenc-4242 2...1. 2467us : <stack trace> => sub_preempt_count => _spin_unlock_irqrestore => remove_wait_queue => free_poll_entry => poll_freewait => do_sys_poll => sys_poll => system_call_fastpath Now we see that the culprit of this latency was the free_poll_entry code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-11 13:19:51 -05:00
Steven Rostedt	03889384ce	tracing: Add trace_dump_stack() I've been asked a few times about how to find out what is calling some location in the kernel. One way is to use dynamic function tracing and implement the func_stack_trace. But this only finds out who is calling a particular function. It does not tell you who is calling that function and entering a specific if conditional. I have myself implemented a quick version of trace_dump_stack() for this purpose a few times, and just needed it now. This is when I realized that this would be a good tool to have in the kernel like trace_printk(). Using trace_dump_stack() is similar to dump_stack() except that it writes to the trace buffer instead and can be used in critical locations. For example: @@ -5485,8 +5485,12 @@ need_resched_nonpreemptible: if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { if (unlikely(signal_pending_state(prev->state, prev))) prev->state = TASK_RUNNING; - else + else { deactivate_task(rq, prev, 1); + trace_printk("Deactivating task %s:%d\n", + prev->comm, prev->pid); + trace_dump_stack(); + } switch_count = &prev->nvcsw; } Produces: <...>-3249 [001] 296.105269: schedule: Deactivating task ntpd:3249 <...>-3249 [001] 296.105270: <stack trace> => schedule => schedule_hrtimeout_range => poll_schedule_timeout => do_select => core_sys_select => sys_select => system_call_fastpath Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-11 10:38:47 -05:00
Jason Wessel	7f8b7ed6f8	kgdb: Always process the whole breakpoint list on activate or deactivate This patch fixes 2 edge cases in using kgdb in conjunction with gdb. 1) kgdb_deactivate_sw_breakpoints() should process the entire array of breakpoints. The failure to do so results in breakpoints that you cannot remove, because a break point can only be removed if its state flag is set to BP_SET. The easy way to duplicate this problem is to plant a break point in a kernel module and then unload the kernel module. 2) kgdb_activate_sw_breakpoints() should process the entire array of breakpoints. The failure to do so results in missed breakpoints when a breakpoint cannot be activated. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-12-11 08:43:20 -06:00
Jason Wessel	d625e9c0d7	kgdb: continue and warn on signal passing from gdb On some architectures for the segv trap, gdb wants to pass the signal back on continue. For kgdb this is not the default behavior, because it can cause the kernel to crash if you arbitrarily pass back a exception outside of kgdb. Instead of causing instability, pass a message back to gdb about the supported kgdb signal passing and execute a standard kgdb continue operation. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-12-11 08:43:19 -06:00
Jason Wessel	028e7b1759	kgdb: allow for cpu switch when single stepping The kgdb core should not assume that a single step operation of a kernel thread will complete on the same CPU. The single step flag is set at the "thread" level and it is possible in a multi cpu system that a kernel thread can get scheduled on another cpu the next time it is run. As a further safety net in case a slave cpu is hung, the debug master cpu will try 100 times before giving up and assuming control of the slave cpus is no longer possible. It is more useful to be able to get some information out of kgdb instead of spinning forever. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-12-11 08:43:17 -06:00
Jason Wessel	84667d4849	kgdb: Read buffer overflow Roel Kluin reported an error found with Parfait. Where we want to ensure that that kgdb_info[-1] never gets accessed. Also check to ensure any negative tid does not exceed the size of the shadow CPU array, else report critical debug context because it is an internal kgdb failure. Reported-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2009-12-11 08:43:13 -06:00
Thomas Gleixner	bb6eddf767	clockevents: Prevent clockevent_devices list corruption on cpu hotplug Xiaotian Feng triggered a list corruption in the clock events list on CPU hotplug and debugged the root cause. If a CPU registers more than one per cpu clock event device, then only the active clock event device is removed on CPU_DEAD. The unused devices are kept in the clock events device list. On CPU up the clock event devices are registered again, which means that we list_add an already enqueued list_head. That results in list corruption. Resolve this by removing all devices which are associated to the dead CPU on CPU_DEAD. Reported-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Xiaotian Feng <dfeng@redhat.com> Cc: stable@kernel.org	2009-12-11 10:28:08 +01:00
Steven Rostedt	dd7f594357	ring-buffer: Move resize integrity check under reader lock While using an application that does splice on the ftrace ring buffer at start up, I triggered an integrity check failure. Looking into this, I discovered that resizing the buffer performs an integrity check after the buffer is resized. This check unfortunately is preformed after it releases the reader lock. If a reader is reading the buffer it may cause the integrity check to trigger a false failure. This patch simply moves the integrity checker under the protection of the ring buffer reader lock. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-10 23:20:52 -05:00
Steven Rostedt	184210154b	ring-buffer: Use sync sched protection on ring buffer resizing There was a comment in the ring buffer code that says the calling layers should prevent tracing or reading of the ring buffer while resizing. I have discovered that the tracers do not honor this arrangement. This patch moves the disabling and synchronizing the ring buffer to a higher layer during resizing. This guarantees that no writes are occurring while the resize takes place. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-10 22:54:27 -05:00
Thomas Gleixner	d954fbf0ff	tracing: Fix wrong usage of strstrip in trace_ksyms strstrip returns a pointer to the first non space character, but the code in parse_ksym_trace_str() ignores that. strstrip is now must_check and therefor we get the correct warning: kernel/trace/trace_ksym.c:294: warning: ignoring return value of ‘strstrip’, declared with attribute warn_unused_result We are really not interested in leading whitespace here. Fix that and cleanup the dozen kfree() exit pathes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org>	2009-12-11 00:01:36 +01:00
Thomas Gleixner	d4581a239a	sys: Fix missing rcu protection for __task_cred() access commit `c69e8d9` (CRED: Use RCU to access another task's creds and to release a task's own creds) added non rcu_read_lock() protected access to task creds of the target task in set_prio_one(). The comment above the function says: * - the caller must hold the RCU read lock The calling code in sys_setpriority does read_lock(&tasklist_lock) but not rcu_read_lock(). This works only when CONFIG_TREE_PREEMPT_RCU=n. With CONFIG_TREE_PREEMPT_RCU=y the rcu_callbacks can run in the tick interrupt when they see no read side critical section. There is another instance of __task_cred() in sys_setpriority() itself which is equally unprotected. Wrap the whole code section into a rcu read side critical section to fix this quick and dirty. Will be revisited in course of the read_lock(&tasklist_lock) -> rcu crusade. Oleg noted further: This also fixes another bug here. find_task_by_vpid() is not safe without rcu_read_lock(). I do not mean it is not safe to use the result, just find_pid_ns() by itself is not safe. Usually tasklist gives enough protection, but if copy_process() fails it calls free_pid() lockless and does call_rcu(delayed_put_pid(). This means, without rcu lock find_pid_ns() can't scan the hash table safely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091210004703.029784964@linutronix.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2009-12-10 23:04:11 +01:00
Thomas Gleixner	7cf7db8df0	signals: Fix more rcu assumptions 1) Remove the misleading comment in __sigqueue_alloc() which claims that holding a spinlock is equivalent to rcu_read_lock(). 2) Add a rcu_read_lock/unlock around the __task_cred() access in __sigqueue_alloc() This needs to be revisited to remove the remaining users of read_lock(&tasklist_lock) but that's outside the scope of this patch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091210004703.269843657@linutronix.de>	2009-12-10 23:04:11 +01:00
Thomas Gleixner	14d8c9f3c0	signal: Fix racy access to __task_cred in kill_pid_info_as_uid() kill_pid_info_as_uid() accesses __task_cred() without being in a RCU read side critical section. tasklist_lock is not protecting that when CONFIG_TREE_PREEMPT_RCU=y. Convert the whole tasklist_lock section to rcu and use lock_task_sighand to prevent the exit race. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091210004703.232302055@linutronix.de> Acked-by: Oleg Nesterov <oleg@redhat.com>	2009-12-10 23:04:11 +01:00
Ingo Molnar	b9889ed1dd	sched: Remove forced2_migrations stats This build warning: kernel/sched.c: In function 'set_task_cpu': kernel/sched.c:2070: warning: unused variable 'old_rq' Made me realize that the forced2_migrations stat looks pretty pointless (and a misnomer) - remove it. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 20:32:39 +01:00
Linus Torvalds	d71cb81af3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Add debugobjects support	2009-12-10 09:35:44 -08:00
Xiao Guangrong	5e855db5d8	perf_event: Fix variable initialization in other codepaths Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B20BAA6.7010609@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 17:23:02 +01:00
Phil Carmody	dfc12eb26a	sched: Fix memory leak in two error corner cases If the second in each of these pairs of allocations fails, then the first one will not be freed in the error route out. Found by a static code analysis tool. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1260448177-28448-1-git-send-email-ext-phil.2.carmody@nokia.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 14:28:10 +01:00
Heiko Carstens	5f201907df	hrtimer: move timer stats helper functions to hrtimer.c There is no reason to make timer_stats_hrtimer_set_start_info and friends visible to the rest of the kernel. So move all of them to hrtimer.c. Also make timer_stats_hrtimer_set_start_info a static inline function so it gets inlined and we avoid another function call. Based on a patch by Thomas Gleixner. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> LKML-Reference: <20091210095629.GC4144@osiris.boeblingen.de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-12-10 13:08:11 +01:00
Thomas Gleixner	41d2e49493	hrtimer: Tune hrtimer_interrupt hang logic The hrtimer_interrupt hang logic adjusts min_delta_ns based on the execution time of the hrtimer callbacks. This is error-prone for virtual machines, where a guest vcpu can be scheduled out during the execution of the callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor), which in can lead to large min_delta_ns rendering the system unusable. Replace the current heuristics with something more reliable. Allow the interrupt code to try 3 times to catch up with the lost time. If that fails use the total time spent in the interrupt handler to defer the next timer interrupt so the system can catch up with other things which got delayed. Limit that deferment to 100ms. The retry events and the maximum time spent in the interrupt handler are recorded and exposed via /proc/timer_list Inspired by a patch from Marcelo. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: kvm@vger.kernel.org	2009-12-10 13:08:11 +01:00
Mike Galbraith	4ca3ef71f5	sched: Fix build warning in get_update_sysctl_factor() Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <new-submission>	2009-12-10 09:34:50 +01:00
Luck, Tony	ea5b41f9d5	lockdep: Avoid out of bounds array reference in save_trace() ia64 found this the hard way (because we currently have a stub for save_stack_trace() that does nothing). But it would be a good idea to be cautious in case a real save_stack_trace() bailed out with an error before it set trace->nr_entries. Signed-off-by: Tony Luck <tony.luck@intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: luming.yu@intel.com LKML-Reference: <4b2024d085302c2a2@agluck-desktop.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 08:29:33 +01:00
Ingo Molnar	788d70dce0	Merge branch 'tip/tracing/core3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/core	2009-12-10 08:18:41 +01:00
Xiao Guangrong	21140f4d33	perf_event: Fix perf_swevent_hrtimer() variable initialization fix: [<c0477471>] ? printk+0x1d/0x24 [<c01c98f9>] ? perf_prepare_sample+0x269/0x280 [<c0149231>] warn_slowpath_common+0x71/0xd0 [<c01c98f9>] ? perf_prepare_sample+0x269/0x280 [<c01492aa>] warn_slowpath_null+0x1a/0x20 [<c01c98f9>] perf_prepare_sample+0x269/0x280 [<c016e9f3>] ? cpu_clock+0x53/0x90 [<c01cc368>] __perf_event_overflow+0x2a8/0x300 [<c01ccc3b>] perf_event_overflow+0x1b/0x30 [<c01ccccf>] perf_swevent_hrtimer+0x7f/0x120 This is because 'data.raw' variable not initialize. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B208E93.1010801@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-10 07:11:05 +01:00
Linus Torvalds	4ef58d4e2a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...	2009-12-09 19:43:33 -08:00
Thomas Gleixner	86fc80f16e	capabilities: Use RCU to protect task lookup in sys_capget cap_get_target_pid() protects the task lookup with tasklist_lock. security_capget() is called under tasklist_lock as well but tasklist_lock does not protect anything there. The capabilities are protected by RCU already. So tasklist_lock only protects the lookup and prevents the task going away, which can be done with rcu_read_lock() as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: James Morris <jmorris@namei.org>	2009-12-10 09:42:48 +11:00
Carsten Emde	f2942487ff	tracing: Remove comparing of NULL to va_list in trace_array_vprintk() Olof Johansson stated the following: Comparing a va_list with NULL is bogus. It's supposed to be treated like an opaque type and only be manipulated with va_* accessors. Olof noticed that this code broke the ARM builds: kernel/trace/trace.c: In function 'trace_array_vprintk': kernel/trace/trace.c:1364: error: invalid operands to binary == (have 'va_list' and 'void *') kernel/trace/trace.c: In function 'tracing_mark_write': kernel/trace/trace.c:3349: error: incompatible type for argument 3 of 'trace_vprintk' This patch partly reverts `c13d2f7c32` and re-installs the original mark_printk() mechanism. Reported-by: Olof Johansson <olof@lixom.net> Signed-off-by: Carsten Emde <C.Emde@osadl.org> LKML-Reference: <4B1BAB74.104@osadl.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 14:20:08 -05:00
Jiri Olsa	be1eca3931	tracing: Fix function graph trace_pipe to properly display failed entries There is a case where the graph tracer might get confused and omits displaying of a single record. This applies mostly with the trace_pipe since it is unlikely that the trace_seq buffer will overflow with the trace file. As the function_graph tracer goes through the trace entries keeping a pointer to the current record: current -> func1 ENTRY func2 ENTRY func2 RETURN func1 RETURN When an function ENTRY is encountered, it moves the pointer to the next entry to check if the function is a nested or leaf function. func1 ENTRY current -> func2 ENTRY func2 RETURN func1 RETURN If the rest of the writing of the function fills the trace_seq buffer, then the trace_pipe read will ignore this entry. The next read will Now start at the current location, but the first entry (func1) will be discarded. This patch keeps a copy of the current entry in the iterator private storage and will keep track of when the trace_seq buffer fills. When the trace_seq buffer fills, it will reuse the copy of the entry in the next iteration. [ This patch has been largely modified by Steven Rostedt in order to clean it up and simplify it. The original idea and concept was from Jirka and for that, this patch will go under his name to give him the credit he deserves. But because this was modify by Steven Rostedt anything wrong with the patch should be blamed on Steven. ] Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1259067458-27143-1-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 14:09:06 -05:00
Johannes Berg	d184b31c0e	tracing: Add full state to trace_seq The trace_seq buffer might fill up, and right now one needs to check the return value of each printf into the buffer to check for that. Instead, have the buffer keep track of whether it is full or not, and reject more input if it is full or would have overflowed with an input that wasn't added. Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 14:05:49 -05:00
Steven Rostedt	a63ce5b306	tracing: Buffer the output of seq_file in case of filled buffer If the seq_read fills the buffer it will call s_start again on the next itertation with the same position. This causes a problem with the function_graph tracer because it consumes the iteration in order to determine leaf functions. What happens is that the iterator stores the entry, and the function graph plugin will look at the next entry. If that next entry is a return of the same function and task, then the function is a leaf and the function_graph plugin calls ring_buffer_read which moves the ring buffer iterator forward (the trace iterator still points to the function start entry). The copying of the trace_seq to the seq_file buffer will fail if the seq_file buffer is full. The seq_read will not show this entry. The next read by userspace will cause seq_read to again call s_start which will reuse the trace iterator entry (the function start entry). But the function return entry was already consumed. The function graph plugin will think that this entry is a nested function and not a leaf. To solve this, the trace code now checks the return status of the seq_printf (trace_print_seq). If the writing to the seq_file buffer fails, we set a flag in the iterator (leftover) and we do not reset the trace_seq buffer. On the next call to s_start, we check the leftover flag, and if it is set, we just reuse the trace_seq buffer and do not call into the plugin print functions. Before this patch: 2) \| fput() { 2) \| __fput() { 2) 0.550 us \| inotify_inode_queue_event(); 2) \| __fsnotify_parent() { 2) 0.540 us \| inotify_dentry_parent_queue_event(); After the patch: 2) \| fput() { 2) \| __fput() { 2) 0.550 us \| inotify_inode_queue_event(); 2) 0.548 us \| __fsnotify_parent(); 2) 0.540 us \| inotify_dentry_parent_queue_event(); [ Updated the patch to fix a missing return 0 from the trace_print_seq() stub when CONFIG_TRACING is disabled. Reported-by: Ingo Molnar <mingo@elte.hu> ] Reported-by: Jiri Olsa <jolsa@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 13:55:26 -05:00
Steven Rostedt	29bf4a5e3f	tracing: Only call pipe_close if pipe_close is defined This fixes a cut and paste error that had pipe_close get called if pipe_open was defined (not pipe_close). Reported-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com> LKML-Reference: <20091209153204.F4CD.A69D9226@jp.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-09 12:47:35 -05:00
Linus Torvalds	3b8ecd2244	Merge branch 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sys: Remove BKL from sys_reboot pm_qos: clean up racy global "name" variable pm_qos: remove BKL	2009-12-09 08:07:17 -08:00
Thomas Gleixner	f409adf5b1	futex: Protect pid lookup in compat code with RCU find_task_by_vpid() in compat_sys_get_robust_list() does not require tasklist_lock. It can be protected with rcu_read_lock as done in sys_get_robust_list() already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org>	2009-12-09 14:22:14 +01:00
Frederic Weisbecker	822a696111	tracing/kprobes: Fix field creation's bad error handling When we define the common event fields in kprobe, we invert the error handling and return immediately in case of success. Then we omit to define specific kprobes fields (ip and nargs), and specific kretprobes fields (func, ret_ip, nargs). And we only define them when we fail to create common fields. The most visible consequence is that we can't create filter for k(ret)probes specific fields. This patch re-invert the success/error handling to fix it. Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1260263815-5167-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:32:21 +01:00
Christian Ehrhardt	acb4a848da	sched: Update normalized values on user updates via proc The normalized values are also recalculated in case the scaling factor changes. This patch updates the internally used scheduler tuning values that are normalized to one cpu in case a user sets new values via sysfs. Together with patch 2 of this series this allows to let user configured values scale (or not) to cpu add/remove events taking place later. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-4-git-send-email-ehrhardt@linux.vnet.ibm.com> [ v2: fix warning ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:04:02 +01:00
Christian Ehrhardt	1983a922a1	sched: Make tunable scaling style configurable As scaling now takes place on all kind of cpu add/remove events a user that configures values via proc should be able to configure if his set values are still rescaled or kept whatever happens. As the comments state that log2 was just a second guess that worked the interface is not just designed for on/off, but to choose a scaling type. Currently this allows none, log and linear, but more important it allwos us to keep the interface even if someone has an even better idea how to scale the values. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-3-git-send-email-ehrhardt@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:04:01 +01:00
Christian Ehrhardt	0bcdcf28c9	sched: Fix missing sched tunable recalculation on cpu add/remove Based on Peter Zijlstras patch suggestion this enables recalculation of the scheduler tunables in response of a change in the number of cpus. It also adds a max of eight cpus that are considered in that scaling. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259579808-11357-2-git-send-email-ehrhardt@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:58 +01:00
Peter Zijlstra	57785df5ac	sched: Fix task priority bug 83f9ac removed a call to effective_prio() in wake_up_new_task(), which leads to tasks running at MAX_PRIO. This is caused by the idle thread being set to MAX_PRIO before forking off init. O(1) used that to make sure idle was always preempted, CFS uses check_preempt_curr_idle() for that so we can savely remove this bit of legacy code. Reported-by: Mike Galbraith <efault@gmx.de> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1259754383.4003.610.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:10 +01:00
Peter Zijlstra	cd8ad40de3	sched: cgroup: Implement different treatment for idle shares When setting the weight for a per-cpu task-group, we have to put in a phantom weight when there is no work on that cpu, otherwise we'll not service that cpu when new work gets placed there until we again update the per-cpu weights. We used to add these phantom weights to the total, so that the idle per-cpu shares don't get inflated, this however causes the non-idle parts to get deflated, causing unexpected weight distibutions. Reverse this, so that the non-idle shares are correct but the idle shares are inflated. Reported-by: Yasunori Goto <y-goto@jp.fujitsu.com> Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1257934048.23203.76.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:09 +01:00
Peter Zijlstra	fb58bac5c7	sched: Remove unnecessary RCU exclusion As Nick pointed out, and realized by myself when doing: sched: Fix balance vs hotplug race the patch: sched: for_each_domain() vs RCU is wrong, sched_domains are freed after synchronize_sched(), which means disabling preemption is enough. Reported-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:08 +01:00
Peter Zijlstra	6cecd084d0	sched: Discard some old bits WAKEUP_RUNNING was an experiment, not sure why that ever ended up being merged... Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:07 +01:00
Peter Zijlstra	3a7e73a2e2	sched: Clean up check_preempt_wakeup() Streamline the wakeup preemption code a bit, unifying the preempt path so that they all do the same. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:07 +01:00
Jupyung Lee	a65ac745e4	sched: Move update_curr() in check_preempt_wakeup() to avoid redundant call If a RT task is woken up while a non-RT task is running, check_preempt_wakeup() is called to check whether the new task can preempt the old task. The function returns quickly without going deeper because it is apparent that a RT task can always preempt a non-RT task. In this situation, check_preempt_wakeup() always calls update_curr() to update vruntime value of the currently running task. However, the function call is unnecessary and redundant at that moment because (1) a non-RT task can always be preempted by a RT task regardless of its vruntime value, and (2) update_curr() will be called shortly when the context switch between two occurs. By moving update_curr() in check_preempt_wakeup(), we can avoid redundant call to update_curr(), slightly reducing the time taken to wake up RT tasks. Signed-off-by: Jupyung Lee <jupyung@gmail.com> [ Place update_curr() right before the wake_preempt_entity() call, which is the only thing that relies on the updated vruntime ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1258451500-6714-1-git-send-email-jupyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:06 +01:00
Peter Zijlstra	cd29fe6f26	sched: Sanitize fork() handling Currently we try to do task placement in wake_up_new_task() after we do the load-balance pass in sched_fork(). This yields complicated semantics in that we have to deal with tasks on different RQs and the set_task_cpu() calls in copy_process() and sched_fork() Rename ->task_new() to ->task_fork() and call it from sched_fork() before the balancing, this gives the policy a clear point to place the task. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:05 +01:00
Peter Zijlstra	ab19cb2331	sched: Clean up ttwu() rq locking Since set_task_clock() doesn't rely on rq->clock anymore we can simplyfy the mess in ttwu(). Optimize things a bit by not fiddling with the IRQ state there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:04 +01:00
Peter Zijlstra	5afcdab706	sched: Remove rq->clock coupling from set_task_cpu() set_task_cpu() should be rq invariant and only touch task state, it currently fails to do so, which opens up a few races, since not all callers hold both rq->locks. Remove the relyance on rq->clock, as any site calling set_task_cpu() should also do a remote clock update, which should ensure the observed time between these two cpus is monotonic, as per kernel/sched_clock.c:sched_clock_remote(). Therefore we can simply remove the clock_offset bits and be happy. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:03 +01:00
Peter Zijlstra	970b13bacb	sched: Consolidate select_task_rq() callers Small cleanup. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> [ v2: build fix ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:02 +01:00
Peter Zijlstra	6b314d0e11	sched: Remove sysctl.sched_features Since we've had a much saner debugfs interface to this, remove the sysctl one. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> [ v2: build fix ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:03:01 +01:00
Thomas Gleixner	dba091b9e3	sched: Protect sched_rr_get_param() access to task->sched_class sched_rr_get_param calls task->sched_class->get_rr_interval(task) without protection against a concurrent sched_setscheduler() call which modifies task->sched_class. Serialize the access with task_rq_lock(task) and hand the rq pointer into get_rr_interval() as it's needed at least in the sched_fair implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:01:07 +01:00
Thomas Gleixner	3160568371	sched: Protect task->cpus_allowed access in sched_getaffinity() sched_getaffinity() is not protected against a concurrent modification of the tasks affinity. Serialize the access with task_rq_lock(task). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091208202026.769251187@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 10:01:06 +01:00
Xiao Guangrong	ec89a06fd4	perf_event: Cleanup for cpu_clock_perf_event_update() Using atomic64_xchg() instead of atomic64_read() and atomic64_set(). Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B1F19DC.90204@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 09:56:27 +01:00
Xiao Guangrong	b93f7978ad	perf_event: Allocate children's perf_event_ctxp at the right time In current code, children task will allocate memory for 'child->perf_event_ctxp' if the parent is counted, we can do it only if the parent allowed children inherit it. It can save memory and reduce overhead. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B1F19A8.5040805@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 09:56:27 +01:00
Xiao Guangrong	aa5452d70c	perf_event: Clean up __perf_event_init_context() Clean up the code a bit: - define 'perf_cpu_context' variable with 'static' - use kzalloc() instead of kmalloc() and memset() Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B1F194D.7080306@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 09:56:27 +01:00
Frederic Weisbecker	44234adcdc	hw-breakpoints: Modify breakpoints without unregistering them Currently, when ptrace needs to modify a breakpoint, like disabling it, changing its address, type or len, it calls modify_user_hw_breakpoint(). This latter will perform the heavy and racy task of unregistering the old breakpoint and registering a new one. This is racy as someone else might steal the reserved breakpoint slot under us, which is undesired as the breakpoint is only supposed to be modified, sometimes in the middle of a debugging workflow. We don't want our slot to be stolen in the middle. So instead of unregistering/registering the breakpoint, just disable it while we modify its breakpoint fields and re-enable it after if necessary. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1260347148-5519-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-09 09:48:20 +01:00
Masami Hiramatsu	a7c312bed7	trace-kprobe: Support delete probe syntax Support delete probe syntax. The syntax is "-:[group/]event". Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> LKML-Reference: <20091208220316.10142.39192.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>	2009-12-09 07:26:53 +01:00
Linus Torvalds	2b876f95d0	Merge branches 'timers-for-linus-ntp' and 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-ntp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ntp: Provide compability defines (You say MOD_NANO, I say ADJ_NANO) * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: do not execute DEBUG_SHIRQ when irq setup failed	2009-12-08 19:30:19 -08:00
Linus Torvalds	fbf07eac7b	Merge branch 'timers-for-linus-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus-urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: Fix /proc/timer_list regression itimers: Fix racy writes to cpu_itimer fields timekeeping: Fix clock_gettime vsyscall time warp	2009-12-08 19:28:09 -08:00
Linus Torvalds	60d8ce2cd6	Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timers, init: Limit the number of per cpu calibration bootup messages posix-cpu-timers: optimize and document timer_create callback clockevents: Add missing include to pacify sparse x86: vmiclock: Fix printk format x86: Fix printk format due to variable type change sparc: fix printk for change of variable type clocksource/events: Fix fallout of generic code changes nohz: Allow 32-bit machines to sleep for more than 2.15 seconds nohz: Track last do_timer() cpu nohz: Prevent clocksource wrapping during idle nohz: Type cast printk argument mips: Use generic mult/shift factor calculation for clocks clocksource: Provide a generic mult/shift factor calculation clockevents: Use u32 for mult and shift factors nohz: Introduce arch_needs_cpu nohz: Reuse ktime in sub-functions of tick_check_idle. time: Remove xtime_cache time: Implement logarithmic time accumulation	2009-12-08 19:27:08 -08:00
Linus Torvalds	4d2a914239	Merge branch 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: core: Clean up user return notifers use of per_cpu	2009-12-08 13:24:20 -08:00
Linus Torvalds	6035ccd8e9	Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits) cfq-iosched: Do not access cfqq after freeing it block: include linux/err.h to use ERR_PTR cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit blkio: Allow CFQ group IO scheduling even when CFQ is a module blkio: Implement dynamic io controlling policy registration blkio: Export some symbols from blkio as its user CFQ can be a module block: Fix io_context leak after failure of clone with CLONE_IO block: Fix io_context leak after clone with CLONE_IO cfq-iosched: make nonrot check logic consistent io controller: quick fix for blk-cgroup and modular CFQ cfq-iosched: move IO controller declerations to a header file cfq-iosched: fix compile problem with !CONFIG_CGROUP blkio: Documentation blkio: Wait on sync-noidle queue even if rq_noidle = 1 blkio: Implement group_isolation tunable blkio: Determine async workload length based on total number of queues blkio: Wait for cfq queue to get backlogged if group is empty blkio: Propagate cgroup weight updation to cfq groups blkio: Drop the reference to queue once the task changes cgroup blkio: Provide some isolation between groups ...	2009-12-08 08:19:16 -08:00
Linus Torvalds	dad3de7d00	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Add flag for devices capable of generating run-time wake-up events PM / Runtime: Remove unnecessary braces in __pm_runtime_set_status() PM / Runtime: Make documentation of runtime_idle() agree with the code PM / Runtime: Ensure timer_expires is nonzero in pm_schedule_suspend() PM / Runtime: Use deferred_resume flag in pm_request_resume PM / Runtime: Export the PM runtime workqueue PM / Runtime: Fix lockdep warning in __pm_runtime_set_status() PM / Hibernate: Swap, use KERN_CONT PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c PM / Hibernate: Move swap functions to kernel/power/swap.c. PM / freezer: Don't get over-anxious while waiting	2009-12-08 08:07:16 -08:00
Linus Torvalds	ed9216c171	Merge branch 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (84 commits) KVM: VMX: Fix comparison of guest efer with stale host value KVM: s390: Fix prefix register checking in arch/s390/kvm/sigp.c KVM: Drop user return notifier when disabling virtualization on a cpu KVM: VMX: Disable unrestricted guest when EPT disabled KVM: x86 emulator: limit instructions to 15 bytes KVM: s390: Make psw available on all exits, not just a subset KVM: x86: Add KVM_GET/SET_VCPU_EVENTS KVM: VMX: Report unexpected simultaneous exceptions as internal errors KVM: Allow internal errors reported to userspace to carry extra data KVM: Reorder IOCTLs in main kvm.h KVM: x86: Polish exception injection via KVM_SET_GUEST_DEBUG KVM: only clear irq_source_id if irqchip is present KVM: x86: disallow KVM_{SET,GET}_LAPIC without allocated in-kernel lapic KVM: x86: disallow multiple KVM_CREATE_IRQCHIP KVM: VMX: Remove vmx->msr_offset_efer KVM: MMU: update invlpg handler comment KVM: VMX: move CR3/PDPTR update to vmx_set_cr3 KVM: remove duplicated task_switch check KVM: powerpc: Fix BUILD_BUG_ON condition KVM: VMX: Use shared msr infrastructure ... Trivial conflicts due to new Kconfig options in arch/Kconfig and kernel/Makefile	2009-12-08 08:02:38 -08:00
Linus Torvalds	d7fc02c7ba	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1815 commits) mac80211: fix reorder buffer release iwmc3200wifi: Enable wimax core through module parameter iwmc3200wifi: Add wifi-wimax coexistence mode as a module parameter iwmc3200wifi: Coex table command does not expect a response iwmc3200wifi: Update wiwi priority table iwlwifi: driver version track kernel version iwlwifi: indicate uCode type when fail dump error/event log iwl3945: remove duplicated event logging code b43: fix two warnings ipw2100: fix rebooting hang with driver loaded cfg80211: indent regulatory messages with spaces iwmc3200wifi: fix NULL pointer dereference in pmkid update mac80211: Fix TX status reporting for injected data frames ath9k: enable 2GHz band only if the device supports it airo: Fix integer overflow warning rt2x00: Fix padding bug on L2PAD devices. WE: Fix set events not propagated b43legacy: avoid PPC fault during resume b43: avoid PPC fault during resume tcp: fix a timewait refcnt race ... Fix up conflicts due to sysctl cleanups (dead sysctl_check code and CTL_UNNUMBERED removed) in kernel/sysctl_check.c net/ipv4/sysctl_net_ipv4.c net/ipv6/addrconf.c net/sctp/sysctl.c	2009-12-08 07:55:01 -08:00
Linus Torvalds	1557d33007	Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits) security/tomoyo: Remove now unnecessary handling of security_sysctl. security/tomoyo: Add a special case to handle accesses through the internal proc mount. sysctl: Drop & in front of every proc_handler. sysctl: Remove CTL_NONE and CTL_UNNUMBERED sysctl: kill dead ctl_handler definitions. sysctl: Remove the last of the generic binary sysctl support sysctl net: Remove unused binary sysctl code sysctl security/tomoyo: Don't look at ctl_name sysctl arm: Remove binary sysctl support sysctl x86: Remove dead binary sysctl support sysctl sh: Remove dead binary sysctl support sysctl powerpc: Remove dead binary sysctl support sysctl ia64: Remove dead binary sysctl support sysctl s390: Remove dead sysctl binary support sysctl frv: Remove dead binary sysctl support sysctl mips/lasat: Remove dead binary sysctl support sysctl drivers: Remove dead binary sysctl support sysctl crypto: Remove dead binary sysctl support sysctl security/keys: Remove dead binary sysctl support sysctl kernel: Remove binary sysctl logic ...	2009-12-08 07:38:50 -08:00
Andi Kleen	722d017237	futex: Take mmap_sem for get_user_pages in fault_in_user_writeable get_user_pages() must be called with mmap_sem held. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: stable@kernel.org Cc: Andrew Morton <akpm@linuxfoundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091208121942.GA21298@basil.fritz.box> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-12-08 14:59:36 +01:00
Stephen Rothwell	6ab8886326	perf: hw_breakpoints: Fix percpu namespace clash Today's linux-next build failed with: kernel/hw_breakpoint.c:86: error: 'task_bp_pinned' redeclared as different kind of symbol ... Caused by commit `dd17c8f729` ("percpu: remove per_cpu__ prefix") from the percpu tree interacting with commit `56053170ea` ("hw-breakpoints: Fix task-bound breakpoint slot allocation") from the tip tree. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20091208182515.bb6dda4a.sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-08 09:34:43 +01:00
Tejun Heo	50de1a8ef1	Merge branch 'for-linus' into for-next Conflicts: mm/percpu.c	2009-12-08 10:02:12 +09:00
Jiri Kosina	d014d04386	Merge branch 'for-next' into for-linus Conflicts: kernel/irq/chip.c	2009-12-07 18:36:35 +01:00
Steven Rostedt	c521efd170	tracing: Add pipe_close interface An ftrace plugin can add a pipe_open interface when the user opens trace_pipe. But if the plugin allocates something within the pipe_open it can not free it because there exists no pipe_close. The hook to the trace file open has a corresponding close. The closing of the trace_pipe file should also have a corresponding close. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-12-07 12:01:35 -05:00
Frederic Weisbecker	56053170ea	hw-breakpoints: Fix task-bound breakpoint slot allocation Whatever the context nature of a breakpoint, we always perform the following constraint checks before allocating it a slot: - Check the number of pinned breakpoint bound the concerned cpus - Check the max number of task-bound breakpoints that are belonging to a task. - Add both and see if we have a reamining slot for the new breakpoint This is the right thing to do when we are about to register a cpu-only bound breakpoint. But not if we are dealing with a task bound breakpoint. What we want in this case is: - Check the number of pinned breakpoint bound the concerned cpus - Check the number of breakpoints that already belong to the task in which the breakpoint to register is bound to. - Add both This fixes a regression that makes the "firefox -g" command fail to register breakpoints once we deal with a secondary thread. Reported-by: Walt <w41ter@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com>	2009-12-07 07:05:28 +01:00
Peter Zijlstra	6ad4c18884	sched: Fix balance vs hotplug race Since (`e761b77`: cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment) we have cpu_active_mask which is suppose to rule scheduler migration and load-balancing, except it never (fully) did. The particular problem being solved here is a crash in try_to_wake_up() where select_task_rq() ends up selecting an offline cpu because select_task_rq_fair() trusts the sched_domain tree to reflect the current state of affairs, similarly select_task_rq_rt() trusts the root_domain. However, the sched_domains are updated from CPU_DEAD, which is after the cpu is taken offline and after stop_machine is done. Therefore it can race perfectly well with code assuming the domains are right. Cure this by building the domains from cpu_active_mask on CPU_DOWN_PREPARE. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-06 21:10:56 +01:00
Geert Uytterhoeven	e1b8090bdf	cpumask: Fix generate_sched_domains() for UP Commit `acc3f5d7ca` ("cpumask: Partition_sched_domains takes array of cpumask_var_t") changed the function signature of generate_sched_domains() for the CONFIG_SMP=y case, but forgot to update the corresponding function for the CONFIG_SMP=n case, causing: kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <alpine.DEB.2.00.0912062038070.5693@ayla.of.borg> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-06 21:08:41 +01:00
Alan Stern	7b199ca202	PM / Runtime: Export the PM runtime workqueue This patch (as1306) exports the PM runtime workqueue for use by loadable modules. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-12-06 16:17:56 +01:00
Jiri Slaby	66d0ae4d6f	PM / Hibernate: Swap, use KERN_CONT Use KERN_CONT in save_image() for printks, so that anybody won't try to add a loglevel. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-12-06 16:16:24 +01:00
Nigel Cunningham	8e60c6a134	PM / Hibernate: Shift remaining code from swsusp.c to hibernate.c Shift the remaining declaration of the variable in_suspend and the function swsusp_show_speed from swsusp.c to hibernate.c, and delete swsusp.c. Signed-off-by: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-12-06 16:16:07 +01:00
Nigel Cunningham	0414f2ec03	PM / Hibernate: Move swap functions to kernel/power/swap.c. Move hibernation code's functions for allocating and freeing swap from swsusp.c to swap.c, which is where you'd expect to find them. Signed-off-by: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-12-06 16:15:53 +01:00
Rafael J. Wysocki	64357ed468	Merge branch 'master' into for-linus	2009-12-06 16:06:11 +01:00
Frank Rowand	109d71c6dd	lockstat: Fix min, max times in /proc/lock_stats Fix min, max times in /proc/lock_stats (1) When collecting lock hold and wait times, if the current minimum time is zero, it will be replaced by the next time. (2) When aggregating minimum and maximum lock hold and wait times accross cpus, the values are added, instead of selecting the minimum and maximum. Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4B05BBAE.2050005@am.sony.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-06 13:20:00 +01:00
Frederic Weisbecker	b326e9560a	hw-breakpoints: Use overflow handler instead of the event callback struct perf_event::event callback was called when a breakpoint triggers. But this is a rather opaque callback, pretty tied-only to the breakpoint API and not really integrated into perf as it triggers even when we don't overflow. We prefer to use overflow_handler() as it fits into the perf events rules, being called only when we overflow. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>	2009-12-06 08:27:18 +01:00
Frederic Weisbecker	2f0993e0fb	hw-breakpoints: Drop callback and task parameters from modify helper Drop the callback and task parameters from modify_user_hw_breakpoint(). For now we have no user that need to modify a breakpoint to the point of changing its handler or its task context. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>	2009-12-06 08:27:17 +01:00
Linus Torvalds	897e81bea1	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits) sched, cputime: Introduce thread_group_times() sched, cputime: Cleanups related to task_times() Revert "sched, x86: Optimize branch hint in __switch_to()" sched: Fix isolcpus boot option sched: Revert `498657a478` sched, time: Define nsecs_to_jiffies() sched: Remove task_{u,s,g}time() sched: Introduce task_times() to replace task_{u,s}time() pair sched: Limit the number of scheduler debug messages sched.c: Call debug_show_all_locks() when dumping all tasks sched, x86: Optimize branch hint in __switch_to() sched: Optimize branch hint in context_switch() sched: Optimize branch hint in pick_next_task_fair() sched_feat_write(): Update ppos instead of file->f_pos sched: Sched_rt_periodic_timer vs cpu hotplug sched, kvm: Fix race condition involving sched_in_preempt_notifers sched: More generic WAKE_AFFINE vs select_idle_sibling() sched: Cleanup select_task_rq_fair() sched: Fix granularity of task_u/stime() sched: Fix/add missing update_rq_clock() calls ...	2009-12-05 15:30:49 -08:00
Linus Torvalds	c3fa27d136	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (470 commits) x86: Fix comments of register/stack access functions perf tools: Replace %m with %a in sscanf hw-breakpoints: Keep track of user disabled breakpoints tracing/syscalls: Make syscall events print callbacks static tracing: Add DEFINE_EVENT(), DEFINE_SINGLE_EVENT() support to docbook perf: Don't free perf_mmap_data until work has been done perf_event: Fix compile error perf tools: Fix _GNU_SOURCE macro related strndup() build error trace_syscalls: Remove unused syscall_name_to_nr() trace_syscalls: Simplify syscall profile trace_syscalls: Remove duplicate init_enter_##sname() trace_syscalls: Add syscall_nr field to struct syscall_metadata trace_syscalls: Remove enter_id exit_id trace_syscalls: Set event_enter_##sname->data to its metadata trace_syscalls: Remove unused event_syscall_enter and event_syscall_exit perf_event: Initialize data.period in perf_swevent_hrtimer() perf probe: Simplify event naming perf probe: Add --list option for listing current probe events perf probe: Add argv_split() from lib/argv_split.c perf probe: Move probe event utility functions to probe-event.c ...	2009-12-05 15:30:21 -08:00
David S. Miller	28b4d5cc17	Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ Conflicts: drivers/net/pcmcia/fmvj18x_cs.c drivers/net/pcmcia/nmclan_cs.c drivers/net/pcmcia/xirc2ps_cs.c drivers/net/wireless/ray_cs.c	2009-12-05 15:22:26 -08:00
Linus Torvalds	96fa2b508d	Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (40 commits) tracing: Separate raw syscall from syscall tracer ring-buffer-benchmark: Add parameters to set produce/consumer priorities tracing, function tracer: Clean up strstrip() usage ring-buffer benchmark: Run producer/consumer threads at nice +19 tracing: Remove the stale include/trace/power.h tracing: Only print objcopy version warning once from recordmcount tracing: Prevent build warning: 'ftrace_graph_buf' defined but not used ring-buffer: Move access to commit_page up into function used tracing: do not disable interrupts for trace_clock_local ring-buffer: Add multiple iterations between benchmark timestamps kprobes: Sanitize struct kretprobe_instance allocations tracing: Fix to use __always_unused attribute compiler: Introduce __always_unused tracing: Exit with error if a weak function is used in recordmcount.pl tracing: Move conditional into update_funcs() in recordmcount.pl tracing: Add regex for weak functions in recordmcount.pl tracing: Move mcount section search to front of loop in recordmcount.pl tracing: Fix objcopy revision check in recordmcount.pl tracing: Check absolute path of input file in recordmcount.pl tracing: Correct the check for number of arguments in recordmcount.pl ...	2009-12-05 09:53:36 -08:00
Linus Torvalds	7a797cdcca	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix trace_marker output tracing: Fix event format export tracing: Fix return value of tracing_stats_read()	2009-12-05 09:53:21 -08:00
Linus Torvalds	bb2166c898	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Fix spurious irq seqfile conversion genirq: switch /proc/irq/*/spurious to seq_file irq: Do not attempt to create subdirectories if /proc/irq/<irq> failed irq: Remove unused debug_poll_all_shared_irqs() irq: Fix docbook comments irq: trivial: Fix typo in comment for #endif	2009-12-05 09:53:08 -08:00
Linus Torvalds	0bf7969fea	Merge branch 'core-softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: softlockup: Fix hung_task_check_count sysctl	2009-12-05 09:52:46 -08:00
Linus Torvalds	69f061e0c2	Merge branch 'core-signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: signal: Print warning message when dropping signals signal: Fix alternate signal stack check	2009-12-05 09:52:33 -08:00
Linus Torvalds	607781762e	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits) rcu: Make RCU's CPU-stall detector be default rcu: Add expedited grace-period support for preemptible RCU rcu: Enable fourth level of TREE_RCU hierarchy rcu: Rename "quiet" functions rcu: Re-arrange code to reduce #ifdef pain rcu: Eliminate unneeded function wrapping rcu: Fix grace-period-stall bug on large systems with CPU hotplug rcu: Eliminate __rcu_pending() false positives rcu: Further cleanups of use of lastcomp rcu: Simplify association of forced quiescent states with grace periods rcu: Accelerate callback processing on CPUs not detecting GP end rcu: Mark init-time-only rcu_bootup_announce() as __init rcu: Simplify association of quiescent states with grace periods rcu: Rename dynticks_completed to completed_fqs rcu: Enable synchronize_sched_expedited() fastpath rcu: Remove inline from forward-referenced functions rcu: Fix note_new_gpnum() uses of ->gpnum rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter rcu: Cleanup: balance rcu_irq_enter()/rcu_irq_exit() calls ...	2009-12-05 09:52:14 -08:00
Linus Torvalds	d0b093a8b5	Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ratelimit: Make suppressed output messages more useful printk: Remove ratelimit.h from kernel.h ratelimit: Fix/allow use in atomic contexts ratelimit: Use per ratelimit context locking	2009-12-05 09:50:22 -08:00
Linus Torvalds	3e72b810e3	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: mutex: Fix missing conditions to build mutex_spin_on_owner() mutex: Better control mutex adaptive spinning config locking, task_struct: Reduce size on TRACE_IRQFLAGS and 64bit locking: Use __[SPIN\|RW]_LOCK_UNLOCKED in [spin\|rw]_lock_init() locking: Remove unused prototype locking: Reduce ifdefs in kernel/spinlock.c locking: Make inlining decision Kconfig based	2009-12-05 09:49:59 -08:00
Linus Torvalds	9b269d4034	Merge branch 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: Add smp_call_function_any() generic-ipi: Fix misleading smp_call_function*() description	2009-12-05 09:49:46 -08:00
Louis Rilling	b69f229206	block: Fix io_context leak after failure of clone with CLONE_IO With CLONE_IO, parent's io_context->nr_tasks is incremented, but never decremented whenever copy_process() fails afterwards, which prevents exit_io_context() from calling IO schedulers exit functions. Give a task_struct to exit_io_context(), and call exit_io_context() instead of put_io_context() in copy_process() cleanup path. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:36:18 +01:00
André Goddard Rosa	af901ca181	tree-wide: fix assorted typos all over the place That is "success", "unknown", "through", "performance", "[re\|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over\|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:55 +01:00
Uwe Kleine-König	fbfecd3712	tree-wide: fix typos "couter" -> "counter" This patch was generated by git grep -E -i -l 'couter' \| xargs -r perl -p -i -e 's/couter/counter/' Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:51 +01:00
Liuweni	fb3d38b990	fix kerneldoc for set_irq_msi() Signed-off-by: Liuweni <qingshenlwy@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:50 +01:00
Patrick McHardy	8153a10c08	ipv4 05/05: add sysctl to accept packets with local source addresses commit 8ec1e0ebe26087bfc5c0394ada5feb5758014fc8 Author: Patrick McHardy <kaber@trash.net> Date: Thu Dec 3 12:16:35 2009 +0100 ipv4: add sysctl to accept packets with local source addresses Change fib_validate_source() to accept packets with a local source address when the "accept_local" sysctl is set for the incoming inet device. Combined with the previous patches, this allows to communicate between multiple local interfaces over the wire. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-03 12:14:38 -08:00
Frederic Weisbecker	c08f782985	mutex: Fix missing conditions to build mutex_spin_on_owner() We don't need to build mutex_spin_on_owner() if we have CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES as it won't be used under such configs. Use CONFIG_MUTEX_SPIN_ON_OWNER as it gathers all the necessary checks before building it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1259783357-8542-2-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>	2009-12-03 11:50:11 +01:00
Frederic Weisbecker	c02260277e	mutex: Better control mutex adaptive spinning config Introduce CONFIG_MUTEX_SPIN_ON_OWNER so that we can centralize in a single place the conditions that determine its definition and use. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1259783357-8542-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org>	2009-12-03 11:50:11 +01:00
Paul E. McKenney	d9a3da0699	rcu: Add expedited grace-period support for preemptible RCU Implement an synchronize_rcu_expedited() for preemptible RCU that actually is expedited. This uses synchronize_sched_expedited() to force all threads currently running in a preemptible-RCU read-side critical section onto the appropriate ->blocked_tasks[] list, then takes a snapshot of all of these lists and waits for them to drain. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1259784616158-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-03 11:35:25 +01:00
Paul E. McKenney	cf244dc01b	rcu: Enable fourth level of TREE_RCU hierarchy Enable a fourth level of rcu_node hierarchy for TREE_RCU and TREE_PREEMPT_RCU. This is for stress-testing and experiemental purposes only, although in theory this would enable 16,777,216 CPUs on 64-bit systems, though only 1,048,576 CPUs on 32-bit systems. Normal experimental use of this fourth level will normally set CONFIG_RCU_FANOUT=2, requiring a 16-CPU system, though the more adventurous (and more fortunate) experimenters may wish to chose CONFIG_RCU_FANOUT=3 for 81-CPU systems or even CONFIG_RCU_FANOUT=4 for 256-CPU systems. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12597846161257-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-03 11:34:53 +01:00
Paul E. McKenney	d3f6bad391	rcu: Rename "quiet" functions The number of "quiet" functions has grown recently, and the names are no longer very descriptive. The point of all of these functions is to do some portion of the task of reporting a quiescent state, so rename them accordingly: o cpu_quiet() becomes rcu_report_qs_rdp(), which reports a quiescent state to the per-CPU rcu_data structure. If this turns out to be a new quiescent state for this grace period, then rcu_report_qs_rnp() will be invoked to propagate the quiescent state up the rcu_node hierarchy. o cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports a quiescent state for a given CPU (or possibly a set of CPUs) up the rcu_node hierarchy. o cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which reports a full set of quiescent states to the global rcu_state structure. o task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports a quiescent state due to a task exiting an RCU read-side critical section that had previously blocked in that same critical section. As indicated by the new name, this type of quiescent state is reported up the rcu_node hierarchy (using rcu_report_qs_rnp() to do so). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@joshtriplett.org> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12597846163698-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-03 11:34:26 +01:00
Avi Kivity	58988b07cf	Merge remote branch 'tip/x86/entry' into kvm-updates/2.6.33 Signed-off-by: Avi Kivity <avi@redhat.com>	2009-12-03 09:30:06 +02:00
James Morris	c84d6efd36	Merge branch 'master' into next	2009-12-03 12:03:40 +05:30
Helge Deller	35dead4235	modules: don't export section names of empty sections via sysfs On the parisc architecture we face for each and every loaded kernel module this kernel "badness warning": sysfs: cannot create duplicate filename '/module/ac97_bus/sections/.text' Badness at fs/sysfs/dir.c:487 Reason for that is, that on parisc all kernel modules do have multiple .text sections due to the usage of the -ffunction-sections compiler flag which is needed to reach all jump targets on this platform. An objdump on such a kernel module gives: Sections: Idx Name Size VMA LMA File off Algn 0 .note.gnu.build-id 00000024 00000000 00000000 00000034 22 CONTENTS, ALLOC, LOAD, READONLY, DATA 1 .text 00000000 00000000 00000000 00000058 20 CONTENTS, ALLOC, LOAD, READONLY, CODE 2 .text.ac97_bus_match 0000001c 00000000 00000000 00000058 22 CONTENTS, ALLOC, LOAD, READONLY, CODE 3 .text 00000000 00000000 00000000 000000d4 20 CONTENTS, ALLOC, LOAD, READONLY, CODE ... Since the .text sections are empty (size of 0 bytes) and won't be loaded by the kernel module loader anyway, I don't see a reason why such sections need to be listed under /sys/module/<module_name>/sections/<section_name> either. The attached patch does solve this issue by not exporting section names which are empty. This fixes bugzilla http://bugzilla.kernel.org/show_bug.cgi?id=14703 Signed-off-by: Helge Deller <deller@gmx.de> CC: rusty@rustcorp.com.au CC: akpm@linux-foundation.org CC: James.Bottomley@HansenPartnership.com CC: roland@redhat.com CC: dave@hiauly1.hia.nrc.ca Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-02 15:38:25 -08:00
Hidetoshi Seto	0cf55e1ec0	sched, cputime: Introduce thread_group_times() This is a real fix for problem of utime/stime values decreasing described in the thread: http://lkml.org/lkml/2009/11/3/522 Now cputime is accounted in the following way: - {u,s}time in task_struct are increased every time when the thread is interrupted by a tick (timer interrupt). - When a thread exits, its {u,s}time are added to signal->{u,s}time, after adjusted by task_times(). - When all threads in a thread_group exits, accumulated {u,s}time (and also c{u,s}time) in signal struct are added to c{u,s}time in signal struct of the group's parent. So {u,s}time in task struct are "raw" tick count, while {u,s}time and c{u,s}time in signal struct are "adjusted" values. And accounted values are used by: - task_times(), to get cputime of a thread: This function returns adjusted values that originates from raw {u,s}time and scaled by sum_exec_runtime that accounted by CFS. - thread_group_cputime(), to get cputime of a thread group: This function returns sum of all {u,s}time of living threads in the group, plus {u,s}time in the signal struct that is sum of adjusted cputimes of all exited threads belonged to the group. The problem is the return value of thread_group_cputime(), because it is mixed sum of "raw" value and "adjusted" value: group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time) This misbehavior can break {u,s}time monotonicity. Assume that if there is a thread that have raw values greater than adjusted values (e.g. interrupted by 1000Hz ticks 50 times but only runs 45ms) and if it exits, cputime will decrease (e.g. -5ms). To fix this, we could do: group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time) But task_times() contains hard divisions, so applying it for every thread should be avoided. This patch fixes the above problem in the following way: - Modify thread's exit (= __exit_signal()) not to use task_times(). It means {u,s}time in signal struct accumulates raw values instead of adjusted values. As the result it makes thread_group_cputime() to return pure sum of "raw" values. - Introduce a new function thread_group_times(task, utime, *stime) that converts "raw" values of thread_group_cputime() to "adjusted" values, in same calculation procedure as task_times(). - Modify group's exit (= wait_task_zombie()) to use this introduced thread_group_times(). It make c{u,s}time in signal struct to have adjusted values like before this patch. - Replace some thread_group_cputime() by thread_group_times(). This replacements are only applied where conveys the "adjusted" cputime to users, and where already uses task_times() near by it. (i.e. sys_times(), getrusage(), and /proc/<PID>/stat.) This patch have a positive side effect: - Before this patch, if a group contains many short-life threads (e.g. runs 0.9ms and not interrupted by ticks), the group's cputime could be invisible since thread's cputime was accumulated after adjusted: imagine adjustment function as adj(ticks, runtime), {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0. After this patch it will not happen because the adjustment is applied after accumulated. v2: - remove if()s, put new variables into signal_struct. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Spencer Candland <spencer@bluehost.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <4B162517.8040909@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 17:32:40 +01:00
Hidetoshi Seto	d99ca3b977	sched, cputime: Cleanups related to task_times() - Remove if({u,s}t)s because no one call it with NULL now. - Use cputime_{add,sub}(). - Add ifndef-endif for prev_{u,s}time since they are used only when !VIRT_CPU_ACCOUNTING. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Spencer Candland <spencer@bluehost.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <4B1624C7.7040302@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 17:32:39 +01:00
Rusty Russell	bdddd2963c	sched: Fix isolcpus boot option Anton Blanchard wrote: > We allocate and zero cpu_isolated_map after the isolcpus > __setup option has run. This means cpu_isolated_map always > ends up empty and if CPUMASK_OFFSTACK is enabled we write to a > cpumask that hasn't been allocated. I introduced this regression in `49557e6203` (sched: Fix boot crash by zalloc()ing most of the cpu masks). Use the bootmem allocator if they set isolcpus=, otherwise allocate and zero like normal. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: peterz@infradead.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> LKML-Reference: <200912021409.17013.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Anton Blanchard <anton@samba.org>	2009-12-02 10:27:16 +01:00
Avi Kivity	1786bf009f	core: Clean up user return notifers use of per_cpu Instead of using per_cpu(..., raw_smp_processor_id()), use __get_cpu_var(...). Signed-off-by: Avi Kivity <avi@redhat.com> LKML-Reference: <1259578491-4589-1-git-send-email-avi@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 10:22:59 +01:00
Tejun Heo	8592e6486a	sched: Revert `498657a478` `498657a478` incorrectly assumed that preempt wasn't disabled around context_switch() and thus was fixing imaginary problem. It also broke KVM because it depended on ->sched_in() to be called with irq enabled so that it can do smp calls from there. Revert the incorrect commit and add comment describing different contexts under with the two callbacks are invoked. Avi: spotted transposed in/out in the added comment. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Avi Kivity <avi@redhat.com> Cc: peterz@infradead.org Cc: efault@gmx.de Cc: rusty@rustcorp.com.au LKML-Reference: <1259726212-30259-2-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 09:55:33 +01:00
Randy Dunlap	595dd3d8bf	kmsg_dump: fix build for CONFIG_PRINTK=n kmsg_dump() fails to build when CONFIG_PRINTK=n; provide stubs for the kmsg_dump*() functions when CONFIG_PRINTK=n. kernel/printk.c: In function 'kmsg_dump': kernel/printk.c:1501: error: 'log_buf_len' undeclared (first use in this function) kernel/printk.c:1502: error: 'logged_chars' undeclared (first use in this function) kernel/printk.c:1506: error: 'log_buf' undeclared (first use in this function) Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Simon Kagstrom <simon.kagstrom@netinsight.net> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-12-02 08:44:33 +00:00
Kristian Høgsberg	ec70ccd806	perf: Don't free perf_mmap_data until work has been done In the CONFIG_PERF_USE_VMALLOC case, perf_mmap_data_free() only schedules the cleanup of the perf_mmap_data struct. In that case we have to wait until the work has been done before we free data. Signed-off-by: Kristian Høgsberg <krh@bitplanet.net> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1259697901-1747-1-git-send-email-krh@bitplanet.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-02 09:30:18 +01:00
David S. Miller	ff9c38bba3	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/mac80211/ht.c	2009-12-01 22:13:38 -08:00
Lai Jiangshan	7be077f563	trace_syscalls: Remove unused syscall_name_to_nr() After duplications are removed, syscall_name_to_nr() is unused. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D2A6.6060803@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:31 +01:00
Lai Jiangshan	3bbe84e9d3	trace_syscalls: Simplify syscall profile use only one prof_sysenter_enable() instead of prof_sysenter_enable_##sname() use only one prof_sysenter_disable() instead of prof_sysenter_disable_##sname() use only one prof_sysexit_enable() instead of prof_sysexit_enable_##sname() use only one prof_sysexit_disable() instead of prof_sysexit_disable_##sname() Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D2A1.8060304@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:30 +01:00
Lai Jiangshan	a1301da099	trace_syscalls: Remove duplicate init_enter_##sname() use only one init_syscall_trace instead of many init_enter_##sname()/init_exit_##sname() Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D29B.6090708@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:30 +01:00
Lai Jiangshan	c252f65793	trace_syscalls: Add syscall_nr field to struct syscall_metadata Add syscall_nr field to struct syscall_metadata, it helps us to get syscall number easier. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D293.6090800@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:29 +01:00
Lai Jiangshan	fcc19438dd	trace_syscalls: Remove enter_id exit_id use ->enter_event->id instead of ->enter_id use ->exit_event->id instead of ->exit_id Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D288.7030001@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:29 +01:00
Lai Jiangshan	31c16b1334	trace_syscalls: Set event_enter_##sname->data to its metadata Set event_enter_##sname->data to its metadata, it makes codes simpler. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D282.7050709@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:28 +01:00
Lai Jiangshan	bf56a4ea9f	trace_syscalls: Remove unused event_syscall_enter and event_syscall_exit fix event_enter_##sname->event fix event_exit_##sname->event remove unused event_syscall_enter and event_syscall_exit Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B14D278.4090209@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 17:33:28 +01:00
David Howells	f13a48bd79	SLOW_WORK: Move slow_work's proc file to debugfs Move slow_work's debugging proc file to debugfs. Signed-off-by: David Howells <dhowells@redhat.com> Requested-and-acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-01 08:20:31 -08:00
David Howells	fa1dae4906	SLOW_WORK: Fix the CONFIG_MODULES=n case Commits `3d7a641` ("SLOW_WORK: Wait for outstanding work items belonging to a module to clear") introduced some code to make sure that all of a module's slow-work items were complete before that module was removed, and commit `3bde31a` ("SLOW_WORK: Allow a requeueable work item to sleep till the thread is needed") further extended that, breaking it in the process if CONFIG_MODULES=n: CC kernel/slow-work.o kernel/slow-work.c: In function 'slow_work_execute': kernel/slow-work.c:313: error: 'slow_work_thread_processing' undeclared (first use in this function) kernel/slow-work.c:313: error: (Each undeclared identifier is reported only once kernel/slow-work.c:313: error: for each function it appears in.) kernel/slow-work.c: In function 'slow_work_wait_for_items': kernel/slow-work.c:950: error: 'slow_work_unreg_sync_lock' undeclared (first use in this function) kernel/slow-work.c:951: error: 'slow_work_unreg_wq' undeclared (first use in this function) kernel/slow-work.c:961: error: 'slow_work_unreg_work_item' undeclared (first use in this function) kernel/slow-work.c:974: error: 'slow_work_unreg_module' undeclared (first use in this function) kernel/slow-work.c:977: error: 'slow_work_thread_processing' undeclared (first use in this function) make[1]: *** [kernel/slow-work.o] Error 1 Fix this by: (1) Extracting the bits of slow_work_execute() that are contingent on CONFIG_MODULES, and the bits that should be, into inline functions and placing them into the #ifdef'd section that defines the relevant variables and adding stubs for moduleless kernels. This allows the removal of some #ifdefs. (2) #ifdef'ing out the contents of slow_work_wait_for_items() in moduleless kernels. The four functions related to handling module unloading synchronisation (and their associated variables) could be offloaded into a separate .c file, but each function is only used once and three of them are tiny, so doing so would prevent them from being inlined. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-12-01 07:35:11 -08:00
Xiao Guangrong	59d069eb5a	perf_event: Initialize data.period in perf_swevent_hrtimer() In current code in perf_swevent_hrtimer(), data.period is not initialized, The result is obvious wrong: # ./perf record -f -e cpu-clock make # ./perf report # Samples: 1740 # # Overhead Command ...... # ........ ........ .......................................... # 1025422183050275328.00% sh libc-2.9.90.so ... 1025422183050275328.00% perl libperl.so ... 1025422168240043264.00% perl [kernel] ... 1025422030011210752.00% perl [kernel] ... Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <4B14E220.2050107@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 11:19:07 +01:00
Masami Hiramatsu	ba8665d7dd	trace_kprobes: Fix a memory leak bug and check kstrdup() return value Fix a memory leak case in create_trace_probe(). When an argument is too long (> MAX_ARGSTR_LEN), it just jumps to error path. In that case tp->args[i].name is not released. This also fixes a bug to check kstrdup()'s return value. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091201001919.10235.56455.stgit@harusame> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-12-01 08:19:59 +01:00
Simon Kagstrom	456b565cc5	core: Add kernel message dumper to call on oopses and panics The core functionality is implemented as per Linus suggestion from http://lists.infradead.org/pipermail/linux-mtd/2009-October/027620.html (with the kmsg_dump implementation by Linus). A struct kmsg_dumper has been added which contains a callback to dump the kernel log buffers on crashes. The kmsg_dump function gets called from oops_exit() and panic() and invokes this callbacks with the crash reason. [dwmw2: Fix log_end handling] Signed-off-by: Simon Kagstrom <simon.kagstrom@netinsight.net> Reviewed-by: Anders Grafstrom <anders.grafstrom@netinsight.net> Reviewed-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2009-11-30 12:01:49 +00:00
Avi Kivity	8e7cac7980	core: Fix user return notifier on fork() fork() clones all thread_info flags, including TIF_USER_RETURN_NOTIFY; if the new task is first scheduled on a cpu which doesn't have user return notifiers set, this causes user return notifiers to trigger without any way of clearing itself. This is easy to trigger with a forky workload on the host in parallel with kvm, resulting in a cpu in an endless loop on the verge of returning to userspace. Fix by dropping the TIF_USER_RETURN_NOTIFY immediately after fork. Signed-off-by: Avi Kivity <avi@redhat.com> LKML-Reference: <1259505288-16559-1-git-send-email-avi@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-29 22:03:04 +01:00
Lai Jiangshan	52a11f3549	trace_kprobes: Don't output zero offset "symbol_name+0" is not so friendly. It makes the output longer. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0CEBCB.7080309@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:43:05 +01:00
Lai Jiangshan	3d9b2e1ddf	trace_kprobes: Always show group name Sometimes the group name is not "kprobes", It'll be better if we can read it from tracing/kprobe_events. # echo 'r:laijs/vfs_read vfs_read %ax' > kprobe_events # cat kprobe_events r:laijs/vfs_read vfs_read %ax=%ax Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0CEBAF.6000104@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:43:04 +01:00
Lai Jiangshan	abab9d37d2	trace_kprobes: Fix memory leak tp->nr_args is not set before we "goto error", it causes memory leak for free_trace_probe() use tp->nr_args to free memory of args. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0CEB95.2060107@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:43:04 +01:00
Lai Jiangshan	0f1ef51d24	trace_syscalls: Add syscall nr field Field syscall number is missed in syscall_enter_define_fields()/ syscall_exit_define_fields(). Syscall number is also needed for event filter or other users. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <4B0E330D.1070206@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:24:19 +01:00
Frederic Weisbecker	dd1853c3f4	hw-breakpoints: Use struct perf_event_attr to define kernel breakpoints Kernel breakpoints are created using functions in which we pass breakpoint parameters as individual variables: address, length and type. Although it fits well for x86, this just does not scale across architectures that may support this api later as these may have more or different needs. Pass in a perf_event_attr structure instead because it is meant to evolve as much as possible into a generic hardware breakpoint parameter structure. Reported-by: K.Prasad <prasad@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1259294154-5197-2-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:22:59 +01:00
Frederic Weisbecker	5fa10b28e5	hw-breakpoints: Use struct perf_event_attr to define user breakpoints In-kernel user breakpoints are created using functions in which we pass breakpoint parameters as individual variables: address, length and type. Although it fits well for x86, this just does not scale across archictectures that may support this api later as these may have more or different needs. Pass in a perf_event_attr structure instead because it is meant to evolve as much as possible into a generic hardware breakpoint parameter structure. Reported-by: K.Prasad <prasad@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1259294154-5197-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:22:58 +01:00
Anton Blanchard	e5af022616	softlockup: Fix hung_task_check_count sysctl I'm seeing spikes of up to 0.5ms in khungtaskd on a large machine. To reduce this source of jitter I tried setting hung_task_check_count to 0: # echo 0 > /proc/sys/kernel/hung_task_check_count which didn't have the intended response. Change to a post increment of max_count, so a value of 0 means check 0 tasks. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: msb@google.com LKML-Reference: <20091127022820.GU32182@kryten> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-27 06:21:57 +01:00
Stephane Eranian	b2e74a265d	perf_events: Fix read() bogus counts when in error state When a pinned group cannot be scheduled it goes into error state. Normally a group cannot go out of error state without being explicitly re-enabled or disabled. There was a bug in per-thread mode, whereby upon termination of the thread, the group would transition from error to off leading to bogus counts and timing information returned by read(). Fix it by clearing the error state. Signed-off-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: perfmon2-devel@lists.sourceforge.net LKML-Reference: <4b0eb9ce.0508d00a.573b.ffffeab6@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 18:49:59 +01:00
Hidetoshi Seto	b7b20df91d	sched, time: Define nsecs_to_jiffies() Use of msecs_to_jiffies() for nsecs_to_cputime() have some problems: - The type of msecs_to_jiffies()'s argument is unsigned int, so it cannot convert msecs greater than UINT_MAX = about 49.7 days. - msecs_to_jiffies() returns MAX_JIFFY_OFFSET if MSB of argument is set, assuming that input was negative value. So it cannot convert msecs greater than INT_MAX = about 24.8 days too. This patch defines a new function nsecs_to_jiffies() that can deal greater values, and that can deal all incoming values as unsigned. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Amrico Wang <xiyou.wangcong@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <johnstul@linux.vnet.ibm.com> LKML-Reference: <4B0E16E7.5070307@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 12:59:20 +01:00
Hidetoshi Seto	d5b7c78e97	sched: Remove task_{u,s,g}time() Now all task_{u,s}time() pairs are replaced by task_times(). And task_gtime() is too simple to be an inline function. Cleanup them all. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> LKML-Reference: <4B0E16D1.70902@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 12:59:20 +01:00
Hidetoshi Seto	d180c5bcce	sched: Introduce task_times() to replace task_{u,s}time() pair Functions task_{u,s}time() are called in pair in almost all cases. However task_stime() is implemented to call task_utime() from its inside, so such paired calls run task_utime() twice. It means we do heavy divisions (div_u64 + do_div) twice to get utime and stime which can be obtained at same time by one set of divisions. This patch introduces a function task_times(tsk, utime, *stime) to retrieve utime and stime at once in better, optimized way. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> LKML-Reference: <4B0E16AE.906@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 12:59:19 +01:00
Masami Hiramatsu	ba005e1f41	tracepoint: Add signal loss events Add signal_overflow_fail and signal_lose_info tracepoints for signal-lost events. Changes in v3: - Add docbook style comments Changes in v2: - Use siginfo string macro Suggested-by: Roland McGrath <roland@redhat.com> Reviewed-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <20091124215658.30449.9934.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:55:38 +01:00
Masami Hiramatsu	f9d4257e01	tracepoint: Add signal deliver event Add a tracepoint where a process gets a signal. This tracepoint shows signal-number, sa-handler and sa-flag. Changes in v3: - Add docbook style comments Changes in v2: - Add siginfo argument - Fix comment Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Reviewed-by: Jason Baron <jbaron@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <20091124215651.30449.20926.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:55:38 +01:00
Masami Hiramatsu	d1eb650ff4	tracepoint: Move signal sending tracepoint to events/signal.h Move signal sending event to events/signal.h. This patch also renames sched_signal_send event to signal_generate. Changes in v4: - Fix a typo of task_struct pointer. Changes in v3: - Add docbook style comments Changes in v2: - Add siginfo argument - Add siginfo storing macro Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Reviewed-by: Jason Baron <jbaron@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: systemtap <systemtap@sources.redhat.com> Cc: DLE <dle-develop@lists.sourceforge.net> Cc: Oleg Nesterov <oleg@redhat.com> LKML-Reference: <20091124215645.30449.60208.stgit@dhcp-100-2-132.bos.redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:55:37 +01:00
Ingo Molnar	16bc67edeb	Merge branch 'sched/urgent' into sched/core Merge reason: Pick up fixes that did not make it into .32.0 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:50:42 +01:00
Frederic Weisbecker	80bbf6b641	hw-breakpoints: Fix unused function in off-case bp_perf_event_destroy() is unused in its off-case version, let's remove it to fix the following warning reported by Stephen Rothwell in linux-next: kernel/perf_event.c:4306: warning: 'bp_perf_event_destroy' defined but not used Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1259180453-5813-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:40:51 +01:00
Mike Travis	feae3203d7	timers, init: Limit the number of per cpu calibration bootup messages Limit the number of per cpu calibration messages by only printing out results for the first cpu to boot. Also, don't print "CPUx is down" as this is expected, and we don't need 4096 reminders... ;-) Signed-off-by: Mike Travis <travis@sgi.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091118002219.889552000@alcatraz.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:18:42 +01:00
Mike Travis	f6630114d9	sched: Limit the number of scheduler debug messages Remove the verbose scheduler debug messages unless kernel parameter "sched_debug" set. /proc/sched_debug unchanged. Signed-off-by: Mike Travis <travis@sgi.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091118002221.489305000@alcatraz.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 10:17:30 +01:00
Andrew Morton	11e6635763	kernel/hw_breakpoint.c: Fix local/global shadowing If the new percpu tree is combined with the perf events tree the following new warning triggers: kernel/hw_breakpoint.c: In function 'toggle_bp_task_slot': kernel/hw_breakpoint.c:151: warning: 'task_bp_pinned' is used uninitialized in this function Because it's not valid anymore to define a local variable and a percpu variable (even if it's file scope local) with the same name. Rename the local variable to resolve this. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <200911260701.nAQ71owx016356@imap1.linux-foundation.org> [ v2: added changelog ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 09:34:04 +01:00
Frederic Weisbecker	605bfaee90	hw-breakpoints: Simplify error handling in breakpoint creation requests This simplifies the error handling when we create a breakpoint. We don't need to check the NULL return value corner case anymore since we have improved perf_event_create_kernel_counter() to always return an error code in the failure case. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1259210142-5714-3-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 09:29:21 +01:00
Frederic Weisbecker	c6567f642e	hw-breakpoints: Improve in-kernel event creation error granularity In fail case, perf_event_create_kernel_counter() returns NULL instead of an error, which doesn't help us to inform the user about the origin of the problem from the outer most callers. Often we can just return -EINVAL, which doesn't help anyone when it's eventually about a memory allocation failure. Then, this patch makes perf_event_create_kernel_counter() always return a detailed error code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1259210142-5714-2-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 09:29:21 +01:00
Frederic Weisbecker	d99be40aff	ksym_tracer: Fix breakpoint removal after modification The error path of a breakpoint modification is broken in the ksym tracer. A modified breakpoint hlist node is immediately released after its removal. Also we leak a breakpoint in this case. Fix the path. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1259210142-5714-1-git-send-regression-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-26 09:29:20 +01:00
Steven Rostedt	7ac0743404	ring-buffer-benchmark: Add parameters to set produce/consumer priorities Running the ring-buffer-benchmark's threads at the lowest priority may work well for keeping it in the background, but it is not appropriate for the benchmarks. This patch adds 4 parameters to the module: consumer_fifo consumer_nice producer_fifo producer_nice By default the consumer and producer still run at nice +19. If the _fifo options are set, they will override the _nice values. modprobe ring_buffer_benchmark consumer_nice=0 producer_fifo=10 The above will set the consumer thread to a nice value of 0, and the producer thread to a RT SCHED_FIFO priority of 10. Note, this patch also fixes a bug where calling set_user_nice on the consumer thread would oops the kernel when the parameter "disable_reader" is set. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-25 14:14:15 -05:00
Shmulik Ladkani	93335a2155	sched.c: Call debug_show_all_locks() when dumping all tasks In commit v2.6.21-691-g39bc89f ("make SysRq-T show all tasks again") the interface of show_state_filter() was changed: zero valued 'state_filter' specifies "dump all tasks" (instead of -1). However, the condition for calling debug_show_all_locks() ("show locks if all tasks are dumped") was not updated accordingly. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: peterz@infradead.org LKML-Reference: <4b0d2fe4.0ab6660a.6437.3cfc@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-25 14:26:52 +01:00
Tom Zanussi	99df5a6a21	trace/syscalls: Change ret param in struct syscall_trace_exit to long Commit `ee949a86b3` ("tracing/syscalls: Use long for syscall ret format and field definitions") changed the syscall exit return type to long, but forgot to change it in the struct. Signed-off-by: Tom Zanussi <tzanussi@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1259133299-23594-3-git-send-email-tzanussi@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-25 09:06:10 +01:00
Frederic Weisbecker	fe61267227	perf_events: Fix bad software/trace event recursion counting Commit `4ed7c92d68` (perf_events: Undo some recursion damage) has introduced a bad reference counting of the recursion context. putting the context behaves like getting it, dropping every software/trace events after the first one in a context. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1259091502-5171-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-24 21:34:00 +01:00
Tim Blechmann	710390d90f	sched: Optimize branch hint in context_switch() Branch hint profiling on my nehalem machine showed over 90% incorrect branch hints: 10420275 170645395 94 context_switch sched.c 3043 10408421 171098521 94 context_switch sched.c 3050 Signed-off-by: Tim Blechmann <tim@klingt.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0BBB9F.6080304@klingt.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-24 12:18:42 +01:00
Tim Blechmann	36ace27e3e	sched: Optimize branch hint in pick_next_task_fair() Branch hint profiling on my nehalem machine showed 90% incorrect branch hints: 15728471 158903754 90 pick_next_task_fair sched_fair.c 1555 Signed-off-by: Tim Blechmann <tim@klingt.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4B0BBBB1.2050100@klingt.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-24 12:18:12 +01:00
Stephane Eranian	184d3da8ef	perf_events: Fix bogus copy_to_user() in perf_event_read_group() When using an event group, the value and id for non leaders events were wrong due to invalid offset into the outgoing buffer. Signed-off-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: paulus@samba.org Cc: perfmon2-devel@lists.sourceforge.net LKML-Reference: <4b0b71e1.0508d00a.075e.ffff84a3@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-24 08:55:27 +01:00
Serge E. Hallyn	b3a222e52e	remove CONFIG_SECURITY_FILE_CAPABILITIES compile option As far as I know, all distros currently ship kernels with default CONFIG_SECURITY_FILE_CAPABILITIES=y. Since having the option on leaves a 'no_file_caps' option to boot without file capabilities, the main reason to keep the option is that turning it off saves you (on my s390x partition) 5k. In particular, vmlinux sizes came to: without patch fscaps=n: 53598392 without patch fscaps=y: 53603406 with this patch applied: 53603342 with the security-next tree. Against this we must weigh the fact that there is no simple way for userspace to figure out whether file capabilities are supported, while things like per-process securebits, capability bounding sets, and adding bits to pI if CAP_SETPCAP is in pE are not supported with SECURITY_FILE_CAPABILITIES=n, leaving a bit of a problem for applications wanting to know whether they can use them and/or why something failed. It also adds another subtly different set of semantics which we must maintain at the risk of severe security regressions. So this patch removes the SECURITY_FILE_CAPABILITIES compile option. It drops the kernel size by about 50k over the stock SECURITY_FILE_CAPABILITIES=y kernel, by removing the cap_limit_ptraced_target() function. Changelog: Nov 20: remove cap_limit_ptraced_target() as it's logic was ifndef'ed. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Acked-by: Andrew G. Morgan" <morgan@kernel.org> Signed-off-by: James Morris <jmorris@namei.org>	2009-11-24 15:06:47 +11:00
Andrew G. Morgan	c4a5af54c8	Silence the existing API for capability version compatibility check. When libcap, or other libraries attempt to confirm/determine the supported capability version magic, they generally supply a NULL dataptr to capget(). In this case, while returning the supported/preferred magic (via a modified header content), the return code of this system call may be 0, -EINVAL, or -EFAULT. No libcap code depends on the previous -EINVAL etc. return code, and all of the above three return codes can accompany a valid (successful) attempt to determine the requested magic value. This patch cleans up the system call to return 0, if the call is successfully being used to determine the supported/preferred capability magic value. Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Acked-by: Steve Grubb <sgrubb@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-11-24 08:53:29 +11:00
Jan Blunck	429947248f	sched_feat_write(): Update ppos instead of file->f_pos sched_feat_write() should update ppos instead of file->f_pos. (This reduces some BKL dependencies of this code.) Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: jkacur@redhat.com Cc: Arnd Bergmann <arnd@arndb.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jamie Lokier <jamie@shareable.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> LKML-Reference: <1258735245-25826-8-git-send-email-jblunck@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 19:38:03 +01:00
Frederic Weisbecker	f5ffe02e50	perf: Add kernel side syscall events support for breakpoints Add the remaining necessary bits to support breakpoints created through perf syscall. We don't use the software counter interface as: - We don't need to check against recursion, this is already done in hardware breakpoints arch level. - We already know the perf event we are dealing with when the event is to be committed. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1258987355-8751-3-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 18:18:31 +01:00
Frederic Weisbecker	fdf6bc9522	hw-breakpoints: Check the breakpoint params from perf tools Perf tools create perf events as disabled in the beginning. Breakpoints are then considered like ptrace temporary breakpoints, only meant to reserve a breakpoint slot until we get all the necessary informations from the user. In this case, we don't check the address that is breakpointed as it is NULL in the ptrace case. But perf tools don't have the same purpose, events are created disabled to wait for all events to be created before enabling all of them. We want to check the breakpoint parameters in this case. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1258987355-8751-2-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 18:18:30 +01:00
K.Prasad	ba6909b719	hw-breakpoint: Attribute authorship of hw-breakpoint related files Attribute authorship to developers of hw-breakpoint related files. Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123154713.GA5593@in.ibm.com> [ v2: moved it to latest -tip ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 18:18:29 +01:00
Peter Zijlstra	acd1d7c1f8	perf_events: Restore sanity to scaling land It is quite possible to call update_event_times() on a context that isn't actually running and thereby confuse the thing. perf stat was reporting !100% scale values for software counters (`2e2af50b` perf_events: Disable events when we detach them, solved the worst of that, but there was still some left). The thing that happens is that because we are not self-reaping (we have a caring parent) there is a time between the last schedule (out) and having do_exit() called which will detach the events. This period would be accounted as enabled,!running because the event->state==INACTIVE, even though !event->ctx->is_active. Similar issues could have been observed by calling read() on a event while the attached task was not scheduled in. Solve this by teaching update_event_times() about ctx->is_active. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1258984836.4531.480.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 15:22:19 +01:00
Peter Zijlstra	4ed7c92d68	perf_events: Undo some recursion damage Make perf_swevent_get_recursion_context return a context number and disable preemption. This could be used to remove the IRQ disable from the trace bit and index the per-cpu buffer with. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091123103819.993226816@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:57 +01:00
Peter Zijlstra	f67218c3e9	perf_events: Fix __perf_event_exit_task() vs. update_event_times() locking Move the update_event_times() call in __perf_event_exit_task() into list_del_event() because that holds the proper lock (ctx->lock) and seems a more natural place to do the last time update. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123103819.842455480@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:57 +01:00
Peter Zijlstra	5e942bb333	perf_events: Update the context time on exit It appeared we did call update_event_times() on exit, but we failed to update the context time, which renders the former moot. Locking is a bit iffy, we call update_event_times under ctx->mutex instead of ctx->lock - the next patch fixes this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123103819.764207355@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:56 +01:00
Peter Zijlstra	2e2af50b1f	perf_events: Disable events when we detach them If we leave the event in STATE_INACTIVE, any read of the event after the detach will increase the running count but not the enabled count and cause funny scaling artefacts. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123103819.689055515@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:56 +01:00
Peter Zijlstra	6c2bfcbe58	perf_events: Fix style nits Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20091123103819.613427378@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:55 +01:00
Peter Zijlstra	a66a3052e2	perf_events: Undo copy/paste damage We had two almost identical functions, avoid the duplication. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <20091123103819.537537928@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:49:55 +01:00
Ingo Molnar	a4234bfcf4	perf_events: Optimize the swcounter hotpath The structure init creates a bit memcpy, which shows up big time in perf annotate output: : ffffffff810a859d <__perf_sw_event>: 1.68 : ffffffff810a859d: 55 push %rbp 1.69 : ffffffff810a859e: 41 89 fa mov %edi,%r10d 0.01 : ffffffff810a85a1: 49 89 c9 mov %rcx,%r9 0.00 : ffffffff810a85a4: 31 c0 xor %eax,%eax 1.71 : ffffffff810a85a6: b9 16 00 00 00 mov $0x16,%ecx 0.00 : ffffffff810a85ab: 48 89 e5 mov %rsp,%rbp 0.00 : ffffffff810a85ae: 48 83 ec 60 sub $0x60,%rsp 1.52 : ffffffff810a85b2: 48 8d 7d a0 lea -0x60(%rbp),%rdi 85.20 : ffffffff810a85b6: f3 ab rep stos %eax,%es:(%rdi) None of the callees depends on the structure being pre-initialized, so only initialize ->addr. This gets rid of the memcpy overhead. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:48:27 +01:00
Ingo Molnar	457dc928f5	tracing, function tracer: Clean up strstrip() usage Clean up strstrip() usage - which also addresses this build warning: kernel/trace/ftrace.c: In function 'ftrace_pid_write': kernel/trace/ftrace.c:3004: warning: ignoring return value of 'strstrip', declared with attribute warn_unused_result Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 11:04:07 +01:00
Ingo Molnar	6e3d8330ae	perf events: Do not generate function trace entries in perf code Decreases perf overhead when function tracing is enabled, by about 50%. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 10:19:20 +01:00
Ingo Molnar	98e4833ba3	ring-buffer benchmark: Run producer/consumer threads at nice +19 The ring-buffer benchmark threads run on nice 0 by default, using up a lot of CPU time and slowing down the system: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1024 root 20 0 0 0 0 D 95.3 0.0 4:01.67 rb_producer 1023 root 20 0 0 0 0 R 93.5 0.0 2:54.33 rb_consumer 21569 mingo 40 0 14852 1048 772 R 3.6 0.1 0:00.05 top 1 root 40 0 4080 928 668 S 0.0 0.0 0:23.98 init Renice them to +19 to make them less intrusive. Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-23 08:03:09 +01:00
Paul E. McKenney	6ebb237bec	rcu: Re-arrange code to reduce #ifdef pain Remove #ifdefs from kernel/rcupdate.c and include/linux/rcupdate.h by moving code to include/linux/rcutiny.h, include/linux/rcutree.h, and kernel/rcutree.c. Also remove some definitions that are no longer used. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1258908830885-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 18:58:16 +01:00
Paul E. McKenney	9f680ab414	rcu: Eliminate unneeded function wrapping The functions rcu_init() is a wrapper for __rcu_init(), and also sets up the CPU-hotplug notifier for rcu_barrier_cpu_hotplug(). But TINY_RCU doesn't need CPU-hotplug notification, and the rcu_barrier_cpu_hotplug() is a simple wrapper for rcu_cpu_notify(). So push rcu_init() out to kernel/rcutree.c and kernel/rcutiny.c and get rid of the wrapper function rcu_barrier_cpu_hotplug(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12589088302320-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 18:58:16 +01:00
Paul E. McKenney	b668c9cf3e	rcu: Fix grace-period-stall bug on large systems with CPU hotplug When the last CPU of a given leaf rcu_node structure goes offline, all of the tasks queued on that leaf rcu_node structure (due to having blocked in their current RCU read-side critical sections) are requeued onto the root rcu_node structure. This requeuing is carried out by rcu_preempt_offline_tasks(). However, it is possible that these queued tasks are the only thing preventing the leaf rcu_node structure from reporting a quiescent state up the rcu_node hierarchy. Unfortunately, the old code would fail to do this reporting, resulting in a grace-period stall given the following sequence of events: 1. Kernel built for more than 32 CPUs on 32-bit systems or for more than 64 CPUs on 64-bit systems, so that there is more than one rcu_node structure. (Or CONFIG_RCU_FANOUT is artificially set to a number smaller than CONFIG_NR_CPUS.) 2. The kernel is built with CONFIG_TREE_PREEMPT_RCU. 3. A task running on a CPU associated with a given leaf rcu_node structure blocks while in an RCU read-side critical section -and- that CPU has not yet passed through a quiescent state for the current RCU grace period. This will cause the task to be queued on the leaf rcu_node's blocked_tasks[] array, in particular, on the element of this array corresponding to the current grace period. 4. Each of the remaining CPUs corresponding to this same leaf rcu_node structure pass through a quiescent state. However, the task is still in its RCU read-side critical section, so these quiescent states cannot be reported further up the rcu_node hierarchy. Nevertheless, all bits in the leaf rcu_node structure's ->qsmask field are now zero. 5. Each of the remaining CPUs go offline. (The events in step #4 and #5 can happen in any order as long as each CPU passes through a quiescent state before going offline.) 6. When the last CPU goes offline, __rcu_offline_cpu() will invoke rcu_preempt_offline_tasks(), which will move the task to the root rcu_node structure, but without reporting a quiescent state up the rcu_node hierarchy (and this failure to report a quiescent state is the bug). But because this leaf rcu_node structure's ->qsmask field is already zero and its ->block_tasks[] entries are all empty, force_quiescent_state() will skip this rcu_node structure. Therefore, grace periods are now hung. This patch abstracts some code out of rcu_read_unlock_special(), calling the result task_quiet() by analogy with cpu_quiet(), and invokes task_quiet() from both rcu_read_lock_special() and __rcu_offline_cpu(). Invoking task_quiet() from __rcu_offline_cpu() reports the quiescent state up the rcu_node hierarchy, fixing the bug. This ends up requiring a separate lock_class_key per level of the rcu_node hierarchy, which this patch also provides. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12589088301770-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 18:58:15 +01:00
Ingo Molnar	645e8cc0c9	perf_events: Fix modular build Fix: ERROR: "perf_swevent_put_recursion_context" [fs/ext4/ext4.ko] undefined! ERROR: "perf_swevent_get_recursion_context" [fs/ext4/ext4.ko] undefined! Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <1258864015-10579-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 12:21:33 +01:00
Márton Németh	96b02d78a7	perf_event: Remove redundant zero fill The buffer is first zeroed out by memset(). Then strncpy() is used to fill the content. The strncpy() function also pads the string till the end of the specified length, which is redundant. The strncpy() does not ensures that the string will be properly closed with 0. Use strlcpy() instead. The semantic match that finds this kind of pattern is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression buffer; expression size; expression str; @@ memset(buffer, 0, size); ... - strncpy( + strlcpy( buffer, str, sizeof(buffer) ); @@ expression buffer; expression size; expression str; @@ memset(&buffer, 0, size); ... - strncpy( + strlcpy( &buffer, str, sizeof(buffer)); @@ expression buffer; identifier field; expression size; expression str; @@ memset(buffer, 0, size); ... - strncpy( + strlcpy( buffer->field, str, sizeof(buffer->field) ); @@ expression buffer; identifier field; expression size; expression str; @@ memset(&buffer, 0, size); ... - strncpy( + strlcpy( buffer.field, str, sizeof(buffer.field)); // </smpl> On strncpy() vs strlcpy() see http://www.gratisoft.us/todd/papers/strlcpy.html . Signed-off-by: Márton Németh <nm127@freemail.hu> Cc: Julia Lawall <julia@diku.dk> Cc: cocci@diku.dk Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <4B086547.5040100@freemail.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 09:49:26 +01:00
Frederic Weisbecker	b3a75542d3	hw-breakpoints: Remove x86 specific headers from core file Remove asm/processor.h and asm/debugreg.h as these headers are not used anymore in the hw-breakpoints core file. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com> LKML-Reference: <1258863695-10464-3-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 09:03:43 +01:00
Frederic Weisbecker	28889bf9e2	tracing: Forget about the NMI buffer for syscall events We are never in an NMI context when we commit a syscall trace to perf. So just forget about the nmi buffer there. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <1258863695-10464-2-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 09:03:42 +01:00
Frederic Weisbecker	ce71b9df88	tracing: Use the perf recursion protection from trace event When we commit a trace to perf, we first check if we are recursing in the same buffer so that we don't mess-up the buffer with a recursing trace. But later on, we do the same check from perf to avoid commit recursion. The recursion check is desired early before we touch the buffer but we want to do this check only once. Then export the recursion protection from perf and use it from the trace events before submitting a trace. v2: Put appropriate Reported-by tag Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <1258864015-10579-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-22 09:03:42 +01:00
Stephane Eranian	8904b18046	perf_events: Fix default watermark calculation This patch fixes the default watermark value for the sampling buffer. With the existing calculation (watermark = max(PAGE_SIZE, max_size / 2)), no notification was ever received when the buffer was exactly 1 page. This was because you would never cross the threshold (there is no partial samples). In certain configuration, there was no possibilty detecting the problem because there was not enough space left to store the LOST record.In fact, there may be a more generic problem here. The kernel should ensure that there is alaways enough space to store one LOST record. This patch sets the default watermark to half the buffer size. With such limit, we are guaranteed to get a notification even with a single page buffer assuming no sample is bigger than a page. Signed-off-by: Stephane Eranian <eranian@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.344964101@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1256302576-6169-1-git-send-email-eranian@gmail.com>	2009-11-21 14:11:41 +01:00
Peter Zijlstra	6f10581aea	perf: Fix locking for PERF_FORMAT_GROUP We should hold event->child_mutex when iterating the inherited counters, we should hold ctx->mutex when iterating siblings. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.251030114@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:40 +01:00
Peter Zijlstra	59ed446f79	perf: Fix event scaling for inherited counters Properly account the full hierarchy of counters for both the count (we already did so) and the scale times (new). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.153379276@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:40 +01:00
Peter Zijlstra	2b8988c9f7	perf: Fix time locking Most sites updating ctx->time and event times do so under ctx->lock, make sure they all do. This was made possible by removing the __perf_event_read() call from __perf_event_sync_stat(), which already had this lock taken. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.102316434@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:39 +01:00
Peter Zijlstra	58e5ad1de3	perf: Simplify __perf_event_read cpuctx is always active, task context is always active for current the previous condition verifies that if its a task context its for current, hence we can assume ctx->is_active. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212509.000272254@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:39 +01:00
Peter Zijlstra	3dbebf15c5	perf: Simplify __perf_event_sync_stat Removes constraints from __perf_event_read() by leaving it with a single callsite; this callsite had ctx->lock held, the other one does not. Removes some superfluous code from __perf_event_sync_stat(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.918544317@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:39 +01:00
Peter Zijlstra	f6f8378522	perf: Optimize __perf_event_read() Both callers actually have IRQs disabled, no need doing so again. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.863685796@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:38 +01:00
Peter Zijlstra	02ffdbc866	perf: Optimize perf_event_task_sched_out Remove an update_context_time() call from the perf_event_task_sched_out() path and into the branch its needed. The call was both superfluous, because __perf_event_sched_out() already does it, and wrong, because it was done without holding ctx->lock. Place it in perf_event_sync_stat(), which is the only place it is needed and which does already hold ctx->lock. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.779516394@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:38 +01:00
Peter Zijlstra	abf4868b85	perf: Fix PERF_FORMAT_GROUP scale info As Corey reported, the total_enabled and total_running times could occasionally be 0, even though there were events counted. It turns out this is because we record the times before reading the counter while the latter updates the times. This patch corrects that. While looking at this code I found that there is a lot of locking iffyness around, the following patches correct most of that. Reported-by: Corey Ashford <cjashfor@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.685559857@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:37 +01:00
Peter Zijlstra	f6d9dd237d	perf: Optimize perf_event_mmap_ctx() Remove a rcu_read_{,un}lock() pair and a few conditionals. We can remove the rcu_read_lock() by increasing the scope of one in the calling function. We can do away with the system_state check if the machine still boots after this patch (seems to be the case). We can do away with the list_empty() check because the bare list_for_each_entry_rcu() reduces to that now that we've removed everything else. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.606459548@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:37 +01:00
Peter Zijlstra	f6595f3a96	perf: Optimize perf_event_comm_ctx() Remove a rcu_read_{,un}lock() pair and a few conditionals. We can remove the rcu_read_lock() by increasing the scope of one in the calling function. We can do away with the system_state check if the machine still boots after this patch (seems to be the case). We can do away with the list_empty() check because the bare list_for_each_entry_rcu() reduces to that now that we've removed everything else. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.527608793@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:36 +01:00
Peter Zijlstra	d6ff86cfb5	perf: Optimize perf_event_task_ctx() Remove a rcu_read_{,un}lock() pair and a few conditionals. We can remove the rcu_read_lock() by increasing the scope of one in the calling function. We can do away with the system_state check if the machine still boots after this patch (seems to be the case). We can do away with the list_empty() check because the bare list_for_each_entry_rcu() reduces to that now that we've removed everything else. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.452227115@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:36 +01:00
Peter Zijlstra	8152018387	perf: Optimize perf_swevent_ctx_event() Remove a rcu_read_{,un}lock() pair and a few conditionals. We can remove the rcu_read_lock() by increasing the scope of one in the calling function. We can do away with the system_state check if the machine still boots after this patch (seems to be the case). We can do away with the list_empty() check because the bare list_for_each_entry_rcu() reduces to that now that we've removed everything else. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.378188589@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:35 +01:00
Peter Zijlstra	0cff784ae4	perf: Optimize some swcounter attr.sample_period==1 paths Avoid the rather expensive perf_swevent_set_period() if we know we have to sample every single event anyway. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.299508332@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:35 +01:00
Peter Zijlstra	453f19eea7	perf: Allow for custom overflow handlers in-kernel perf users might wish to have custom actions on the sample interrupt. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <20091120212508.222339539@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:11:35 +01:00
Ingo Molnar	96200591a3	Merge branch 'tracing/hw-breakpoints' into perf/core Conflicts: arch/x86/kernel/kprobes.c kernel/trace/Makefile Merge reason: hw-breakpoints perf integration is looking good in testing and in reviews, plus conflicts are mounting up - so merge & resolve. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-21 14:07:23 +01:00
Thomas Gleixner	34769945f7	genirq: Fix spurious irq seqfile conversion single_open data argument must be PDE(inode)->data instead of NULL otherwise seq_file->private is always NULL and we always read the spurious data of irq 0. Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-20 11:55:26 +01:00
David Howells	3bde31a4ac	SLOW_WORK: Allow a requeueable work item to sleep till the thread is needed Add a function to allow a requeueable work item to sleep till the thread processing it is needed by the slow-work facility to perform other work. Sometimes a work item can't progress immediately, but must wait for the completion of another work item that's currently being processed by another slow-work thread. In some circumstances, the waiting item could instead - theoretically - put itself back on the queue and yield its thread back to the slow-work facility, thus waiting till it gets processing time again before attempting to progress. This would allow other work items processing time on that thread. However, this only works if there is something on the queue for it to queue behind - otherwise it will just get a thread again immediately, and will end up cycling between the queue and the thread, eating up valuable CPU time. So, slow_work_sleep_till_thread_needed() is provided such that an item can put itself on a wait queue that will wake it up when the event it is actually interested in occurs, then call this function in lieu of calling schedule(). This function will then sleep until either the item's event occurs or another work item appears on the queue. If another work item is queued, but the item's event hasn't occurred, then the work item should requeue itself and yield the thread back to the slow-work facility by returning. This can be used by CacheFiles for an object that is being created on one thread to wait for an object being deleted on another thread where there is nothing on the queue for the creation to go and wait behind. As soon as an item appears on the queue that could be given thread time instead, CacheFiles can stick the creating object back on the queue and return to the slow-work facility - assuming the object deletion didn't also complete. Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:57 +00:00
David Howells	8fba10a42d	SLOW_WORK: Allow the work items to be viewed through a /proc file Allow the executing and queued work items to be viewed through a /proc file for debugging purposes. The contents look something like the following: THR PID ITEM ADDR FL MARK DESC === ===== ================ == ===== ========== 0 3005 ffff880023f52348 a 952ms FSC: OBJ17d3: LOOK 1 3006 ffff880024e33668 2 160ms FSC: OBJ17e5 OP60d3b: Write1/Store fl=2 2 3165 ffff8800296dd180 a 424ms FSC: OBJ17e4: LOOK 3 4089 ffff8800262c8d78 a 212ms FSC: OBJ17ea: CRTN 4 4090 ffff88002792bed8 2 388ms FSC: OBJ17e8 OP60d36: Write1/Store fl=2 5 4092 ffff88002a0ef308 2 388ms FSC: OBJ17e7 OP60d2e: Write1/Store fl=2 6 4094 ffff88002abaf4b8 2 132ms FSC: OBJ17e2 OP60d4e: Write1/Store fl=2 7 4095 ffff88002bb188e0 a 388ms FSC: OBJ17e9: CRTN vsq - ffff880023d99668 1 308ms FSC: OBJ17e0 OP60f91: Write1/EnQ fl=2 vsq - ffff8800295d1740 1 212ms FSC: OBJ16be OP4d4b6: Write1/EnQ fl=2 vsq - ffff880025ba3308 1 160ms FSC: OBJ179a OP58dec: Write1/EnQ fl=2 vsq - ffff880024ec83e0 1 160ms FSC: OBJ17ae OP599f2: Write1/EnQ fl=2 vsq - ffff880026618e00 1 160ms FSC: OBJ17e6 OP60d33: Write1/EnQ fl=2 vsq - ffff880025a2a4b8 1 132ms FSC: OBJ16a2 OP4d583: Write1/EnQ fl=2 vsq - ffff880023cbe6d8 9 212ms FSC: OBJ17eb: LOOK vsq - ffff880024d37590 9 212ms FSC: OBJ17ec: LOOK vsq - ffff880027746cb0 9 212ms FSC: OBJ17ed: LOOK vsq - ffff880024d37ae8 9 212ms FSC: OBJ17ee: LOOK vsq - ffff880024d37cb0 9 212ms FSC: OBJ17ef: LOOK vsq - ffff880025036550 9 212ms FSC: OBJ17f0: LOOK vsq - ffff8800250368e0 9 212ms FSC: OBJ17f1: LOOK vsq - ffff880025036aa8 9 212ms FSC: OBJ17f2: LOOK In the 'THR' column, executing items show the thread they're occupying and queued threads indicate which queue they're on. 'PID' shows the process ID of a slow-work thread that's executing something. 'FL' shows the work item flags. 'MARK' indicates how long since an item was queued or began executing. Lastly, the 'DESC' column permits the owner of an item to give some information. Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:51 +00:00
Jens Axboe	6b8268b17a	SLOW_WORK: Add delayed_slow_work support This adds support for starting slow work with a delay, similar to the functionality we have for workqueues. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:47 +00:00
Jens Axboe	0160950297	SLOW_WORK: Add support for cancellation of slow work Add support for cancellation of queued slow work and delayed slow work items. The cancellation functions will wait for items that are pending or undergoing execution to be discarded by the slow work facility. Attempting to enqueue work that is in the process of being cancelled will result in ECANCELED. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:43 +00:00
Jens Axboe	4d8bb2cbcc	SLOW_WORK: Make slow_work_ops ->get_ref/->put_ref optional Make the ability for the slow-work facility to take references on a work item optional as not everyone requires this. Even the internal slow-work stubs them out, so those can be got rid of too. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:39 +00:00
David Howells	3d7a641e54	SLOW_WORK: Wait for outstanding work items belonging to a module to clear Wait for outstanding slow work items belonging to a module to clear when unregistering that module as a user of the facility. This prevents the put_ref code of a work item from being taken away before it returns. Signed-off-by: David Howells <dhowells@redhat.com>	2009-11-19 18:10:23 +00:00
David S. Miller	3505d1a9fd	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/sfc/sfe4001.c drivers/net/wireless/libertas/cmd.c drivers/staging/Kconfig drivers/staging/Makefile drivers/staging/rtl8187se/Kconfig drivers/staging/rtl8192e/Kconfig	2009-11-18 22:19:03 -08:00
Eric W. Biederman	6d4561110a	sysctl: Drop & in front of every proc_handler. For consistency drop & in front of every proc_handler. Explicity taking the address is unnecessary and it prevents optimizations like stubbing the proc_handlers to NULL. Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-18 08:37:40 -08:00
Stanislaw Gruszka	8747d793fc	itimers: Fix racy writes to cpu_itimer fields incr_error and error fields of struct cpu_itimer are used when calculating next timer tick in check_cpu_itimers() and should not be modified without tsk->sighand->siglock taken. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <1253802903-979-1-git-send-email-sgruszka@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 16:32:12 +01:00
Rusty Russell	2ea6dec4a2	generic-ipi: Add smp_call_function_any() Andrew points out that acpi-cpufreq uses cpumask_any, when it really would prefer to use the same CPU if possible (to avoid an IPI). In general, this seems a good idea to offer. [ tglx: Documented selection preference and Inlined the UP case to avoid the copy of smp_call_function_single() and the extra EXPORT ] Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Len Brown <len.brown@intel.com> Cc: Zhao Yakui <yakui.zhao@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <efault@gmx.de> Cc: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 14:52:25 +01:00
Alexey Dobriyan	a1afb6371b	genirq: switch /proc/irq/*/spurious to seq_file [ tglx: compacted it a bit ] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> LKML-Reference: <20090828181743.GA14050@x200.localdomain> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 12:50:51 +01:00
Stanislaw Gruszka	ba5ea951d0	posix-cpu-timers: optimize and document timer_create callback We have already new_timer initialized to all-zeros hence in function initializations are not needed. Document function expectation about new_timer argument as well. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: johnstul@us.ibm.com Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 12:36:05 +01:00
H Hartley Sweeten	8e1a928a2e	clockevents: Add missing include to pacify sparse Include "tick-internal.h" in order to pick up the extern function prototype for clockevents_shutdown(). This quiets the following sparse build noise: warning: symbol 'clockevents_shutdown' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> LKML-Reference: <BD79186B4FD85F4B8E60E381CAEE190901E24550@mi8nycmail19.Mi8.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Cc: johnstul@us.ibm.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-18 12:31:48 +01:00
Tejun Heo	9398180097	workqueue: fix race condition in schedule_on_each_cpu() Commit `65a6446434` ("HWPOISON: Allow schedule_on_each_cpu() from keventd") which allows schedule_on_each_cpu() to be called from keventd added a race condition. schedule_on_each_cpu() may race with cpu hotplug and end up executing the function twice on a cpu. Fix it by moving direct execution into the section protected with get/put_online_cpus(). While at it, update code such that direct execution is done after works have been scheduled for all other cpus and drop unnecessary cpu != orig test from flush loop. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-11-17 17:40:33 -08:00
Lai Jiangshan	f6060f4681	tracing: Prevent build warning: 'ftrace_graph_buf' defined but not used Prevent build warning when CONFIG_FUNCTION_GRAPH_TRACER is not set. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4AF24381.5060307@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-17 11:05:49 -05:00
Carsten Emde	c13d2f7c32	tracing: Fix trace_marker output When a string was written to <debugfs>/tracing/trace_marker, some strange characters appeared in the trace output instead of the string, since a vprint function erroneously called a vararg print function with a va_list argument. This patch fixes the problem and simplifies the related code. Signed-off-by: Carsten Emde <C.Emde@osadl.org> LKML-Reference: <4B01AE5D.1010801@osadl.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-17 09:19:06 -05:00
Steven Rostedt	5a50e33cc9	ring-buffer: Move access to commit_page up into function used With the change of the way we process commits. Where a commit only happens at the outer most level, and that we don't need to worry about a commit ending after the rb_start_commit() has been called, the code use to grab the commit page before the tail page to prevent a possible race. But this race no longer exists with the rb_start_commit() rb_end_commit() interface. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-17 08:43:01 -05:00
Lin Ming	0696b711e4	timekeeping: Fix clock_gettime vsyscall time warp Since commit `0a544198` "timekeeping: Move NTP adjusted clock multiplier to struct timekeeper" the clock multiplier of vsyscall is updated with the unmodified clock multiplier of the clock source and not with the NTP adjusted multiplier of the timekeeper. This causes user space observerable time warps: new CLOCK-warp maximum: 120 nsecs, 00000025c337c537 -> 00000025c337c4bf Add a new argument "mult" to update_vsyscall() and hand in the timekeeping internal NTP adjusted multiplier. Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: "Zhang Yanmin" <yanmin_zhang@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tony Luck <tony.luck@intel.com> LKML-Reference: <1258436990.17765.83.camel@minggr.sh.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-17 11:52:34 +01:00
Ingo Molnar	a7b63425a4	Merge branch 'perf/core' into perf/probes Resolved merge conflict in tools/perf/Makefile Merge reason: we want to queue up a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-17 10:17:47 +01:00
Eric W. Biederman	bb9074ff58	Merge commit 'v2.6.32-rc7' Resolve the conflict between v2.6.32-rc7 where dn_def_dev_handler gets a small bug fix and the sysctl tree where I am removing all sysctl strategy routines.	2009-11-17 01:01:34 -08:00
Peter Zijlstra	559fdc3c1b	perf_event: Optimize perf_output_lock() The purpose of perf_output_{un,}lock() is to: 1) avoid publishing incomplete data [ possible when publishing a head that is ahead of an entry that is still being written ] 2) guarantee fwd progress [ a simple refcount on pending writers doesn't need to drop to 0, making it so would end up implementing something like forced quiecent states of RCU ] To satisfy the above without undue complexity it serializes between CPUs, this means that a pending writer can only be the same cpu in a nested context, and since (under normal operation) a cpu always makes progress we're good -- if the head is only published when the bottom most writer completes. Now we don't need to disable IRQs in order to serialize between CPUs, disabling preemption ought to be sufficient, esp since we already deal with nesting due to NMIs. This avoids potentially expensive (and needless) local IRQ disable/enable ops. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <1258373161.26714.254.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-16 13:27:45 +01:00
Peter Zijlstra	047106adcc	sched: Sched_rt_periodic_timer vs cpu hotplug Heiko reported a case where a timer interrupt managed to reference a root_domain structure that was already freed by a concurrent hot-un-plug operation. Solve this like the regular sched_domain stuff is also synchronized, by adding a synchronize_sched() stmt to the free path, this ensures that a root_domain stays present for any atomic section that could have observed it. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Gregory Haskins <ghaskins@novell.com> Cc: Siddha Suresh B <suresh.b.siddha@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <1258363873.26714.83.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-16 10:46:27 +01:00
Thomas Gleixner	dc186ad741	workqueue: Add debugobjects support Add debugobject support to track the life time of work_structs. While at it, remove duplicate definition of INIT_DELAYED_WORK_ON_STACK(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>	2009-11-16 01:09:48 +09:00
Tejun Heo	498657a478	sched, kvm: Fix race condition involving sched_in_preempt_notifers In finish_task_switch(), fire_sched_in_preempt_notifiers() is called after finish_lock_switch(). However, depending on architecture, preemption can be enabled after finish_lock_switch() which breaks the semantics of preempt notifiers. So move it before finish_arch_switch(). This also makes the in- notifiers symmetric to out- notifiers in terms of locking - now both are called under rq lock. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Avi Kivity <avi@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4AFD2801.7020900@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-15 09:59:54 +01:00
Ingo Molnar	0ffa798d94	Merge branches 'perf/powerpc' and 'perf/bench' into perf/core Merge reason: Both 'perf bench' and the pending PowerPC changes are now ready for the next merge window. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-15 09:51:24 +01:00
Ingo Molnar	39dc78b651	Merge commit 'v2.6.32-rc7' into perf/core Merge reason: pick up perf fixlets Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-15 09:50:41 +01:00
Paul E. McKenney	2f51f9884f	rcu: Eliminate __rcu_pending() false positives Now that there are both ->gpnum and ->completed fields in the rcu_node structure, __rcu_pending() should check rdp->gpnum and rdp->completed against rnp->gpnum and rdp->completed, respectively, instead of the prior comparison against the rcu_state fields rsp->gpnum and rsp->completed. Given the old comparison, __rcu_pending() could return 1, resulting in a needless raise_softirq(RCU_SOFTIRQ). This useless work would happen if RCU responded to a scheduling-clock interrupt after the rcu_state fields had been updated, but before the rcu_node fields had been updated. Changing the comparison from the rcu_state fields to the rcu_node fields prevents this useless work from happening. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12581706991966-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-14 10:31:42 +01:00
Paul E. McKenney	560d4bc0df	rcu: Further cleanups of use of lastcomp Now that a copy of the rsp->completed flag is available in all rcu_node structures, make full use of it. It is still legitimate to access rsp->completed while holding the root rcu_node structure's lock, however. Also, tighten up force_quiescent_state()'s checks for end of current grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1258170699933-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-14 10:31:42 +01:00
Thomas Gleixner	a362c638bd	clocksource/events: Fix fallout of generic code changes powerpc grew a new warning due to the type change of clockevent->mult. The architectures which use parts of the generic time keeping infrastructure tripped over my wrong assumption that clocksource_register is only used when GENERIC_TIME=y. I should have looked and also I should have known better. These renitent Gaul villages are racking my nerves. Some serious deprecating is due. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-14 00:35:52 +01:00
Thomas Gleixner	8e13c7b772	locking: Reduce ifdefs in kernel/spinlock.c With the Kconfig based inline decisions we can remove extra ifdefs in kernel/spinlock.c by creating the complex lockbreak functions as inlines which are inserted into the non inlined lock functions. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091109151428.548614772@linutronix.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2009-11-13 20:53:28 +01:00
Thomas Gleixner	6beb000923	locking: Make inlining decision Kconfig based commit `892a7c67` (locking: Allow arch-inlined spinlocks) implements the selection of which lock functions are inlined based on defines in arch/.../spinlock.h: #define __always_inline__LOCK_FUNCTION Despite of the name __always_inline__* the lock functions can be built out of line depending on config options. Also if the arch does not set some inline defines the generic code might set them; again depending on config options. This makes it unnecessary hard to figure out when and which lock functions are inlined. Aside of that it makes it way harder and messier for -rt to manipulate the lock functions. Convert the inlining decision to CONFIG switches. Each lock function is inlined depending on CONFIG_INLINE_. The configs implement the existing dependencies. The architecture code can select ARCH_INLINE_ to signal that it wants the corresponding lock function inlined. ARCH_INLINE_* is necessary as Kconfig ignores "depends on" restrictions when a config element is selected. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20091109151428.504477141@linutronix.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2009-11-13 20:53:28 +01:00
Jon Hunter	97813f2fe7	nohz: Allow 32-bit machines to sleep for more than 2.15 seconds In the dynamic tick code, "max_delta_ns" (member of the "clock_event_device" structure) represents the maximum sleep time that can occur between timer events in nanoseconds. The variable, "max_delta_ns", is defined as an unsigned long which is a 32-bit integer for 32-bit machines and a 64-bit integer for 64-bit machines (if -m64 option is used for gcc). The value of max_delta_ns is set by calling the function "clockevent_delta2ns()" which returns a maximum value of LONG_MAX. For a 32-bit machine LONG_MAX is equal to 0x7fffffff and in nanoseconds this equates to ~2.15 seconds. Hence, the maximum sleep time for a 32-bit machine is ~2.15 seconds, where as for a 64-bit machine it will be many years. This patch changes the type of max_delta_ns to be "u64" instead of "unsigned long" so that this variable is a 64-bit type for both 32-bit and 64-bit machines. It also changes the maximum value returned by clockevent_delta2ns() to KTIME_MAX. Hence this allows a 32-bit machine to sleep for longer than ~2.15 seconds. Please note that this patch also changes "min_delta_ns" to be "u64" too and although this is unnecessary, it makes the patch simpler as it avoids to fixup all callers of clockevent_delta2ns(). [ tglx: changed "unsigned long long" to u64 as we use this data type through out the time code ] Signed-off-by: Jon Hunter <jon-hunter@ti.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1250617512-23567-3-git-send-email-jon-hunter@ti.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-13 20:46:24 +01:00
Thomas Gleixner	27185016b8	nohz: Track last do_timer() cpu The previous patch which limits the sleep time to the maximum deferment time of the time keeping clocksource has some limitations on SMP machines: if all CPUs are idle then for all CPUs the maximum sleep time is limited. Solve this by keeping track of which cpu had the do_timer() duty assigned last and limit the sleep time only for this cpu. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Cc: Jon Hunter <jon-hunter@ti.com> Cc: John Stultz <johnstul@us.ibm.com>	2009-11-13 20:46:24 +01:00
Jon Hunter	98962465ed	nohz: Prevent clocksource wrapping during idle The dynamic tick allows the kernel to sleep for periods longer than a single tick, but it does not limit the sleep time currently. In the worst case the kernel could sleep longer than the wrap around time of the time keeping clock source which would result in losing track of time. Prevent this by limiting it to the safe maximum sleep time of the current time keeping clock source. The value is calculated when the clock source is registered. [ tglx: simplified the code a bit and massaged the commit msg ] Signed-off-by: Jon Hunter <jon-hunter@ti.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1250617512-23567-2-git-send-email-jon-hunter@ti.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-13 20:46:24 +01:00
Thomas Gleixner	529eaccd90	nohz: Type cast printk argument On some archs local_softirq_pending() has a data type of unsigned long on others its unsigned int. Type cast it to (unsigned int) in the printk to avoid the compiler warning. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission>	2009-11-13 20:46:24 +01:00
Thomas Gleixner	7d2f944a2b	clocksource: Provide a generic mult/shift factor calculation MIPS has two functions to calculcate the mult/shift factors for clock sources and clock events at run time. ARM needs such functions as well. Implement a function which calculates the mult/shift factors based on the frequencies to which and from which is converted. The function also has a parameter to specify the minimum conversion range in seconds. This range is guaranteed not to produce a 64bit overflow when a value is multiplied with the calculated mult factor. The larger the conversion range the less becomes the conversion accuracy. Provide two inline wrappers which handle clock events and clock sources. For clock events the "from" frequency is nano seconds per second which corresponds to 1GHz and "to" is the device frequency. For clock sources "from" is the device frequency and "to" is nano seconds per second. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Linus Walleij <linus.walleij@stericsson.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20091111134229.766673305@linutronix.de>	2009-11-13 20:46:23 +01:00
Thomas Gleixner	23af368e9a	clockevents: Use u32 for mult and shift factors The mult and shift factors of clock events differ in their data type from those of clock sources for no reason. u32 is sufficient for both. shift is always <= 32 and mult is limited to 2^32-1 to avoid 64bit multiplication overflows in the conversion. Preparatory patch for a generic mult/shift factor calculation function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Linus Walleij <linus.walleij@stericsson.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20091111134229.725664788@linutronix.de>	2009-11-13 20:46:23 +01:00
Frederic Weisbecker	67178767b9	tracing: Rename 'lockdep' event subsystem into 'lock' Lockdep events subsystem gathers various locking related events such as a request, release, contention or acquisition of a lock. The name of this event subsystem is a bit of a misnomer since these events are not quite related to lockdep but more generally to locking, ie: these events are not reporting lock dependencies or possible deadlock scenario but pure locking events. Hence this rename. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1258103194-843-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:48:27 +01:00
Paul E. McKenney	8e9aa8f067	rcu: Simplify association of forced quiescent states with grace periods The force_quiescent_state() function also took a snapshot of the ->completed field, which was as obnoxious as it was in rcu_sched_qs() and friends. So snapshot ->gpnum-1. Also, since the dyntick_record_completed() and dyntick_recall_completed() functions are now simple assignments that are independent of CONFIG_NO_HZ, and since their names are now misleading, get rid of them. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12580941042308-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:18:36 +01:00
Paul E. McKenney	b32e9eb6ad	rcu: Accelerate callback processing on CPUs not detecting GP end An earlier fix for a race resulted in a situation where the CPUs other than the CPU that detected the end of the grace period would not process their callbacks until the next grace period started. This means that these other CPUs would unnecessarily demand that an extra grace period be started. This patch eliminates this extra grace period and speeds callback processing by propagating rsp->completed to the rcu_node structures in the case where the CPU detecting the end of the grace period sees no reason to start a new grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1258094104417-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:18:36 +01:00
Peter Zijlstra	fe3bcfe1f6	sched: More generic WAKE_AFFINE vs select_idle_sibling() Instead of only considering SD_WAKE_AFFINE \| SD_PREFER_SIBLING domains also allow all SD_PREFER_SIBLING domains below a SD_WAKE_AFFINE domain to change the affinity target. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091112145610.909723612@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:09:59 +01:00
Peter Zijlstra	a50bde5130	sched: Cleanup select_task_rq_fair() Clean up the new affine to idle sibling bits while trying to grok them. Should not have any function differences. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <20091112145610.832503781@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-13 10:09:58 +01:00
Hidetoshi Seto	761b1d26df	sched: Fix granularity of task_u/stime() Originally task_s/utime() were designed to return clock_t but later changed to return cputime_t by following commit: commit `efe567fc82` Author: Christian Borntraeger <borntraeger@de.ibm.com> Date: Thu Aug 23 15:18:02 2007 +0200 It only changed the type of return value, but not the implementation. As the result the granularity of task_s/utime() is still that of clock_t, not that of cputime_t. So using task_s/utime() in __exit_signal() makes values accumulated to the signal struct to be rounded and coarse grained. This patch removes casts to clock_t in task_u/stime(), to keep granularity of cputime_t over the calculation. v2: Use div_u64() to avoid error "undefined reference to `__udivdi3`" on some 32bit systems. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: xiyou.wangcong@gmail.com Cc: Spencer Candland <spencer@bluehost.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> LKML-Reference: <4AFB9029.9000208@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-12 15:23:47 +01:00
Mike Galbraith	055a00865d	sched: Fix/add missing update_rq_clock() calls kthread_bind(), migrate_task() and sched_fork were missing updates, and try_to_wake_up() was updating after having already used the stale clock. Aside from preventing potential latency hits, there' a side benefit in that early boot printk time stamps become monotonic. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1258020464.6491.2.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <new-submission>	2009-11-12 12:28:29 +01:00
Eric W. Biederman	4739a9748e	sysctl: Remove the last of the generic binary sysctl support Now that all of the users stopped using ctl_name and strategy it is safe to remove the fields from struct ctl_table, and it is safe to remove the stub strategy routines as well. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 02:05:06 -08:00
Eric W. Biederman	56992309cc	sysctl kernel: Remove binary sysctl logic Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name and .strategy members of sysctl tables are dead code. Remove them. Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 02:04:55 -08:00
Eric W. Biederman	757010f026	sysctl binary: Reorder the tests to process wild card entries first. A malicious user could have passed in a ctl_name of 0 and triggered the well know ctl_name to procname mapping code, instead of the wild card matching code. This is a slight problem as wild card entries don't have procnames, and because in some alternate universe a network device might have ifindex 0. So test for and handle wild card entries first. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 01:42:31 -08:00
Eric W. Biederman	63395b6597	sysctl: sysctl_binary.c Fix compilation when !CONFIG_NET dev_get_by_index does not exist when the network stack is not compiled in, so only include the code to follow wild card paths when the network stack is present. I have shuffled the code around a little to make it clear that dev_put is called after dev_get_by_index showing that there is no leak. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-12 01:25:43 -08:00
Steven Rostedt	8b2a5dac78	tracing: do not disable interrupts for trace_clock_local Disabling interrupts in trace_clock_local takes quite a performance hit to the recording of traces. Using perf top we see: ------------------------------------------------------------------------------ PerfTop: 244 irqs/sec kernel:100.0% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 2842.00 - 40.4% : trace_clock_local 1043.00 - 14.8% : rb_reserve_next_event 784.00 - 11.1% : ring_buffer_lock_reserve 600.00 - 8.5% : __rb_reserve_next 579.00 - 8.2% : rb_end_commit 440.00 - 6.3% : ring_buffer_unlock_commit 290.00 - 4.1% : ring_buffer_producer_thread [ring_buffer_benchmark] 155.00 - 2.2% : debug_smp_processor_id 117.00 - 1.7% : trace_recursive_unlock 103.00 - 1.5% : ring_buffer_event_data 28.00 - 0.4% : do_gettimeofday 22.00 - 0.3% : _spin_unlock_irq 14.00 - 0.2% : native_read_tsc 11.00 - 0.2% : getnstimeofday Where trace_clock_local is 40% of the tracing, and the time for recording a trace according to ring_buffer_benchmark is 210ns. After converting the interrupts to preemption disabling we have from perf top: ------------------------------------------------------------------------------ PerfTop: 1084 irqs/sec kernel:99.9% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 1277.00 - 16.8% : native_read_tsc 1148.00 - 15.1% : rb_reserve_next_event 896.00 - 11.8% : ring_buffer_lock_reserve 688.00 - 9.1% : __rb_reserve_next 664.00 - 8.8% : rb_end_commit 563.00 - 7.4% : ring_buffer_unlock_commit 508.00 - 6.7% : _spin_unlock_irq 365.00 - 4.8% : debug_smp_processor_id 321.00 - 4.2% : trace_clock_local 303.00 - 4.0% : ring_buffer_producer_thread [ring_buffer_benchmark] 273.00 - 3.6% : native_sched_clock 122.00 - 1.6% : trace_recursive_unlock 113.00 - 1.5% : sched_clock 101.00 - 1.3% : ring_buffer_event_data 53.00 - 0.7% : tick_nohz_stop_sched_tick Where trace_clock_local drops from 40% to only taking 4% of the total time. The trace time also goes from 210ns down to 179ns (31ns). I talked with Peter Zijlstra about the impact that sched_clock may have without having interrupts disabled, and he told me that if a timer interrupt comes in, sched_clock may report a wrong time. Balancing a seldom incorrect timestamp with a 15% performance boost, I'll take the performance boost. Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-11 23:38:33 -05:00
Eric W. Biederman	2fb10732c3	sysctl: Warn about all uses of sys_sysctl. Now that the glibc pthread implemenation no longers uses sysctl() users of sysctl are as rare as hen's teeth. So remove the glibc exception from the warning, and use the standard printk_ratelimit instead of rolling our own. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 19:35:52 -08:00
Steven Rostedt	a6f0eb6adc	ring-buffer: Add multiple iterations between benchmark timestamps The ring_buffer_benchmark does a gettimeofday after every write to the ring buffer in its measurements. This adds the overhead of the call to gettimeofday to the measurements and does not give an accurate picture of the length of time it takes to record a trace. This was first noticed with perf top: ------------------------------------------------------------------------------ PerfTop: 679 irqs/sec kernel:99.9% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 1673.00 - 27.8% : trace_clock_local 806.00 - 13.4% : do_gettimeofday 590.00 - 9.8% : rb_reserve_next_event 554.00 - 9.2% : native_read_tsc 431.00 - 7.2% : ring_buffer_lock_reserve 365.00 - 6.1% : __rb_reserve_next 355.00 - 5.9% : rb_end_commit 322.00 - 5.4% : getnstimeofday 268.00 - 4.5% : ring_buffer_unlock_commit 262.00 - 4.4% : ring_buffer_producer_thread [ring_buffer_benchmark] 113.00 - 1.9% : read_tsc 91.00 - 1.5% : debug_smp_processor_id 69.00 - 1.1% : trace_recursive_unlock 66.00 - 1.1% : ring_buffer_event_data 25.00 - 0.4% : _spin_unlock_irq And the length of each write to the ring buffer measured at 310ns. This patch adds a new module parameter called "write_interval" which is defaulted to 50. This is the number of writes performed between timestamps. After this patch perf top shows: ------------------------------------------------------------------------------ PerfTop: 244 irqs/sec kernel:100.0% [1000Hz cpu-clock-msecs], (all, 4 CPUs) ------------------------------------------------------------------------------ samples pcnt kernel function _______ _____ _______________ 2842.00 - 40.4% : trace_clock_local 1043.00 - 14.8% : rb_reserve_next_event 784.00 - 11.1% : ring_buffer_lock_reserve 600.00 - 8.5% : __rb_reserve_next 579.00 - 8.2% : rb_end_commit 440.00 - 6.3% : ring_buffer_unlock_commit 290.00 - 4.1% : ring_buffer_producer_thread [ring_buffer_benchmark] 155.00 - 2.2% : debug_smp_processor_id 117.00 - 1.7% : trace_recursive_unlock 103.00 - 1.5% : ring_buffer_event_data 28.00 - 0.4% : do_gettimeofday 22.00 - 0.3% : _spin_unlock_irq 14.00 - 0.2% : native_read_tsc 11.00 - 0.2% : getnstimeofday do_gettimeofday dropped from 13% usage to a mere 0.4%! (using the default 50 interval) The measurement for each timestamp went from 310ns to 210ns. That's 100ns (1/3rd) overhead that the gettimeofday call was introducing. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-11 22:22:15 -05:00
David S. Miller	3586e0a9a4	clocksource/timecompare: Fix symbol exports to be GPL'd. Noticed by Thomas GLeixner. Signed-off-by: David S. Miller <davem@davemloft.net>	2009-11-11 19:06:30 -08:00
Roel Kluin	a646365cc3	tracing: Fix return value of tracing_stats_read() The function tracing_stats_read() mistakenly returns ENOMEM instead of the negative value -ENOMEM. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> LKML-Reference: <4AFB2C0B.50605@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-11 21:26:55 -05:00
Paul E. McKenney	0e0fc1c23e	rcu: Mark init-time-only rcu_bootup_announce() as __init Because rcu_bootup_announce() is used only at boot time, mark it as __init, presumably so that its memory can be reclaimed. Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <20091111192806.GA10073@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-11 21:27:42 +01:00
Linus Torvalds	961767b75d	Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: highmem: Fix debug_kmap_atomic() to also handle KM_IRQ_PTE, KM_NMI, and KM_NMI_PTE highmem: Fix race in debug_kmap_atomic() which could cause warn_count to underflow rcu: Fix long-grace-period race between forcing and initialization uids: Prevent tear down race	2009-11-11 11:30:15 -08:00
Linus Torvalds	1fd18a871a	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: try_one_irq() must be called with irq disabled	2009-11-11 11:29:58 -08:00
Eric W. Biederman	2315ffa0a9	sysctl: Don't look at ctl_name and strategy in the generic code The ctl_name and strategy fields are unused, now that sys_sysctl is a compatibility wrapper around /proc/sys. No longer looking at them in the generic code is effectively what we are doing now and provides the guarantee that during further cleanups we can just remove references to those fields and everything will work ok. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:53:43 -08:00
Eric W. Biederman	6fce56ec91	sysctl: Remove references to ctl_name and strategy from the generic sysctl table Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name and .strategy members of sysctl tables are dead code. Remove them. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:42:56 -08:00
Eric W. Biederman	83ac201b4f	sysctl: Remove dead code from sysctl_check Now that the sys_sysctl is now a compatibility wrapper around /proc/sys we can remove much of sysctl_check and reduce it to a few remaining sanity checks. This completely decouples it from the binary sysctl system call. Little things like ensuring that the sysctl has not already been registered are all that remain. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:42:53 -08:00
Eric W. Biederman	a965cf946d	sysctl: Neuter the generic sysctl strategy routines. Now that sys_sysctl is a compatibility layer on top of /proc/sys these routines are never called but are still put in sysctl tables so I have reduced them to stubs until they can be removed entirely. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:42:51 -08:00
Eric W. Biederman	26a7034b40	sysctl: Reduce sys_sysctl to a compatibility wrapper around /proc/sys To simply maintenance and to be able to remove all of the binary sysctl support from various subsystems I have rewritten the binary sysctl code as a compatibility wrapper around proc/sys. The code is built around a hard coded table based on the table in sysctl_check.c that lists all of our current binary sysctls and provides enough information to convert from the sysctl binary input into into ascii and back again. New in this patch is the realization that the only dynamic entries that need to be handled have ifname as the asscii string and ifindex as their ctl_name. When a sys_sysctl is called the code now looks in the translation table converting the binary name to the path under /proc where the value is to be found. Opens that file, and calls into a format conversion wrapper that calls fop->read and then fop->write as appropriate. Since in practice the practically no one uses or tests sys_sysctl rewritting the code to be beautiful is a little silly. The redeeming merit of this work is it allows us to rip out all of the binary sysctl syscall support from everywhere else in the tree. Allowing us to remove a lot of dead (after this patch) and barely maintained code. In addition it becomes much easier to optimize the sysctl implementation for being the backing store of /proc/sys, without having to worry about sys_sysctl. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-11 00:42:05 -08:00
Paul E. McKenney	c64ac3ce06	rcu: Simplify association of quiescent states with grace periods The rdp->passed_quiesc_completed fields are used to properly associate the recorded quiescent state with a grace period. It is OK to wrongly associate a given quiescent state with a preceding grace period, but it is fatal to associate a given quiescent state with a grace period that begins after the quiescent state occurred. Grace periods are numbered, and the following fields track them: o ->gpnum is the number of the grace period currently in progress, or the number of the last grace period to complete if no grace period is currently in progress. o ->completed is the number of the last grace period to have completed. These two fields are equal if there is no grace period in progress, otherwise ->gpnum is one greater than ->completed. But the rdp->passed_quiesc_completed field compared against ->completed, and if equal, the quiescent state is presumed to count against the current grace period. The earlier code copied rdp->completed to rdp->passed_quiesc_completed, which has been made to work, but is error-prone. In contrast, copying one less than rdp->gpnum is guaranteed safe, because rdp->gpnum is not incremented until after the start of the corresponding grace period. At the end of the grace period, when ->completed has incremented, then any quiescent periods recorded previously will be discarded. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12578890421011-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 22:48:50 +01:00
Paul E. McKenney	4bcfe05503	rcu: Rename dynticks_completed to completed_fqs This field is used whether or not CONFIG_NO_HZ is set, so the old name of ->dynticks_completed is quite misleading. Change to ->completed_fqs, given that it the value that force_quiescent_state() is trying to drive the ->completed field away from. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12578890423298-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 22:48:50 +01:00
Paul E. McKenney	956539b759	rcu: Enable synchronize_sched_expedited() fastpath This patch adds a counter increment to enable tasks to actually take the synchronize_sched_expedited() function's fastpath. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1257889042435-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 22:48:49 +01:00
Paul E. McKenney	dbe01350fa	rcu: Remove inline from forward-referenced functions Some variants of gcc are reputed to dislike forward references to functions declared "inline". Remove the "inline" keyword from such functions. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <12578890422402-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 22:48:49 +01:00
Peter Zijlstra	ffd44db5f0	sched: Make sure task has correct sched_class after policy change From the code in rt_mutex_setprio(), it is evident that the intention is that task's with a RT 'prio' value as a consequence of receiving a PI boost also have their 'sched_class' field set to '&rt_sched_class'. However, Peter noticed that the code in __setscheduler() could result in this intention being frustrated. Fix it. Reported-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <1257880321.4108.457.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 20:22:31 +01:00
Frederic Weisbecker	f60d24d2ad	hw-breakpoints: Fix broken hw-breakpoint sample module The hw-breakpoint sample module has been broken during the hw-breakpoint internals refactoring. Propagate the changes to it. Reported-by: "K. Prasad" <prasad@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-11-10 11:23:29 +01:00
Paul Mundt	676c0dbe6e	ksym_tracer: Support read accesses independent of read/write. All of the infrastructure already exists to support read accesses for platforms that support a read access independently of read/write (such as in the case of the SuperH UBC). This just trivially hooks up the read case by itself. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20091109083733.GA25848@linux-sh.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-11-10 11:10:08 +01:00
Mike Galbraith	eae0c9dfb5	sched: Fix and clean up rate-limit newidle code Commit `1b9508f`, "Rate-limit newidle" has been confirmed to fix the netperf UDP loopback regression reported by Alex Shi. This is a cleanup and a fix: - moved to a more out of the way spot - fix to ensure that balancing doesn't try to balance runqueues which haven't gone online yet, which can mess up CPU enumeration during boot. Reported-by: Alex Shi <alex.shi@intel.com> Reported-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org> # .32.x: `a1f84a3`: sched: Check for an idle shared cache Cc: <stable@kernel.org> # .32.x: `1b9508f`: sched: Rate-limit newidle Cc: <stable@kernel.org> # .32.x: `fd21073`: sched: Fix affinity logic Cc: <stable@kernel.org> # .32.x LKML-Reference: <1257821402.5648.17.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:25:58 +01:00
Paul E. McKenney	9160306e6f	rcu: Fix note_new_gpnum() uses of ->gpnum Impose a clear locking design on the note_new_gpnum() function's use of the ->gpnum counter. This is done by updating rdp->gpnum only from the corresponding leaf rcu_node structure's rnp->gpnum field, and even then only under the protection of that same rcu_node structure's ->lock field. Performance and scalability are maintained using a form of double-checked locking, and excessive spinning is avoided by use of the spin_trylock() function. The use of spin_trylock() is safe due to the fact that CPUs who fail to acquire this lock will try again later. The hierarchical nature of the rcu_node data structure limits contention (which could be limited further if need be using the RCU_FANOUT kernel parameter). Without this patch, obscure but quite possible races could result in a quiescent state that occurred during one grace period to be accounted to the following grace period, causing this following grace period to end prematurely. Not good! Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: <stable@kernel.org> # .32.x LKML-Reference: <12571987492350-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:12:11 +01:00
Paul E. McKenney	d09b62dfa3	rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter Impose a clear locking design on the rcu_process_gp_end() function's use of the ->completed counter. This is done by creating a ->completed field in the rcu_node structure, which can safely be accessed under the protection of that structure's lock. Performance and scalability are maintained by using a form of double-checked locking, so that rcu_process_gp_end() only acquires the leaf rcu_node structure's ->lock if a grace period has recently ended. This fix reduces rcutorture failure rate by at least two orders of magnitude under heavy stress with force_quiescent_state() being invoked artificially often. Without this fix, unsynchronized access to the ->completed field can cause rcu_process_gp_end() to advance callbacks whose grace period has not yet expired. (Bad idea!) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: <stable@kernel.org> # .32.x LKML-Reference: <12571987494069-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:11:54 +01:00
Paul E. McKenney	281d150c5f	rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter Impose a clear locking design on non-NO_HZ handling of the ->completed counter. This increases the distance between the RCU and the CPU-hotplug mechanisms. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: <stable@kernel.org> # .32.x LKML-Reference: <12571987491353-git-send-email-> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:11:17 +01:00
Ingo Molnar	7e1a2766e6	Merge branch 'core/urgent' into core/rcu Merge reason: Pick up RCU fixlet to base further commits on. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-10 04:10:35 +01:00
Eric Paris	dd8dbf2e68	security: report the module name to security_module_request For SELinux to do better filtering in userspace we send the name of the module along with the AVC denial when a program is denied module_request. Example output: type=SYSCALL msg=audit(11/03/2009 10:59:43.510:9) : arch=x86_64 syscall=write success=yes exit=2 a0=3 a1=7fc28c0d56c0 a2=2 a3=7fffca0d7440 items=0 ppid=1727 pid=1729 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=rpc.nfsd exe=/usr/sbin/rpc.nfsd subj=system_u:system_r:nfsd_t:s0 key=(null) type=AVC msg=audit(11/03/2009 10:59:43.510:9) : avc: denied { module_request } for pid=1729 comm=rpc.nfsd kmod="net-pf-10" scontext=system_u:system_r:nfsd_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=system Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>	2009-11-10 09:33:46 +11:00
Naohiro Ooiwa	f84d49b218	signal: Print warning message when dropping signals When the system has too many timers or too many aggregate queued signals, the EAGAIN error is returned to application from kernel, including timer_create() [POSIX.1b]. It means that the app exceeded the limit of pending signals, but in general application writers do not expect this outcome and the current silent failure can cause rare app failures under very high load. This patch adds a new message when we reach the limit and if print_fatal_signals is enabled: task/1234: reached RLIMIT_SIGPENDING, dropping signal If you see this message and your system behaved unexpectedly, you can run following command to lift the limit: # ulimit -i unlimited With help from Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>. Signed-off-by: Naohiro Ooiwa <nooiwa@miraclelinux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Roland McGrath <roland@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: oleg@redhat.com LKML-Reference: <4AF6E7E2.9080406@miraclelinux.com> [ Modified a few small details, gave surrounding code some love. ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-09 09:44:26 +01:00
Uwe Kleine-K�nig	b71a8eb0fa	tree-wide: fix typos "selct" + "slect" -> "select" This patch was generated by git grep -E -i -l 's(le\|el)ct' \| xargs -r perl -p -i -e 's/([Ss])(le\|el)ct/$1elect/ with only skipping net/netfilter/xt_SECMARK.c and include/linux/netfilter/xt_SECMARK.h which have a struct member called selctx. Signed-off-by: Uwe Kleine-K�nig <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-11-09 09:40:56 +01:00
Li Zefan	30ff21e31f	ksym_tracer: Remove KSYM_SELFTEST_ENTRY The macro used to be used in both trace_selftest.c and trace_ksym.c, but no longer, so remove it from header file. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-11-08 16:21:01 +01:00
Frederic Weisbecker	ba1c813a6b	hw-breakpoints: Arbitrate access to pmu following registers constraints Allow or refuse to build a counter using the breakpoints pmu following given constraints. We keep track of the pmu users by using three per cpu variables: - nr_cpu_bp_pinned stores the number of pinned cpu breakpoints counters in the given cpu - nr_bp_flexible stores the number of non-pinned breakpoints counters in the given cpu. - task_bp_pinned stores the number of pinned task breakpoints in a cpu The latter is not a simple counter but gathers the number of tasks that have n pinned breakpoints. Considering HBP_NUM the number of available breakpoint address registers: task_bp_pinned[0] is the number of tasks having 1 breakpoint task_bp_pinned[1] is the number of tasks having 2 breakpoints [...] task_bp_pinned[HBP_NUM - 1] is the number of tasks having the maximum number of registers (HBP_NUM). When a breakpoint counter is created and wants an access to the pmu, we evaluate the following constraints: == Non-pinned counter == - If attached to a single cpu, check: (per_cpu(nr_bp_flexible, cpu) \|\| (per_cpu(nr_cpu_bp_pinned, cpu) + max(per_cpu(task_bp_pinned, cpu)))) < HBP_NUM -> If there are already non-pinned counters in this cpu, it means there is already a free slot for them. Otherwise, we check that the maximum number of per task breakpoints (for this cpu) plus the number of per cpu breakpoint (for this cpu) doesn't cover every registers. - If attached to every cpus, check: (per_cpu(nr_bp_flexible, ) \|\| (max(per_cpu(nr_cpu_bp_pinned, )) + max(per_cpu(task_bp_pinned, )))) < HBP_NUM -> This is roughly the same, except we check the number of per cpu bp for every cpu and we keep the max one. Same for the per tasks breakpoints. == Pinned counter == - If attached to a single cpu, check: ((per_cpu(nr_bp_flexible, cpu) > 1) + per_cpu(nr_cpu_bp_pinned, cpu) + max(per_cpu(task_bp_pinned, cpu))) < HBP_NUM -> Same checks as before. But now the nr_bp_flexible, if any, must keep one register at least (or flexible breakpoints will never be be fed). - If attached to every cpus, check: ((per_cpu(nr_bp_flexible, ) > 1) + max(per_cpu(nr_cpu_bp_pinned, )) + max(per_cpu(task_bp_pinned, ))) < HBP_NUM Changes in v2: - Counter -> event rename Changes in v5: - Fix unreleased non-pinned task-bound-only counters. We only released it in the first cpu. (Thanks to Paul Mackerras for reporting that) Changes in v6: - Currently, events scheduling are done in this order: cpu context pinned + cpu context non-pinned + task context pinned + task context non-pinned events. Then our current constraints are right theoretically but not in practice, because non-pinned counters may be scheduled before we can apply every possible pinned counters. So consider non-pinned counters as pinned for now. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org>	2009-11-08 16:20:47 +01:00
Frederic Weisbecker	24f1e32c60	hw-breakpoints: Rewrite the hw-breakpoints layer on top of perf events This patch rebase the implementation of the breakpoints API on top of perf events instances. Each breakpoints are now perf events that handle the register scheduling, thread/cpu attachment, etc.. The new layering is now made as follows: ptrace kgdb ftrace perf syscall \ \| / / \ \| / / / Core breakpoint API / / \| / \| / Breakpoints perf events \| \| Breakpoints PMU ---- Debug Register constraints handling (Part of core breakpoint API) \| \| Hardware debug registers Reasons of this rewrite: - Use the centralized/optimized pmu registers scheduling, implying an easier arch integration - More powerful register handling: perf attributes (pinned/flexible events, exclusive/non-exclusive, tunable period, etc...) Impact: - New perf ABI: the hardware breakpoints counters - Ptrace breakpoints setting remains tricky and still needs some per thread breakpoints references. Todo (in the order): - Support breakpoints perf counter events for perf tools (ie: implement perf_bpcounter_event()) - Support from perf tools Changes in v2: - Follow the perf "event " rename - The ptrace regression have been fixed (ptrace breakpoint perf events weren't released when a task ended) - Drop the struct hw_breakpoint and store generic fields in perf_event_attr. - Separate core and arch specific headers, drop asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h - Use new generic len/type for breakpoint - Handle off case: when breakpoints api is not supported by an arch Changes in v3: - Fix broken CONFIG_KVM, we need to propagate the breakpoint api changes to kvm when we exit the guest and restore the bp registers to the host. Changes in v4: - Drop the hw_breakpoint_restore() stub as it is only used by KVM - EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a module - Restore the breakpoints unconditionally on kvm guest exit: TIF_DEBUG_THREAD doesn't anymore cover every cases of running breakpoints and vcpu->arch.switch_db_regs might not always be set when the guest used debug registers. (Waiting for a reliable optimization) Changes in v5: - Split-up the asm-generic/hw-breakpoint.h moving to linux/hw_breakpoint.h into a separate patch - Optimize the breakpoints restoring while switching from kvm guest to host. We only want to restore the state if we have active breakpoints to the host, otherwise we don't care about messed-up address registers. - Add asm/hw_breakpoint.h to Kbuild - Fix bad breakpoint type in trace_selftest.c Changes in v6: - Fix wrong header inclusion in trace.h (triggered a build error with CONFIG_FTRACE_SELFTEST Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org>	2009-11-08 15:34:42 +01:00
Lai Jiangshan	d8c80ce091	sched, no_hz: Remove unused rq->last_tick_seen field In `15934a3732`, field last_tick_seen is added to struct rq. But it is unused now. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Guillaume Chazarain <guichaz@yahoo.fr> LKML-Reference: <4AE6A513.6010100@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 13:17:47 +01:00
Cyrill Gorcunov	e9036b36ee	sched: Use root_task_group_empty only with FAIR_GROUP_SCHED root_task_group_empty is used only with FAIR_GROUP_SCHED so if we use other scheduler options we get: kernel/sched.c:314: warning: 'root_task_group_empty' defined but not used So move CONFIG_FAIR_GROUP_SCHED up that it covers root_task_group_empty(). Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091026192414.GB5321@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 13:15:48 +01:00
Cyrill Gorcunov	c82a43d40b	irq: Do not attempt to create subdirectories if /proc/irq/<irq> failed If a parent directory (ie /proc/irq/<irq>) could not be created we should not attempt to create subdirectories. Otherwise it would lead that "smp_affinity" and "spurious" entries are may be registered under /proc root instead of a proper place. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <20091026202811.GD5321@lenovo> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 13:14:22 +01:00
Randy Dunlap	968c86458a	sched: Fix kernel-doc function parameter name Fix variable name in sched.c kernel-doc notation. Fixes this DocBook warning: Warning(kernel/sched.c:2008): No description found for parameter 'p' Warning(kernel/sched.c:2008): Excess function parameter 'k' description in 'kthread_bind' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <4AF4B1BC.8020604@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 11:26:25 +01:00
Frederic Weisbecker	444a2a3bcd	tracing, perf_events: Protect the buffer from recursion in perf While tracing using events with perf, if one enables the lockdep:lock_acquire event, it will infect every other perf trace events. Basically, you can enable whatever set of trace events through perf but if this event is part of the set, the only result we can get is a long list of lock_acquire events of rcu read lock, and only that. This is because of a recursion inside perf. 1) When a trace event is triggered, it will fill a per cpu buffer and submit it to perf. 2) Perf will commit this event but will also protect some data using rcu_read_lock 3) A recursion appears: rcu_read_lock triggers a lock_acquire event that will fill the per cpu event and then submit the buffer to perf. 4) Perf detects a recursion and ignores it 5) Perf continues its work on the previous event, but its buffer has been overwritten by the lock_acquire event, it has then been turned into a lock_acquire event of rcu read lock Such scenario also happens with lock_release with rcu_read_unlock(). We could turn the rcu_read_lock() into __rcu_read_lock() to drop the lock debugging from perf fast path, but that would make us lose the rcu debugging and that doesn't prevent from other possible kind of recursion from perf in the future. This patch adds a recursion protection based on a counter on the perf trace per cpu buffers to solve the problem. -v2: Fixed lost whitespace, added reviewed-by tag Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Jason Baron <jbaron@redhat.com> LKML-Reference: <1257477185-7838-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-08 10:31:42 +01:00
Yong Zhang	e7e7e0c084	genirq: try_one_irq() must be called with irq disabled Prarit reported: ================================= [ INFO: inconsistent lock state ] 2.6.32-rc5 #1 --------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (&irq_desc_lock_class){?.-...}, at: [<ffffffff810c264e>] try_one_irq+0x32/0x138 {IN-HARDIRQ-W} state was registered at: [<ffffffff81095160>] __lock_acquire+0x2fc/0xd5d [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d [<ffffffff814cdadd>] _spin_lock+0x40/0x89 [<ffffffff810c3389>] handle_level_irq+0x30/0x105 [<ffffffff81014e0e>] handle_irq+0x95/0xb7 [<ffffffff810141bd>] do_IRQ+0x6a/0xe0 [<ffffffff81012813>] ret_from_intr+0x0/0x16 irq event stamp: 195096 hardirqs last enabled at (195096): [<ffffffff814cd7f7>] _spin_unlock_irq+0x3a/0x5c hardirqs last disabled at (195095): [<ffffffff814cdbdd>] _spin_lock_irq+0x29/0x95 softirqs last enabled at (195088): [<ffffffff81068c92>] __do_softirq+0x1c1/0x1ef softirqs last disabled at (195093): [<ffffffff8101304c>] call_softirq+0x1c/0x30 other info that might help us debug this: 1 lock held by swapper/0: #0: (kernel/irq/spurious.c:21){+.-...}, at: [<ffffffff81070cf2>] run_timer_softirq+0x1a9/0x315 stack backtrace: Pid: 0, comm: swapper Not tainted 2.6.32-rc5 #1 Call Trace: <IRQ> [<ffffffff81093e94>] valid_state+0x187/0x1ae [<ffffffff81093fe4>] mark_lock+0x129/0x253 [<ffffffff810951d4>] __lock_acquire+0x370/0xd5d [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d [<ffffffff814cdadd>] _spin_lock+0x40/0x89 [<ffffffff810c264e>] try_one_irq+0x32/0x138 [<ffffffff810c2795>] poll_all_shared_irqs+0x41/0x6d [<ffffffff810c27dd>] poll_spurious_irqs+0x1c/0x49 [<ffffffff81070d82>] run_timer_softirq+0x239/0x315 [<ffffffff81068bd3>] __do_softirq+0x102/0x1ef [<ffffffff8101304c>] call_softirq+0x1c/0x30 [<ffffffff81014b65>] do_softirq+0x59/0xca [<ffffffff810686ad>] irq_exit+0x58/0xae [<ffffffff81029b84>] smp_apic_timer_interrupt+0x94/0xba [<ffffffff81012a33>] apic_timer_interrupt+0x13/0x20 The reason is that try_one_irq() is called from hardirq context with interrupts disabled and from softirq context (poll_all_shared_irqs()) with interrupts enabled. Disable interrupts before calling it from poll_all_shared_irqs(). Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> LKML-Reference: <1257563773-4620-1-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-07 21:44:45 +01:00
Eric W. Biederman	642c6d946b	sysctl: Make do_sysctl static Now that all of the architectures use compat_sys_sysctl do_sysctl can become static. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:53:59 -08:00
Eric W. Biederman	942405f360	sysctl: Remove the cond_syscall entry for sys32_sysctl Now that all architechtures are use compat_sys_sysctl and sys32_sysctl does not exist there is not point in retaining a cond_syscall entry for it. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:53:59 -08:00
Eric W. Biederman	da3f6f9b3e	sysctl: Introduce a generic compat sysctl sysctl This uses compat_alloc_userspace to remove the various hacks to allow do_sysctl to write to throuh oldlenp. The rest of our mature compat syscall helper facitilies are used as well to ensure we have a nice clean maintainable compat syscall that can be used on all architectures. The motiviation for a generic compat sysctl (besides the obvious hack removal) is to reduce the number of compat sysctl defintions out there so I can refactor the binary sysctl implementation. ppc already used the name compat_sys_sysctl so I remove the ppcs version here. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:52:55 -08:00
Eric W. Biederman	2830b68361	sysctl: Refactor the binary sysctl handling to remove duplicate code Read in the binary sysctl path once, instead of reread it from user space each time the code needs to access a path element. The deprecated sysctl warning is moved to do_sysctl so that the compat_sysctl entries syscalls will also warn. The return of -ENOSYS when !CONFIG_SYSCTL_SYSCALL is moved to binary_sysctl. Always leaving a do_sysctl available that handles !CONFIG_SYSCTL_SYSCALL and printing the deprecated sysctl warning allows for a single defitition of the sysctl syscall. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:23:35 -08:00
Eric W. Biederman	afa588b265	sysctl: Separate the binary sysctl logic into it's own file. In preparation for more invasive cleanups separate the core binary sysctl logic into it's own file. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2009-11-06 03:20:07 -08:00
Linus Torvalds	608221fdf9	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c sched: Disable SD_PREFER_LOCAL at node level sched: Fix boot crash by zalloc()ing most of the cpu masks sched: Strengthen buddies and mitigate buddy induced latencies	2009-11-05 10:56:47 -08:00
Linus Torvalds	72cc129e8d	Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: ftrace: Fix unmatched locking in ftrace_regex_write() ring-buffer: Synchronize resizing buffer with reader lock	2009-11-05 10:56:25 -08:00
Mike Galbraith	fd210738f6	sched: Fix affinity logic in select_task_rq_fair() Ingo Molnar reported: [ 26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10 [ 26.808000] caller is vmstat_update+0x26/0x70 [ 26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887 [ 26.816000] Call Trace: [ 26.820000] [<c1924a24>] ? printk+0x28/0x3c [ 26.824000] [<c13258a0>] debug_smp_processor_id+0xf0/0x110 [ 26.824000] mount used greatest stack depth: 1464 bytes left [ 26.828000] [<c111d086>] vmstat_update+0x26/0x70 [ 26.832000] [<c1086418>] worker_thread+0x188/0x310 [ 26.836000] [<c10863b7>] ? worker_thread+0x127/0x310 [ 26.840000] [<c108d310>] ? autoremove_wake_function+0x0/0x60 [ 26.844000] [<c1086290>] ? worker_thread+0x0/0x310 [ 26.848000] [<c108cf0c>] kthread+0x7c/0x90 [ 26.852000] [<c108ce90>] ? kthread+0x0/0x90 [ 26.856000] [<c100c0a7>] kernel_thread_helper+0x7/0x10 [ 26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10 [ 26.864000] caller is vmstat_update+0x3c/0x70 Because this commit: `a1f84a3`: sched: Check for an idle shared cache in select_task_rq_fair() broke ->cpus_allowed. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: arjan@infradead.org Cc: <stable@kernel.org> LKML-Reference: <1257415066.12867.1.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-05 11:01:39 +01:00
Martin Schwidefsky	3c5d92a0cf	nohz: Introduce arch_needs_cpu Allow the architecture to request a normal jiffy tick when the system goes idle and tick_nohz_stop_sched_tick is called . On s390 the hook is used to prevent the system going fully idle if there has been an interrupt other than a clock comparator interrupt since the last wakeup. On s390 the HiperSockets response time for 1 connection ping-pong goes down from 42 to 34 microseconds. The CPU cost decreases by 27%. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> LKML-Reference: <20090929122533.402715150@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-05 07:53:53 +01:00
Martin Schwidefsky	eed3b9cf3f	nohz: Reuse ktime in sub-functions of tick_check_idle. On a system with NOHZ=y tick_check_idle calls tick_nohz_stop_idle and tick_nohz_update_jiffies. Given the right conditions (ts->idle_active and/or ts->tick_stopped) both function get a time stamp with ktime_get. The same time stamp can be reused if both function require one. On s390 this change has the additional benefit that gcc inlines the tick_nohz_stop_idle function into tick_check_idle. The number of instructions to execute tick_check_idle drops from 225 to 144 (without the ktime_get optimization it is 367 vs 215 instructions). before: 0) \| tick_check_idle() { 0) \| tick_nohz_stop_idle() { 0) \| ktime_get() { 0) \| read_tod_clock() { 0) 0.601 us \| } 0) 1.765 us \| } 0) 3.047 us \| } 0) \| ktime_get() { 0) \| read_tod_clock() { 0) 0.570 us \| } 0) 1.727 us \| } 0) \| tick_do_update_jiffies64() { 0) 0.609 us \| } 0) 8.055 us \| } after: 0) \| tick_check_idle() { 0) \| ktime_get() { 0) \| read_tod_clock() { 0) 0.617 us \| } 0) 1.773 us \| } 0) \| tick_do_update_jiffies64() { 0) 0.593 us \| } 0) 4.477 us \| } Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: john stultz <johnstul@us.ibm.com> LKML-Reference: <20090929122533.206589318@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-05 07:53:53 +01:00
Bjorn Helgaas	1e5ad96790	resources: when allocate_resource() fails, leave resource untouched When "allocate_resource(root, new, size, ...)" fails, we currently clobber "new". This is inconvenient for the caller, who might care about the original contents of the resource. For example, when pci_bus_alloc_resource() fails, the "can't allocate mem resource %pR" message from pci_assign_resources() currently contains junk for the resource start/end. This patch delays the "new" update until we're about to return success. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2009-11-04 13:06:46 -08:00
Mike Galbraith	1b9508f683	sched: Rate-limit newidle Rate limit newidle to migration_cost. It's a win for all stages of sysbench oltp tests. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 19:13:48 +01:00
Mike Galbraith	a1f84a3ab8	sched: Check for an idle shared cache in select_task_rq_fair() When waking affine, check for an idle shared cache, and if found, wake to that CPU/sibling instead of the waker's CPU. This improves pgsql+oltp ramp up by roughly 8%. Possibly more for other loads, depending on overlap. The trade-off is a roughly 1% peak downturn if tasks are truly synchronous. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: <stable@kernel.org> LKML-Reference: <1256654138.17752.7.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 18:46:22 +01:00
Thomas Gleixner	663e695928	irq: Remove unused debug_poll_all_shared_irqs() commit `74296a8ed` added this function for debug purposes, but it was never used for anything. Remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-04 14:22:21 +01:00
Liuweni	24b26d4211	irq: Fix docbook comments Fix docbook comments to match the actual function names (set_irq_msi, handle_percpu_irq). Signed-off-by: Liuwenyi <qingshenlwy@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2009-11-04 14:13:14 +01:00
Rusty Russell	acc3f5d7ca	cpumask: Partition_sched_domains takes array of cpumask_var_t Currently partition_sched_domains() takes a 'struct cpumask *doms_new' which is a kmalloc'ed array of cpumask_t. You can't have such an array if 'struct cpumask' is undefined, as we plan for CONFIG_CPUMASK_OFFSTACK=y. So, we make this an array of cpumask_var_t instead: this is the same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case. Hence we add alloc_sched_domains() and free_sched_domains() functions. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <200911031453.40668.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 13:16:40 +01:00
Rusty Russell	e2c8806304	cpumask: Simplify sched_rt.c find_lowest_rq() wants to call pick_optimal_cpu() on the intersection of sched_domain_span(sd) and lowest_mask. Rather than doing a cpus_and into a temporary, we can open-code it. This actually makes the code slightly clearer, IMHO. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Gregory Haskins <ghaskins@novell.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <200911031453.15350.rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 13:16:38 +01:00
Masami Hiramatsu	77b44d1b7c	tracing/kprobes: Rename Kprobe-tracer to kprobe-event Rename Kprobes-based event tracer to kprobes-based tracing event (kprobe-event), since it is not a tracer but an extensible tracing event interface. This also changes CONFIG_KPROBE_TRACER to CONFIG_KPROBE_EVENT and sets it y by default. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> LKML-Reference: <20091104001247.3454.14131.stgit@harusame> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 13:02:48 +01:00
Ingo Molnar	a2e7127153	Merge commit 'v2.6.32-rc6' into perf/core Conflicts: tools/perf/Makefile Merge reason: Resolve the conflict, merge to upstream and merge in perf fixes so we can add a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 11:59:45 +01:00
Hiroshi Shimamoto	9824a2b728	sched: Remove unused cpu_nr_migrations() cpu_nr_migrations() is not used, remove it. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4AF12A66.6020609@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 11:43:43 +01:00
Hiroshi Shimamoto	1477b6a7ed	sched: Remove unused __schedule() declaration __schedule() had been removed. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4AF129C8.3030008@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-04 11:43:42 +01:00
Li Zefan	ed146b2594	ftrace: Fix unmatched locking in ftrace_regex_write() When a command is passed to the set_ftrace_filter, then the ftrace_regex_lock is still held going back to user space. # echo 'do_open : foo' > set_ftrace_filter (still holding ftrace_regex_lock when returning to user space!) Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4AEF7F8A.3080300@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-04 01:42:10 -05:00
Lai Jiangshan	f7112949f6	ring-buffer: Synchronize resizing buffer with reader lock We got a sudden panic when we reduced the size of the ringbuffer. We can reproduce the panic by the following steps: echo 1 > events/sched/enable cat trace_pipe > /dev/null & while ((1)) do echo 12000 > buffer_size_kb echo 512 > buffer_size_kb done (not more than 5 seconds, panic ...) Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4AF01735.9060409@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-11-04 00:04:20 -05:00
Frederic Weisbecker	97eaf5300b	perf/core: Add a callback to perf events A simple callback in a perf event can be used for multiple purposes. For example it is useful for triggered based events like hardware breakpoints that need a callback to dispatch a triggered breakpoint event. v2: Simplify a bit the callback attribution as suggested by Paul Mackerras Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "K.Prasad" <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mundt <lethal@linux-sh.org>	2009-11-03 19:11:53 +01:00
Arjan van de Ven	fb0459d75c	perf/core: Provide a kernel-internal interface to get to performance counters There are reasons for kernel code to ask for, and use, performance counters. For example, in CPU freq governors this tends to be a good idea, but there are other examples possible as well of course. This patch adds the needed bits to do enable this functionality; they have been tested in an experimental cpufreq driver that I'm working on, and the changes are all that I needed to access counters properly. [fweisbec@gmail.com: added pid to perf_event_create_kernel_counter so that we can profile a particular task too TODO: Have a better error reporting, don't just return NULL in fail case.] v2: Remove the wrong comment about the fact perf_event_create_kernel_counter must be called from a kernel thread. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: "K.Prasad" <prasad@linux.vnet.ibm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Avi Kivity <avi@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Avi Kivity <avi@redhat.com> LKML-Reference: <20090925122556.2f8bd939@infradead.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2009-11-03 18:04:17 +01:00
Linus Torvalds	38dc63459f	Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Remove some debug messages producing too much noise PM: Fix warning on suspend errors PM / Hibernate: Add newline to load_image() fail path PM / Hibernate: Fix error handling in save_image() PM / Hibernate: Fix blkdev refleaks PM / yenta: Split resume into early and late parts (rev. 4)	2009-11-03 07:52:57 -08:00
Ian Campbell	1d51075094	Correct nr_processes() when CPUs have been unplugged nr_processes() returns the sum of the per cpu counter process_counts for all online CPUs. This counter is incremented for the current CPU on fork() and decremented for the current CPU on exit(). Since a process does not necessarily fork and exit on the same CPU the process_count for an individual CPU can be either positive or negative and effectively has no meaning in isolation. Therefore calculating the sum of process_counts over only the online CPUs omits the processes which were started or stopped on any CPU which has since been unplugged. Only the sum of process_counts across all possible CPUs has meaning. The only caller of nr_processes() is proc_root_getattr() which calculates the number of links to /proc as stat->nlink = proc_root.nlink + nr_processes(); You don't have to be all that unlucky for the nr_processes() to return a negative value leading to a negative number of links (or rather, an apparently enormous number of links). If this happens then you can get failures where things like "ls /proc" start to fail because they got an -EOVERFLOW from some stat() call. Example with some debugging inserted to show what goes on: # ps haux\|wc -l nr_processes: CPU0: 90 nr_processes: CPU1: 1030 nr_processes: CPU2: -900 nr_processes: CPU3: -136 nr_processes: TOTAL: 84 proc_root_getattr. nlink 12 + nr_processes() 84 = 96 84 # echo 0 >/sys/devices/system/cpu/cpu1/online # ps haux\|wc -l nr_processes: CPU0: 85 nr_processes: CPU2: -901 nr_processes: CPU3: -137 nr_processes: TOTAL: -953 proc_root_getattr. nlink 12 + nr_processes() -953 = -941 75 # stat /proc/ nr_processes: CPU0: 84 nr_processes: CPU2: -901 nr_processes: CPU3: -137 nr_processes: TOTAL: -954 proc_root_getattr. nlink 12 + nr_processes() -954 = -942 File: `/proc/' Size: 0 Blocks: 0 IO Block: 1024 directory Device: 3h/3d Inode: 1 Links: 4294966354 Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2009-11-03 09:06:55.000000000 +0000 Modify: 2009-11-03 09:06:55.000000000 +0000 Change: 2009-11-03 09:06:55.000000000 +0000 I'm not 100% convinced that the per_cpu regions remain valid for offline CPUs, although my testing suggests that they do. If not then I think the correct solution would be to aggregate the process_count for a given CPU into a global base value in cpu_down(). This bug appears to pre-date the transition to git and it looks like it may even have been present in linux-2.6.0-test7-bk3 since it looks like the code Rusty patched in http://lwn.net/Articles/64773/ was already wrong. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-11-03 07:52:39 -08:00
Jiri Slaby	bf9fd67a03	PM / Hibernate: Add newline to load_image() fail path Finish a line by \n when load_image fails in the middle of loading. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-11-03 11:03:09 +01:00
Jiri Slaby	4ff277f9e4	PM / Hibernate: Fix error handling in save_image() There are too many retval variables in save_image(). Thus error return value from snapshot_read_next() may be ignored and only part of the snapshot (successfully) written. Remove 'error' variable, invert the condition in the do-while loop and convert the loop to use only 'ret' variable. Switch the rest of the function to consider only 'ret'. Also make sure we end printed line by \n if an error occurs. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-11-03 11:02:43 +01:00
Jiri Slaby	76b57e613f	PM / Hibernate: Fix blkdev refleaks While cruising through the swsusp code I found few blkdev reference leaks of resume_bdev. swsusp_read: remove blkdev_put altogether. Some fail paths do not do that. swsusp_check: make sure we always put a reference on fail paths software_resume: all fail paths between swsusp_check and swsusp_read omit swsusp_close. Add it in those cases. And since swsusp_read doesn't drop the reference anymore, do it here unconditionally. [rjw: Fixed a small coding style issue.] Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2009-11-03 11:01:46 +01:00
Mike Galbraith	b84ff7d6f1	sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c Eric Paris reported that commit `f685ceacab` causes boot time PREEMPT_DEBUG complaints. [ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314 [ 4.593043] caller is task_hot+0x86/0xd0 Since kthread_bind() messes with scheduler internals, move the body to sched.c, and lock the runqueue. Reported-by: Eric Paris <eparis@redhat.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Tested-by: Eric Paris <eparis@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1256813310.7574.3.camel@marge.simson.net> [ v2: fix !SMP build and clean up ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-11-03 07:25:00 +01:00

... 28 29 30 31 32 ...

11465 Commits