Commit Graph

6833 Commits

Author SHA1 Message Date
Tejun Heo
edcb463997 percpu, module: implement reserved allocation and use it for module percpu variables
Impact: add reserved allocation functionality and use it for module
	percpu variables

This patch implements reserved allocation from the first chunk.  When
setting up the first chunk, arch can ask to set aside certain number
of bytes right after the core static area which is available only
through a separate reserved allocator.  This will be used primarily
for module static percpu variables on architectures with limited
relocation range to ensure that the module perpcu symbols are inside
the relocatable range.

If reserved area is requested, the first chunk becomes reserved and
isn't available for regular allocation.  If the first chunk also
includes piggy-back dynamic allocation area, a separate chunk mapping
the same region is created to serve dynamic allocation.  The first one
is called static first chunk and the second dynamic first chunk.
Although they share the page map, their different area map
initializations guarantee they serve disjoint areas according to their
purposes.

If arch doesn't setup reserved area, reserved allocation is handled
like any other allocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06 14:33:59 +09:00
Steven Rostedt
770cb24345 tracing: add format files for ftrace default entries
Impact: allow user apps to read binary format of basic ftrace entries

Currently, only defined raw events export their formats so a binary
reader can parse them. There's no reason that the default ftrace entries
can't export their formats.

This patch adds a subsystem called "ftrace" in the events directory
that includes the ftrace entries for basic ftrace recorded items.

These only have three files in the events directory:

 type             : printf
 available_types  : printf
 format           : format for the event entry

For example:

 # cat /debug/tracing/events/ftrace/wakeup/format
name: wakeup
ID: 3
format:
        field:unsigned char type;       offset:0;       size:1;
        field:unsigned char flags;      offset:1;       size:1;
        field:unsigned char preempt_count;      offset:2;       size:1;
        field:int pid;  offset:4;       size:4;
        field:int tgid; offset:8;       size:4;

        field:unsigned int prev_pid;    offset:12;      size:4;
        field:unsigned char prev_prio;  offset:16;      size:1;
        field:unsigned char prev_state; offset:17;      size:1;
        field:unsigned int next_pid;    offset:20;      size:4;
        field:unsigned char next_prio;  offset:24;      size:1;
        field:unsigned char next_state; offset:25;      size:1;
        field:unsigned int next_cpu;    offset:28;      size:4;

print fmt: "%u:%u:%u  ==+ %u:%u:%u [%03u]"

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-05 21:46:44 -05:00
Steven Rostedt
33b0c229e3 tracing: move print of event format to separate file
Impact: clean up

Move the macro that creates the event format file to a separate header.
This will allow the default ftrace events to use this same macro
to create the formats to read those events.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-05 21:46:42 -05:00
Steven Rostedt
5e2336a0d4 tracing: make all file_operations const
Impact: cleanup

All file_operations structures should be constant. No one is going to
change them.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-05 21:46:40 -05:00
Ingo Molnar
40ada30f96 tracing: clean up menu
Clean up menu structure, introduce TRACING_SUPPORT switch that signals
whether an architecture supports various instrumentation mechanisms.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 21:53:25 +01:00
Ingo Molnar
a1413c89ae Merge branch 'x86/urgent' into x86/core
Conflicts:
	arch/x86/include/asm/fixmap_64.h
Semantic merge:
	arch/x86/include/asm/fixmap.h

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 21:48:50 +01:00
Frederic Weisbecker
8a0be9ef82 sched: don't rebalance if attached on NULL domain
Impact: fix function graph trace hang / drop pointless softirq on UP

While debugging a function graph trace hang on an old PII, I saw
that it consumed most of its time on the timer interrupt. And
the domain rebalancing softirq was the most concerned.

The timer interrupt calls trigger_load_balance() which will
decide if it is worth to schedule a rebalancing softirq.

In case of builtin UP kernel, no problem arises because there is
no domain question.

In case of builtin SMP kernel running on an SMP box, still no
problem, the softirq will be raised each time we reach the
next_balance time.

In case of builtin SMP kernel running on a UP box (most distros
provide default SMP kernels, whatever the box you have), then
the CPU is attached to the NULL sched domain. So a kind of
unexpected behaviour happen:

trigger_load_balance() -> raises the rebalancing softirq later
on softirq: run_rebalance_domains() -> rebalance_domains() where
the for_each_domain(cpu, sd) is not taken because of the NULL
domain we are attached at. Which means rq->next_balance is never
updated. So on the next timer tick, we will enter
trigger_load_balance() which will always reschedule() the
rebalacing softirq:

if (time_after_eq(jiffies, rq->next_balance))
	raise_softirq(SCHED_SOFTIRQ);

So for each tick, we process this pointless softirq.

This patch fixes it by checking if we are attached to the null
domain before raising the softirq, another possible fix would be
to set the maximal possible JIFFIES value to rq->next_balance if
we are attached to the NULL domain.

v2: build fix on UP

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af242d.1c07d00a.32d5.ffffc019@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 14:04:44 +01:00
Frederic Weisbecker
0012693ad4 tracing/function-graph-tracer: use the more lightweight local clock
Impact: decrease hangs risks with the graph tracer on slow systems

Since the function graph tracer can spend too much time on timer
interrupts, it's better now to use the more lightweight local
clock. Anyway, the function graph traces are more reliable on a
per cpu trace.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af243d.06e9300a.53ad.ffff840c@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 12:14:41 +01:00
Ingo Molnar
49d2d266ad Merge commit 'v2.6.29-rc7' into sched/core 2009-03-05 11:59:10 +01:00
David Rientjes
03d78913f0 lockdep: remove duplicate CONFIG_DEBUG_LOCKDEP definitions
Impact: cleanup

The atomic debug modifiers are already defined in
kernel/lockdep_internals.h.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <alpine.DEB.2.00.0903050222160.30401@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 11:46:00 +01:00
Ingo Molnar
a140feab42 Merge commit 'v2.6.29-rc7' into core/locking 2009-03-05 11:45:22 +01:00
David S. Miller
508827ff0a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/tokenring/tmspci.c
	drivers/net/ucc_geth_mii.c
2009-03-05 02:06:47 -08:00
Ingo Molnar
5e1607a00b tracing: rename ftrace_printk() => trace_printk()
Impact: cleanup

Use a more generic name - this also allows the prototype to move
to kernel.h and be generally available to kernel developers who
want to do some quick tracing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-05 10:24:48 +01:00
Steven Rostedt
e9d25fe6ea tracing: have latency tracers set the latency format
The latency tracers (irqsoff, preemptoff, preemptirqsoff, and wakeup)
are pretty useless with the default output format. This patch makes them
automatically enable the latency format when they are selected. They
also record the state of the latency option, and if it was not enabled
when selected, they disable it on reset.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-04 22:15:30 -05:00
Steven Rostedt
27d48be844 tracing: consolidate print_lat_fmt and print_trace_fmt
Impact: clean up

Both print_lat_fmt and print_trace_fmt do pretty much the same thing
except for one different function call. This patch consolidates the
two functions and adds an if statement to perform the difference.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-04 21:57:29 -05:00
Steven Rostedt
5fd73f8624 tracing: remove extra latency_trace method from trace structure
Impact: clean up

The trace and latency_trace function pointers are identical for
every tracer but the function tracer. The differences in the function
tracer are trivial (latency output puts paranthesis around parent).

This patch removes the latency_trace pointer and all prints will
now just use the trace output function pointer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-04 21:42:04 -05:00
Steven Rostedt
c032ef64d6 tracing: add latency output format option
With the removal of the latency_trace file, we lost the ability
to see some of the finer details in a trace. Like the state of
interrupts enabled, the preempt count, need resched, and if we
are in an interrupt handler, softirq handler or not.

This patch simply creates an option to bring back the old format.
This also removes the warning about an unused variable that held
the latency_trace file operations.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-04 20:34:24 -05:00
Steven Rostedt
e74da5235c tracing: fix seq read from trace files
The buffer used by trace_seq was updated incorrectly. Instead
of consuming what was actually read, it consumed the rest of the
buffer on reads.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-04 20:31:11 -05:00
Steven Rostedt
2dc5d12b1f tracing: do not return EFAULT if read copied anything
Impact: fix trace read to conform to standards

Andrew Morton, Theodore Tso and H. Peter Anvin brought to my attention
that a userspace read should not return -EFAULT if it succeeded in
copying anything. It should only return -EFAULT if it failed to copy
at all.

This patch modifies the check of copy_from_user and updates the return
code appropriately.

I also used H. Peter Anvin's short cut rule to just test ret == count.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-04 19:10:05 -05:00
Steven Rostedt
4f3640f8a3 ring-buffer: fix timestamp in partial ring_buffer_page_read
If a partial ring_buffer_page_read happens, then some of the
incremental timestamps may be lost. This patch writes the
recent timestamp into the page that is passed back to the caller.

A partial ring_buffer_page_read is where the full page would not
be written back to the user, and instead, just part of the page
is copied to the user. A full page would be a page swap with the
ring buffer and the timestamps would be correct.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-04 19:01:45 -05:00
Steven Rostedt
e543ad7691 tracing: add cpu_file intialization for ftrace_dump
Impact: fix to ftrace_dump output corruption

The commit: b04cc6b1f6
  tracing/core: introduce per cpu tracing files

added a new field to the iterator called cpu_file. This was a handle
to differentiate between the per cpu trace output files and the
all cpu "trace" file. The all cpu "trace" file required setting this
to TRACE_PIPE_ALL_CPU.

The problem is that the ftrace_dump sets up its own iterator but was
not updated to handle this change. The result was only CPU 0 printing
out on crash and a lot of "<0>"'s also being printed.

Reported-by: Thomas Gleixner <tglx@linuxtronix.de>
Tested-by: Darren Hart <dvhtc@us.ibm.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-04 18:32:28 -05:00
Eric Dumazet
64ca5ab913 rcu: increment quiescent state counter in ksoftirqd()
If a machine is flooded by network frames, a cpu can loop
100% of its time inside ksoftirqd() without calling schedule().
This can delay RCU grace period to insane values.

Adding rcu_qsctr_inc() call in ksoftirqd() solves this problem.

Paul: "This regression was a result of the recent change from
"schedule()" to "cond_resched()", which got rid of that quiescent
state in the common case where a reschedule is not needed".

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-04 22:08:45 +01:00
Peter Zijlstra
efed792d67 tracing: add lockdep tracepoints for lock acquire/release
Augment the traces with lock names when lockdep is available:

 1)               |  down_read_trylock() {
 1)               |    _spin_lock_irqsave() {
 1)               |      /* lock_acquire: &sem->wait_lock */
 1)   4.201 us    |    }
 1)               |    _spin_unlock_irqrestore() {
 1)               |      /* lock_release: &sem->wait_lock */
 1)   3.523 us    |    }
 1)               |  /* lock_acquire: try read &mm->mmap_sem */
 1) + 13.386 us   |  }
 1)   1.635 us    |  find_vma();
 1)               |  handle_mm_fault() {
 1)               |    __do_fault() {
 1)               |      filemap_fault() {
 1)               |        find_lock_page() {
 1)               |          find_get_page() {
 1)               |            /* lock_acquire: read rcu_read_lock */
 1)               |            /* lock_release: rcu_read_lock */
 1)   5.697 us    |          }
 1)   8.158 us    |        }
 1) + 11.079 us   |      }
 1)               |      _spin_lock() {
 1)               |        /* lock_acquire: __pte_lockptr(page) */
 1)   3.949 us    |      }
 1)   1.460 us    |      page_add_file_rmap();
 1)               |      _spin_unlock() {
 1)               |        /* lock_release: __pte_lockptr(page) */
 1)   3.115 us    |      }
 1)               |      unlock_page() {
 1)   1.421 us    |        page_waitqueue();
 1)   1.220 us    |        __wake_up_bit();
 1)   6.519 us    |      }
 1) + 34.328 us   |    }
 1) + 37.452 us   |  }
 1)               |  up_read() {
 1)               |  /* lock_release: &mm->mmap_sem */
 1)               |    _spin_lock_irqsave() {
 1)               |      /* lock_acquire: &sem->wait_lock */
 1)   3.865 us    |    }
 1)               |    _spin_unlock_irqrestore() {
 1)               |      /* lock_release: &sem->wait_lock */
 1)   8.562 us    |    }
 1) + 17.370 us   |  }

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: =?ISO-8859-1?Q?T=F6r=F6k?= Edwin <edwintorok@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1236166375.5330.7209.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-04 18:49:58 +01:00
Ingo Molnar
28b1bd1cbc Merge branch 'core/locking' into tracing/ftrace 2009-03-04 18:49:19 +01:00
Peter Zijlstra
26575e28df lockdep: remove extra "irq" string
Impact: clarify lockdep printk text

print_irq_inversion_bug() gets handed state strings of the form

  "HARDIRQ", "SOFTIRQ", "RECLAIM_FS"

and appends "-irq-{un,}safe" to them, which is either redudant for *IRQ or
confusing in the RECLAIM_FS case.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1236175192.5330.7585.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-04 18:38:34 +01:00
Peter Zijlstra
1c21f14ec4 lockdep: fix incorrect state name
In the recent mark_lock_irq() rework a bug snuck in that would report the
state of write locks causing irq inversion under a read lock as a read
lock.

Fix this by masking the read bit of the state when validating write
dependencies.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1236172646.5330.7450.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-04 18:38:33 +01:00
Ingo Molnar
2602c3ba45 Merge branch 'rfc/splice/tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-04 11:15:42 +01:00
Ingo Molnar
a1be621dfa Merge branch 'tracing/ftrace'; commit 'v2.6.29-rc7' into tracing/core 2009-03-04 11:14:47 +01:00
Steven Rostedt
2cadf9135e tracing: add binary buffer files for use with splice
Impact: new feature

This patch creates a directory of files that correspond to the
per CPU ring buffers. These are binary files and are made to
be used with splice. This is the fastest way to extract data from
the ftrace ring buffers.

Thanks to Jiaying Zhang for pushing me to get this code fixed,
 and to Eduard - Gabriel Munteanu for his splice code that helped
 me debug my code.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-03 21:01:55 -05:00
Steven Rostedt
474d32b68d ring-buffer: make ring_buffer_read_page read from start on partial page
Impact: dont leave holes in read buffer page

The ring_buffer_read_page swaps a given page with the reader page
of the ring buffer, if certain conditions are set:

 1) requested length is big enough to hold entire page data

 2) a writer is not currently on the page

 3) the page is not partially consumed.

Instead of swapping with the supplied page. It copies the data to
the supplied page instead. But currently the data is copied in the
same offset as the source page. This causes a hole at the start
of the reader page. This complicates the use of this function.
Instead, it should copy the data at the beginning of the function
and update the index fields accordingly.

Other small clean ups are also done in this patch.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-03 20:52:27 -05:00
Steven Rostedt
e3d6bf0a07 ring-buffer: replace sizeof of event header with offsetof
Impact: fix to possible alignment problems on some archs.

Some arch compilers include an NULL char array in the sizeof field.
Since the ring_buffer_event type includes one of these, it is better
to use the "offsetof" instead, to avoid strange bugs on these archs.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-03 20:52:01 -05:00
Steven Rostedt
ef7a4a1614 ring-buffer: fix ring_buffer_read_page
The ring_buffer_read_page was broken if it were to only copy part
of the page. This patch fixes that up as well as adds a parameter
to allow a length field, in order to only copy part of the buffer page.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-03 20:51:24 -05:00
Steven Rostedt
41be4da4e8 ring-buffer: reset write field for ring_buffer_read_page
Impact: fix ring_buffer_read_page

After a page is swapped into the ring buffer, the write field must
also be reset.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-03 20:50:54 -05:00
Ingo Molnar
91d75e209b Merge branch 'x86/core' into core/percpu 2009-03-04 02:29:19 +01:00
Ingo Molnar
8b0e5860cb Merge branches 'x86/apic', 'x86/cpu', 'x86/fixmap', 'x86/mm', 'x86/sched', 'x86/setup-lzma', 'x86/signal' and 'x86/urgent' into x86/core 2009-03-04 02:22:31 +01:00
Linus Torvalds
219f170a85 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: don't allow setuid to succeed if the user does not have rt bandwidth
  sched_rt: don't start timer when rt bandwidth disabled
2009-03-03 14:33:20 -08:00
Linus Torvalds
b24746c7be Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  rcu: Teach RCU that idle task is not quiscent state at boot
2009-03-03 14:32:04 -08:00
Ingo Molnar
3612fdf780 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-03 16:02:16 +01:00
Steven Rostedt
633ddaa7f4 tracing: fix return value to registering events
The registering of events had the return value check backwards.
A zero returned is success, the check had it as a failure.

This patch also fixes a missing "\n" in the warning that the check
failed.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-03 09:43:50 -05:00
Roland McGrath
5b1017404a x86-64: seccomp: fix 32/64 syscall hole
On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call.  A 64-bit process make a 32-bit system call with int $0x80.

In both these cases under CONFIG_SECCOMP=y, secure_computing() will use
the wrong system call number table.  The fix is simple: test TS_COMPAT
instead of TIF_IA32.  Here is an example exploit:

	/* test case for seccomp circumvention on x86-64

	   There are two failure modes: compile with -m64 or compile with -m32.

	   The -m64 case is the worst one, because it does "chmod 777 ." (could
	   be any chmod call).  The -m32 case demonstrates it was able to do
	   stat(), which can glean information but not harm anything directly.

	   A buggy kernel will let the test do something, print, and exit 1; a
	   fixed kernel will make it exit with SIGKILL before it does anything.
	*/

	#define _GNU_SOURCE
	#include <assert.h>
	#include <inttypes.h>
	#include <stdio.h>
	#include <linux/prctl.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <asm/unistd.h>

	int
	main (int argc, char **argv)
	{
	  char buf[100];
	  static const char dot[] = ".";
	  long ret;
	  unsigned st[24];

	  if (prctl (PR_SET_SECCOMP, 1, 0, 0, 0) != 0)
	    perror ("prctl(PR_SET_SECCOMP) -- not compiled into kernel?");

	#ifdef __x86_64__
	  assert ((uintptr_t) dot < (1UL << 32));
	  asm ("int $0x80 # %0 <- %1(%2 %3)"
	       : "=a" (ret) : "0" (15), "b" (dot), "c" (0777));
	  ret = snprintf (buf, sizeof buf,
			  "result %ld (check mode on .!)\n", ret);
	#elif defined __i386__
	  asm (".code32\n"
	       "pushl %%cs\n"
	       "pushl $2f\n"
	       "ljmpl $0x33, $1f\n"
	       ".code64\n"
	       "1: syscall # %0 <- %1(%2 %3)\n"
	       "lretl\n"
	       ".code32\n"
	       "2:"
	       : "=a" (ret) : "0" (4), "D" (dot), "S" (&st));
	  if (ret == 0)
	    ret = snprintf (buf, sizeof buf,
			    "stat . -> st_uid=%u\n", st[7]);
	  else
	    ret = snprintf (buf, sizeof buf, "result %ld\n", ret);
	#else
	# error "not this one"
	#endif

	  write (1, buf, ret);

	  syscall (__NR_exit, 1);
	  return 2;
	}

Signed-off-by: Roland McGrath <roland@redhat.com>
[ I don't know if anybody actually uses seccomp, but it's enabled in
  at least both Fedora and SuSE kernels, so maybe somebody is. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-02 15:41:30 -08:00
Peter Zijlstra
044d408409 genirq: assert that irq handlers are indeed running in hardirq context
Make sure the genirq layer handlers are indeed running handlers
in hardirq context. That is the genirq expectation and doing
anything else is broken.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1236006812.5330.632.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-03 00:05:45 +01:00
Ingo Molnar
ed662d9b2a Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-03-02 22:38:31 +01:00
Ingo Molnar
fdfa66ab45 Merge branches 'tracing/ftrace', 'tracing/mmiotrace' and 'linus' into tracing/core 2009-03-02 22:37:35 +01:00
Ingo Molnar
c02368a9d0 Merge branch 'linus' into irq/genirq 2009-03-02 22:08:56 +01:00
Steven Rostedt
96ccd21cd1 tracing: add print format to event trace format files
This patch adds the internal print format used to print the raw events
to the event trace point format file.

 # cat /debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 29
format:
        field:unsigned char type;       offset:0;       size:1;
        field:unsigned char flags;      offset:1;       size:1;
        field:unsigned char preempt_count;      offset:2;       size:1;
        field:int pid;  offset:4;       size:4;
        field:int tgid; offset:8;       size:4;

        field:pid_t prev_pid;   offset:12;      size:4;
        field:int prev_prio;    offset:16;      size:4;
        field special:char next_comm[TASK_COMM_LEN];    offset:20;      size:16;
        field:pid_t next_pid;   offset:36;      size:4;
        field:int next_prio;    offset:40;      size:4;

print fmt: "prev %d:%d ==> next %s:%d:%d"

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-02 15:22:21 -05:00
Steven Rostedt
c5e4e19271 tracing: add trace name and id to event formats
To be able to identify the trace in the binary format output, the
id of the trace event (which is dynamically assigned) must also be listed.

This patch adds the name of the trace point as well as the id assigned.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-02 15:10:02 -05:00
Steven Rostedt
91729ef966 tracing: add ftrace headers to event format files
This patch includes the ftrace header to the event formats files:

 # cat /debug/tracing/events/sched/sched_switch/format
        field:unsigned char type;       offset:0;       size:1;
        field:unsigned char flags;      offset:1;       size:1;
        field:unsigned char preempt_count;      offset:2;       size:1;
        field:int pid;  offset:4;       size:4;
        field:int tgid; offset:8;       size:4;

        field:pid_t prev_pid;   offset:12;      size:4;
        field:int prev_prio;    offset:16;      size:4;
        field special:char next_comm[TASK_COMM_LEN];    offset:20;      size:16;
        field:pid_t next_pid;   offset:36;      size:4;
        field:int next_prio;    offset:40;      size:4;

A blank line is used as a deliminator between the ftrace header and the
trace point fields.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-02 15:03:01 -05:00
Steven Rostedt
981d081ec8 tracing: add format file to describe event struct fields
This patch adds the "format" file to the trace point event directory.
This is based off of work by Tom Zanussi, in which a file is exported
to be tread from user land such that a user space app may read the
binary record stored in the ring buffer.

 # cat /debug/tracing/events/sched/sched_switch/format
        field:pid_t prev_pid;   offset:12;      size:4;
        field:int prev_prio;    offset:16;      size:4;
        field special:char next_comm[TASK_COMM_LEN];    offset:20;      size:16;
        field:pid_t next_pid;   offset:36;      size:4;
        field:int next_prio;    offset:40;      size:4;

Idea-from: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-02 14:27:27 -05:00
Steven Rostedt
f9520750c4 tracing: make trace_seq_reset global and rename to trace_seq_init
Impact: clean up

The trace_seq functions may be used separately outside of the ftrace
iterator. The trace_seq_reset is needed for these operations.

This patch also renames trace_seq_reset to the more appropriate
trace_seq_init.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-02 14:08:51 -05:00
Steven Rostedt
11a241a330 tracing: add protection around modify trace event fields
The trace event objects are currently not proctected against
reentrancy. This patch adds a mutex around the modifications of
the trace event fields.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-02 11:49:04 -05:00
Steven Rostedt
d20e3b0384 tracing: add TRACE_FIELD_SPECIAL to record complex entries
Tom Zanussi pointed out that the simple TRACE_FIELD was not enough to
record trace data that required memcpy. This patch addresses this issue
by adding a TRACE_FIELD_SPECIAL. The format is similar to TRACE_FIELD
but looks like so:

  TRACE_FIELD_SPECIAL(type_item, item, cmd)

What TRACE_FIELD gave was:

  TRACE_FIELD(type, item, assign)

The TRACE_FIELD would be used in declaring a structure:

  struct {
	type	item;
  };

And later assign it via:

  entry->item = assign;

What TRACE_FIELD_SPECIAL gives us is:

In the declaration of the structure:

  struct {
	type_item;
  };

And the assignment:

  cmd;

This change log will explain the one example used in the patch:

 TRACE_EVENT_FORMAT(sched_switch,
	TPPROTO(struct rq *rq, struct task_struct *prev,
		struct task_struct *next),
	TPARGS(rq, prev, next),
	TPFMT("task %s:%d ==> %s:%d",
	      prev->comm, prev->pid, next->comm, next->pid),
	TRACE_STRUCT(
		TRACE_FIELD(pid_t, prev_pid, prev->pid)
		TRACE_FIELD(int, prev_prio, prev->prio)
		TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
				    next_comm,
				    TPCMD(memcpy(TRACE_ENTRY->next_comm,
						 next->comm,
						 TASK_COMM_LEN)))
		TRACE_FIELD(pid_t, next_pid, next->pid)
		TRACE_FIELD(int, next_prio, next->prio)
	),
	TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
	);

 The struct will be create as:

  struct {
	pid_t		prev_pid;
	int		prev_prio;
	char next_comm[TASK_COMM_LEN];
	pid_t		next_pid;
	int		next_prio;
  };

Note the TRACE_ENTRY in the cmd part of TRACE_SPECIAL. TRACE_ENTRY will
be set by the tracer to point to the structure inside the trace buffer.

  entry->prev_pid	= prev->pid;
  entry->prev_prio	= prev->prio;
  memcpy(entry->next_comm, next->comm, TASK_COMM_LEN);
  entry->next_pid	= next->pid;
  entry->next_prio	= next->prio

Reported-by: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-03-02 10:53:15 -05:00
Wang Chen
b67802ea80 sched: kill unused parameter of pick_next_task()
Impact: micro-optimization

Parameter "prev" is not used really.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-02 12:02:53 +01:00
Ingo Molnar
5512b3ece0 Merge branches 'sched/clock', 'sched/urgent' and 'linus' into sched/core 2009-03-02 12:02:36 +01:00
David S. Miller
aa4abc9bcc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-tx.c
	net/8021q/vlan_core.c
	net/core/dev.c
2009-03-01 21:35:16 -08:00
Ingo Molnar
55f2b78995 Merge branch 'x86/urgent' into x86/pat 2009-03-01 12:47:58 +01:00
Ingo Molnar
acdb2c2879 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-28 10:15:58 +01:00
Steven Rostedt
fd99498989 tracing: add raw fast tracing interface for trace events
This patch adds the interface to enable the C style trace points.
In the directory /debugfs/tracing/events/subsystem/event
We now have three files:

 enable : values 0 or 1 to enable or disable the trace event.

 available_types: values 'raw' and 'printf' which indicate the tracing
       types available for the trace point. If a developer does not
       use the TRACE_EVENT_FORMAT macro and just uses the TRACE_FORMAT
       macro, then only 'printf' will be available. This file is
       read only.

 type: values 'raw' or 'printf'. This indicates which type of tracing
       is active for that trace point. 'printf' is the default and
       if 'raw' is not available, this file is read only.

 # echo raw > /debug/tracing/events/sched/sched_wakeup/type
 # echo 1 > /debug/tracing/events/sched/sched_wakeup/enable

 Will enable the C style tracing for the sched_wakeup trace point.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-28 04:04:03 -05:00
Steven Rostedt
c32e827b25 tracing: add raw trace point recording infrastructure
Impact: lower overhead tracing

The current event tracer can automatically pick up trace points
that are registered with the TRACE_FORMAT macro. But it required
a printf format string and parsing. Although, this adds the ability
to get guaranteed information like task names and such, it took
a hit in overhead processing. This processing can add about 500-1000
nanoseconds overhead, but in some cases that too is considered
too much and we want to shave off as much from this overhead as
possible.

Tom Zanussi recently posted tracing patches to lkml that are based
on a nice idea about capturing the data via C structs using
STRUCT_ENTER, STRUCT_EXIT type of macros.

I liked that method very much, but did not like the implementation
that required a developer to add data/code in several disjoint
locations.

This patch extends the event_tracer macros to do a similar "raw C"
approach that Tom Zanussi did. But instead of having the developers
needing to tweak a bunch of code all over the place, they can do it
all in one macro - preferably placed near the code that it is
tracing. That makes it much more likely that tracepoints will be
maintained on an ongoing basis by the code they modify.

The new macro TRACE_EVENT_FORMAT is created for this approach. (Note,
a developer may still utilize the more low level DECLARE_TRACE macros
if they don't care about getting their traces automatically in the event
tracer.)

They can also use the existing TRACE_FORMAT if they don't need to code
the tracepoint in C, but just want to use the convenience of printf.

So if the developer wants to "hardwire" a tracepoint in the fastest
possible way, and wants to acquire their data via a user space utility
in a raw binary format, or wants to see it in the trace output but not
sacrifice any performance, then they can implement the faster but
more complex TRACE_EVENT_FORMAT macro.

Here's what usage looks like:

  TRACE_EVENT_FORMAT(name,
	TPPROTO(proto),
	TPARGS(args),
	TPFMT(fmt, fmt_args),
	TRACE_STUCT(
		TRACE_FIELD(type1, item1, assign1)
		TRACE_FIELD(type2, item2, assign2)
			[...]
	),
	TPRAWFMT(raw_fmt)
	);

Note name, proto, args, and fmt, are all identical to what TRACE_FORMAT
uses.

 name: is the unique identifier of the trace point
 proto: The proto type that the trace point uses
 args: the args in the proto type
 fmt: printf format to use with the event printf tracer
 fmt_args: the printf argments to match fmt

 TRACE_STRUCT starts the ability to create a structure.
 Each item in the structure is defined with a TRACE_FIELD

  TRACE_FIELD(type, item, assign)

 type: the C type of item.
 item: the name of the item in the stucture
 assign: what to assign the item in the trace point callback

 raw_fmt is a way to pretty print the struct. It must match
  the order of the items are added in TRACE_STUCT

 An example of this would be:

 TRACE_EVENT_FORMAT(sched_wakeup,
	TPPROTO(struct rq *rq, struct task_struct *p, int success),
	TPARGS(rq, p, success),
	TPFMT("task %s:%d %s",
	      p->comm, p->pid, success?"succeeded":"failed"),
	TRACE_STRUCT(
		TRACE_FIELD(pid_t, pid, p->pid)
		TRACE_FIELD(int, success, success)
	),
	TPRAWFMT("task %d success=%d")
	);

 This creates us a unique struct of:

 struct {
	pid_t		pid;
	int		success;
 };

 And the way the call back would assign these values would be:

	entry->pid = p->pid;
	entry->success = success;

The nice part about this is that the creation of the assignent is done
via macro magic in the event tracer.  Once the TRACE_EVENT_FORMAT is
created, the developer will then have a faster method to record
into the ring buffer. They do not need to worry about the tracer itself.

The developer would only need to touch the files in include/trace/*.h

Again, I would like to give special thanks to Tom Zanussi for this
nice idea.

Idea-from: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-28 03:09:32 -05:00
Steven Rostedt
ef5580d0ff tracing: add interface to write into current tracer buffer
Right now all tracers must manage their own trace buffers. This was
to enforce tracers to be independent in case we finally decide to
allow each tracer to have their own trace buffer.

But now we are adding event tracing that writes to the current tracer's
buffer. This adds an interface to allow events to write to the current
tracer buffer without having to manage its own. Since event tracing
has no "tracer", and is just a way to hook into any other tracer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-28 03:06:44 -05:00
Steven Rostedt
b628b3e629 tracing: make the set_event and available_events subsystem aware
This patch makes the event files, set_event and available_events
aware of the subsystem.

Now you can enable an entire subsystem with:

  echo 'irq:*' > set_event

Note: the '*' is not needed.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-28 03:05:40 -05:00
Ingo Molnar
e9abf4c59d Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-28 09:03:58 +01:00
Steven Rostedt
6ecc2d1ca3 tracing: add subsystem level to trace events
If a trace point header defines TRACE_SYSTEM, then it will add the
following trace points into that event system.

If include/trace/irq_event_types.h has:

 #define TRACE_SYSTEM irq

at the top and

 #undef TRACE_SYSTEM

at the bottom, then a directory "irq" will be created in the
/debug/tracing/events directory. Inside that directory will contain the
two trace points that are defined in include/trace/irq_event_types.h.

Only adding the above to irq and not to sched, we get:

 # ls /debug/tracing/events/
irq                     sched_process_exit  sched_signal_send  sched_wakeup_new
sched_kthread_stop      sched_process_fork  sched_switch
sched_kthread_stop_ret  sched_process_free  sched_wait_task
sched_migrate_task      sched_process_wait  sched_wakeup

 # ls /debug/tracing/events/irq
irq_handler_entry  irq_handler_exit

If we add #define TRACE_SYSTEM sched to the trace/sched_event_types.h
then the rest of the trace events will be put in a sched directory
within the events directory.

I've been playing with this idea of the subsystem for a while, but
recently Tom Zanussi posted some patches to lkml that included this
method. Tom's approach was clean and got me to finally put some effort
to clean up the event trace points.

Thanks to Tom Zanussi for demonstrating how nice the subsystem
method is.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-28 02:59:43 -05:00
Steven Rostedt
eb594e45f6 tracing: move trace point formats to files in include/trace directory
Impact: clean up

To further facilitate the ease of adding trace points for developers, this
patch creates include/trace/trace_events.h and
include/trace/trace_event_types.h.

The former file will hold the trace/<type>.h files and the latter will hold
the trace/<type>_event_types.h files.

To create new tracepoints and to have them automatically
appear in the event tracer, a developer makes the trace/<type>.h file
which includes <linux/tracepoint.h> and the trace/<type>_event_types.h file.

The trace/<type>_event_types.h file will hold the TRACE_FORMAT
macros.

Then add the trace/<type>.h file to trace/trace_events.h,
and add the trace/<type>_event_types.h to the trace_event_types.h file.

No need to modify files elsewhere.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-28 02:58:50 -05:00
David Howells
5170836679 Fix recursive lock in free_uid()/free_user_ns()
free_uid() and free_user_ns() are corecursive when CONFIG_USER_SCHED=n,
but free_user_ns() is called from free_uid() by way of uid_hash_remove(),
which requires uidhash_lock to be held.  free_user_ns() then calls
free_uid() to complete the destruction.

Fix this by deferring the destruction of the user_namespace.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27 16:26:21 -08:00
Steven Rostedt
0cfe82451d tracing: replace kzalloc with kcalloc
Impact: clean up

kcalloc is a better approach to allocate a NULL array.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-27 10:51:10 -05:00
Dhaval Giani
54e9912428 sched: don't allow setuid to succeed if the user does not have rt bandwidth
Impact: fix hung task with certain (non-default) rt-limit settings

Corey Hickey reported that on using setuid to change the uid of a
rt process, the process would be unkillable and not be running.
This is because there was no rt runtime for that user group. Add
in a check to see if a user can attach an rt task to its task group.
On failure, return EINVAL, which is also returned in
CONFIG_CGROUP_SCHED.

Reported-by: Corey Hickey <bugfood-ml@fatooh.org>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-27 11:11:53 +01:00
Ingo Molnar
da74ff0f9b Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace
Conflicts:
	kernel/sched_clock.c
2009-02-27 09:06:11 +01:00
Ingo Molnar
1b49061d40 Merge branch 'sched/clock' into tracing/ftrace
Conflicts:
	kernel/sched_clock.c
2009-02-27 08:35:19 +01:00
Steven Rostedt
5c6a3ae1b4 tracing: use newline separator for trace options list
Impact: clean up

Instead of listing the trace options like:

 # cat /debug/tracing/trace_options
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin noblock nostacktrace nosched-tree ftrace_printk noftrace_preempt nobranch annotate nouserstacktrace nosym-userobj

We now list them like:

 # cat /debug/tracing/trace_options
print-parent
nosym-offset
nosym-addr
noverbose
noraw
nohex
nobin
noblock
nostacktrace
nosched-tree
ftrace_printk
noftrace_preempt
nobranch
annotate
nouserstacktrace
nosym-userobj

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-27 00:22:21 -05:00
Steven Rostedt
85a2f9b46f tracing: use pointer error returns for __tracing_open
Impact: fix compile warning and clean up

When I first wrote __tracing_open, instead of passing the error
code via the ERR_PTR macros, I lazily used a separate parameter
to hold the return for errors.

When Frederic Weisbecker updated that function, he used the Linux
kernel ERR_PTR for the returns. This caused the parameter return
to possibly not be initialized on error. gcc correctly pointed this
out with a warning.

This patch converts the entire function to use the Linux kernel
ERR_PTR macro methods.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-27 00:12:38 -05:00
Steven Rostedt
d8e83d26b5 tracing: add protection around open use of current_tracer
Impact: fix to possible race conditions

There's some uses of current_tracer that is not protected by the
trace_types_lock. There is a small chance that a sysadmin changes
the tracer while the current_tracer is being referenced.

If the race is hit, it is unlikely to cause any harm since the
tracers are constant and are not freed. But some strang side
effects may occur.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-26 23:55:58 -05:00
Steven Rostedt
577b785f55 tracing: add tracer dependent options to options directory
This patch adds the tracer dependent options dynamically to the
options directory when the tracer is activated. These options are
removed when the tracer is deactivated.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-26 23:43:05 -05:00
Steven Rostedt
a825907507 tracing: add options directory and core option files
This patch creates an options directory in the debugfs, that contains
the available tracing options. These files contain 1 or 0, where 1
is the option is enabled and 0 it is disabled.

Simply echoing in 1 will enable the option and 0 will disable it.
This patch only contains the core options, not the tracer options.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-26 22:22:46 -05:00
Serge E. Hallyn
1d1e97562e keys: distinguish per-uid keys in different namespaces
per-uid keys were looked by uid only.  Use the user namespace
to distinguish the same uid in different namespaces.

This does not address key_permission.  So a task can for instance
try to join a keyring owned by the same uid in another namespace.
That will be handled by a separate patch.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-02-27 12:35:06 +11:00
Peter Zijlstra
8325d9c09d sched_clock: cleanups
- remove superfluous checks in __update_sched_clock()
- skip sched_clock_tick() for sched_clock_stable
- reinstate the simple !HAVE_UNSTABLE_SCHED_CLOCK code to please the bloatwatch

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 21:56:07 +01:00
Ingo Molnar
5d0859cef2 Merge branch 'sched/clock' into tracing/ftrace
Conflicts:
	kernel/sched_clock.c
2009-02-26 21:21:59 +01:00
Ingo Molnar
b342501cd3 sched: allow architectures to specify sched_clock_stable
Allow CONFIG_HAVE_UNSTABLE_SCHED_CLOCK architectures to still specify
that their sched_clock() implementation is reliable.

This will be used by x86 to switch on a faster sched_clock_cpu()
implementation on certain CPU types.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 21:20:22 +01:00
John Stultz
a2a5ac8650 time: ntp: fix bug in ntp_update_offset() & do_adjtimex(), fix
The time_status conditional was accidentally placed right after we clear
the checked time_status bits, which causes us to take the conditional
every time through. This fixes it by moving the conditional to before we
clear the time_status bits.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Clark Williams <williams@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 19:39:47 +01:00
Ingo Molnar
14131f2f98 tracing: implement trace_clock_*() APIs
Impact: implement new tracing timestamp APIs

Add three trace clock variants, with differing scalability/precision
tradeoffs:

 -   local: CPU-local trace clock
 -  medium: scalable global clock with some jitter
 -  global: globally monotonic, serialized clock

Make the ring-buffer use the local trace clock internally.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 18:44:06 +01:00
Ingo Molnar
6409c4da28 sched: sched_clock() improvement: use in_nmi()
make sure we dont execute more complex sched_clock() code in NMI context.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 18:44:05 +01:00
Jason Baron
af39241b90 tracing, genirq: add irq enter and exit trace events
Impact: add new tracepoints

Add them to the generic IRQ code, that way every architecture
gets these new tracepoints, not just x86.

Using Steve's new 'TRACE_FORMAT', I can get function graph
trace as follows using the original two IRQ tracepoints:

 3)               |    handle_IRQ_event() {
 3)               |    /* (irq_handler_entry) irq=28 handler=eth0 */
 3)               |    e1000_intr_msi() {
 3)   2.460 us    |      __napi_schedule();
 3)   9.416 us    |    }
 3)               |    /* (irq_handler_exit) irq=28 handler=eth0 return=handled */
 3) + 22.935 us   |  }

Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 18:43:50 +01:00
Hiroshi Shimamoto
cac64d00c2 sched_rt: don't start timer when rt bandwidth disabled
Impact: fix incorrect condition check

No need to start rt bandwidth timer when rt bandwidth is disabled.
If this timer starts, it may stop at sched_rt_period_timer() on the first time.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 14:18:55 +01:00
Frederic Weisbecker
8656e7a2fa tracing/core: make the per cpu trace files in per cpu directories
Impact: restructure the VFS layout of per CPU trace buffers

The per cpu trace files are all in a single directory:
/debug/tracing/per_cpu. In case of a large number of cpu, the
content of this directory becomes messy so we create now one
directory per cpu inside /debug/tracing/per_cpu which contain
each their own trace_pipe and trace files.

Ie:

 /debug/tracing$ ls -R per_cpu
 per_cpu:
 cpu0  cpu1

 per_cpu/cpu0:
 trace  trace_pipe

 per_cpu/cpu1:
 trace  trace_pipe

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 14:04:08 +01:00
Li Zefan
c40c6f85a7 cpuacct: add a branch prediction
cpuacct_charge() is in fast-path, and checking of !cpuacct_susys.active
always returns false after cpuacct has been initialized at system boot.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 13:22:38 +01:00
Ingo Molnar
4434e51564 Merge branches 'sched/cleanups', 'sched/urgent' and 'linus' into sched/core 2009-02-26 13:22:13 +01:00
Paul E. McKenney
a682604838 rcu: Teach RCU that idle task is not quiscent state at boot
This patch fixes a bug located by Vegard Nossum with the aid of
kmemcheck, updated based on review comments from Nick Piggin,
Ingo Molnar, and Andrew Morton.  And cleans up the variable-name
and function-name language.  ;-)

The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state.  This in turn causes
RCU to prematurely end grace periods during this time.

This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
function to ignore the idle task as a quiescent state until the
system has started up the scheduler in rest_init(), introducing a
new non-API function rcu_idle_now_means_idle() to inform RCU of this
transition.  RCU maintains an internal rcu_idle_cpu_truthful variable
to track this state, which is then used by rcu_check_callback() to
determine if it should believe idle_cpu().

Because this patch has the effect of disallowing RCU grace periods
during long stretches of the boot-up sequence, this patch also introduces
Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
no-op if num_online_cpus() returns 1.  This allows boot-time code that
calls synchronize_rcu() to proceed normally.  Note, however, that RCU
callbacks registered by call_rcu() will likely queue up until later in
the boot sequence.  Although rcuclassic and rcutree can also use this
same optimization after boot completes, rcupreempt must restrict its
use of this optimization to the portion of the boot sequence before the
scheduler starts up, given that an rcupreempt RCU read-side critical
section may be preeempted.

In addition, this patch takes Nick Piggin's suggestion to make the
system_state global variable be __read_mostly.

Changes since v4:

o	Changes the name of the introduced function and variable to
	be less emotional.  ;-)

Changes since v3:

o	WARN_ON(nr_context_switches() > 0) to verify that RCU
	switches out of boot-time mode before the first context
	switch, as suggested by Nick Piggin.

Changes since v2:

o	Created rcu_blocking_is_gp() internal-to-RCU API that
	determines whether a call to synchronize_rcu() is itself
	a grace period.

o	The definition of rcu_blocking_is_gp() for rcuclassic and
	rcutree checks to see if but a single CPU is online.

o	The definition of rcu_blocking_is_gp() for rcupreempt
	checks to see both if but a single CPU is online and if
	the system is still in early boot.

	This allows rcupreempt to again work correctly if running
	on a single CPU after booting is complete.

o	Added check to rcupreempt's synchronize_sched() for there
	being but one online CPU.

Tested all three variants both SMP and !SMP, booted fine, passed a short
rcutorture test on both x86 and Power.

Located-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-26 04:08:14 +01:00
Ingo Molnar
f4abfb8d0d Merge branch 'tip/tracing/ftrace' of ssh://master.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-26 03:48:44 +01:00
Ingo Molnar
e36b1e136a Merge branches 'tracing/ftrace', 'tracing/hw-branch-tracing' and 'linus' into tracing/core 2009-02-26 03:47:27 +01:00
Steven Rostedt
eef62a6826 tracing: rename DEFINE_TRACE_FMT to just TRACE_FORMAT
There's been a bit confusion to whether DEFINE/DECLARE_TRACE_FMT should
be a DEFINE or a DECLARE. Ingo Molnar suggested simply calling it
TRACE_FORMAT.

Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-25 21:44:22 -05:00
Ingo Molnar
39854fe8c1 time: ntp: clean up second_overflow()
Impact: cleanup, no functionality changed

The 'time_adj' local variable is named in a very confusing
way because it almost shadows the 'time_adjust' global
variable - which is used in this same function.

Rename it to 'delta' - to make them stand apart more clearly.

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2545	    114	    144	   2803	    af3	ntp.o.before
   2545	    114	    144	   2803	    af3	ntp.o.after

md5:
   1bf0b3be564512279ba7cee299d1d2be  ntp.o.before.asm
   1bf0b3be564512279ba7cee299d1d2be  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:17 +01:00
Ingo Molnar
069569e025 time: ntp: simplify ntp_tick_adj calculations
Impact: micro-optimization

Convert the (internal) ntp_tick_adj value we store from unscaled
units to scaled units. This is a constant that we never modify,
so scaling it up once during bootup is enough - we dont have to
do it for every adjustment step.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:16 +01:00
Ingo Molnar
2b9d1496e7 time: ntp: make 64-bit constants more robust
Impact: cleanup, no functionality changed

 - make PPM_SCALE an explicit s64 constant, to
   remove (s64) casts from usage sites.

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2536	    114	    136	   2786	    ae2	ntp.o.before
   2536	    114	    136	   2786	    ae2	ntp.o.after

md5:
   40a7728d1188aa18e83e21a81fa7b150  ntp.o.before.asm
   40a7728d1188aa18e83e21a81fa7b150  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:15 +01:00
Ingo Molnar
e96291653b time: ntp: refactor do_adjtimex() some more
Impact: cleanup, no functionality changed

Further simplify do_adjtimex():

 - introduce the ntp_start_leap_timer() helper function
 - eliminate the goto adj_done complication

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:15 +01:00
Ingo Molnar
80f2257116 time: ntp: refactor do_adjtimex()
Impact: cleanup, no functionality changed

do_adjtimex() is currently a monster function with a maze of
branches. Refactor the txc->modes setting aspects of it into
two new helper functions:

	process_adj_status()
	process_adjtimex_modes()

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2512	    114	    136	   2762	    aca	ntp.o.before
   2512	    114	    136	   2762	    aca	ntp.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:14 +01:00
Ingo Molnar
10dd31a7a1 time: ntp: fix bug in ntp_update_offset() & do_adjtimex()
Impact: change (fix) the way the NTP PLL seconds offset is initialized/tracked

Fix a bug and do a micro-optimization:

When PLL is enabled we do not reset time_reftime. If the PLL
was off for a long time (for example after bootup), this is
arguably the wrong thing to do.

We already had a hack for the common boot-time case in
ntp_update_offset(), in form of:

	if (unlikely(time_status & STA_FREQHOLD || time_reftime == 0))
 		secs = 0;

But the update delta should be reset later on too - not just when
the PLL is enabled for the first time after bootup.

So do it on !STA_PLL -> STA_PLL transitions.

This changes behavior, as previously if ntpd was disabled for
a long time and we restarted it, we'd run from that last update,
with a very large delta.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:13 +01:00
Ingo Molnar
c7986acba2 time: ntp: micro-optimize ntp_update_offset()
Impact: cleanup, no functionality changed

The time_reftime update in ntp_update_offset() to xtime.tv_sec
is a convoluted way of saying that we want to freeze the frequency
and want the 'secs' delta to be 0. Also make this branch unlikely.

This shaves off 8 bytes from the code size:

   text	   data	    bss	    dec	    hex	filename
   2504	    114	    136	   2754	    ac2	ntp.o.before
   2496	    114	    136	   2746	    aba	ntp.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:12 +01:00
Ingo Molnar
478b7aab16 time: ntp: simplify ntp_update_offset_fll()
Impact: cleanup, no functionality changed

Change ntp_update_offset_fll() to delta logic instead of
absolute value logic. This eliminates 'freq_adj' from the
function.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:12 +01:00
Ingo Molnar
f939890b66 time: ntp: refactor and clean up ntp_update_offset()
Impact: cleanup, no functionality changed

- introduce the ntp_update_offset_fll() helper
- clean up the flow and variable naming

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2504	    114	    136	   2754	    ac2	ntp.o.before
   2504	    114	    136	   2754	    ac2	ntp.o.after

md5:
   01f7b8e1a5472a3056f9e4ae84d46315  ntp.o.before.asm
   01f7b8e1a5472a3056f9e4ae84d46315  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:11 +01:00
Ingo Molnar
bc26c31d44 time: ntp: refactor up ntp_update_frequency()
Impact: cleanup, no functionality changed

Change ntp_update_frequency() from a hard to follow code
flow that uses global variables as temporaries, to a clean
input+output flow.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:10 +01:00
Ingo Molnar
9ce616aaef time: ntp: clean up ntp_update_frequency()
Impact: cleanup, no functionality changed

Prepare a refactoring of ntp_update_frequency().

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2504	    114	    136	   2754	    ac2	ntp.o.before
   2504	    114	    136	   2754	    ac2	ntp.o.after

md5:
   41f3009debc9b397d7394dd77d912f0a  ntp.o.before.asm
   41f3009debc9b397d7394dd77d912f0a  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:09 +01:00
Ingo Molnar
bbd1267690 time: ntp: simplify the MAX_TICKADJ_SCALED definition
Impact: cleanup, no functionality changed

There's an ugly u64 typecase in the MAX_TICKADJ_SCALED definition,
this can be eliminated by making the MAX_TICKADJ constant's type
64-bit (signed).

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2504	    114	    136	   2754	    ac2	ntp.o.before
   2504	    114	    136	   2754	    ac2	ntp.o.after

md5:
   41f3009debc9b397d7394dd77d912f0a  ntp.o.before.asm
   41f3009debc9b397d7394dd77d912f0a  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:08 +01:00
Ingo Molnar
3c972c2444 time: ntp: simplify the second_overflow() code flow
Impact: cleanup, no functionality changed

Instead of a hierarchy of conditions, transform them to clean
gradual conditions and return's.

This makes the flow easier to read and makes the purpose of
the function easier to understand.

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2552	    170	    168	   2890	    b4a	ntp.o.before
   2552	    170	    168	   2890	    b4a	ntp.o.after

md5:
   eae1275df0b7d6290c13f6f6f8f05c8c  ntp.o.before.asm
   eae1275df0b7d6290c13f6f6f8f05c8c  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:07 +01:00
Ingo Molnar
53bbfa9e94 time: ntp: clean up kernel/time/ntp.c
Impact: cleanup, no functionality changed

Make this file a bit more readable by applying a consistent coding style.

No code changed:

kernel/time/ntp.o:

   text	   data	    bss	    dec	    hex	filename
   2552	    170	    168	   2890	    b4a	ntp.o.before
   2552	    170	    168	   2890	    b4a	ntp.o.after

md5:
   eae1275df0b7d6290c13f6f6f8f05c8c  ntp.o.before.asm
   eae1275df0b7d6290c13f6f6f8f05c8c  ntp.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 18:38:06 +01:00
Ingo Molnar
0b13fda1e0 generic-ipi: cleanups
Andrew pointed out that there's some small amount of
style rot in kernel/smp.c.

Clean it up.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 16:52:50 +01:00
Peter Zijlstra
6e2756376c generic-ipi: remove CSD_FLAG_WAIT
Oleg noticed that we don't strictly need CSD_FLAG_WAIT, rework
the code so that we can use CSD_FLAG_LOCK for both purposes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 14:13:44 +01:00
Peter Zijlstra
8969a5ede0 generic-ipi: remove kmalloc()
Remove the use of kmalloc() from the smp_call_function_*()
calls.

Steven's generic-ipi patch (d7240b98: generic-ipi: use per cpu
data for single cpu ipi calls) started the discussion on the use
of kmalloc() in this code and fixed the
smp_call_function_single(.wait=0) fallback case.

In this patch we complete this by also providing means for the
_many() call, which fully removes the need for kmalloc() in this
code.

The problem with the _many() call is that other cpus might still
be observing our entry when we're done with it. It solved this
by dynamically allocating data elements and RCU-freeing it.

We solve it by using a single per-cpu entry which provides
static storage and solves one half of the problem (avoiding
referencing freed data).

The other half, ensuring the queue iteration it still possible,
is done by placing re-used entries at the head of the list. This
means that if someone was still iterating that entry when it got
moved, he will now re-visit the entries on the list he had
already seen, but avoids skipping over entries like would have
happened had we placed the new entry at the end.

Furthermore, visiting entries twice is not a problem, since we
remove our cpu from the entry's cpumask once its called.

Many thanks to Oleg for his suggestions and him poking holes in
my earlier attempts.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 14:13:43 +01:00
Frederic Weisbecker
d7350c3f45 tracing/core: make the read callbacks reentrants
Now that several per-cpu files can be read or spliced at the
same, we want the read/splice callbacks for tracing files to be
reentrants.

Until now, a single global mutex (trace_types_lock) serialized
the access to tracing_read_pipe(), tracing_splice_read_pipe(),
and the seq helpers.

Ie: it means that if a user tries to read trace_pipe0 and
trace_pipe1 at the same time, the access to the function
tracing_read_pipe() is contended and one reader must wait for
the other to finish its read call.

The trace_type_lock mutex is mostly here to serialize the access
to the global current tracer (current_trace), which can be
changed concurrently. Although the iter struct keeps a private
pointer to this tracer, its callbacks can be changed by another
function.

The method used here is to not keep anymore private reference to
the tracer inside the iterator but to make a copy of it inside
the iterator. Then it checks on subsequents read calls if the
tracer has changed. This is not costly because the current
tracer is not expected to be changed often, so we use a branch
prediction for that.

Moreover, we add a private mutex to the iterator (there is one
iterator per file descriptor) to serialize the accesses in case
of multiple consumers per file descriptor (which would be a
silly idea from the user). Note that this is not to protect the
ring buffer, since the ring buffer already serializes the
readers accesses. This is to prevent from traces weirdness in
case of concurrent consumers. But these mutexes can be dropped
anyway, that would not result in any crash. Just tell me what
you think about it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 13:40:58 +01:00
Frederic Weisbecker
b04cc6b1f6 tracing/core: introduce per cpu tracing files
Impact: split up tracing output per cpu

Currently, on the tracing debugfs directory, three files are
available to the user to let him extracting the trace output:

- trace is an iterator through the ring-buffer. It's a reader
  but not a consumer It doesn't block when no more traces are
  available.

- trace pretty similar to the former, except that it adds more
  informations such as prempt count, irq flag, ...

- trace_pipe is a reader and a consumer, it will also block
  waiting for traces if necessary (heh, yes it's a pipe).

The traces coming from different cpus are curretly mixed up
inside these files. Sometimes it messes up the informations,
sometimes it's useful, depending on what does the tracer
capture.

The tracing_cpumask file is useful to filter the output and
select only the traces captured a custom defined set of cpus.
But still it is not enough powerful to extract at the same time
one trace buffer per cpu.

So this patch creates a new directory: /debug/tracing/per_cpu/.

Inside this directory, you will now find one trace_pipe file and
one trace file per cpu.

Which means if you have two cpus, you will have:

 trace0
 trace1
 trace_pipe0
 trace_pipe1

And of course, reading these files will have the same effect
than with the usual tracing files, except that you will only see
the traces from the given cpu.

The original all-in-one cpu trace file are still available on
their original place.

Until now, only one consumer was allowed on trace_pipe to avoid
racy consuming on the ring-buffer. Now the approach changed a
bit, you can have only one consumer per cpu.

Which means you are allowed to read concurrently trace_pipe0 and
trace_pipe1 But you can't have two readers on trace_pipe0 or
trace_pipe1.

Following the same logic, if there is one reader on the common
trace_pipe, you can not have at the same time another reader on
trace_pipe0 or in trace_pipe1. Because in trace_pipe is already
a consumer in all cpu buffers in essence.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 13:40:58 +01:00
Ingo Molnar
2b1b858f69 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-25 12:50:07 +01:00
Nick Piggin
15d0d3b337 generic IPI: simplify barriers and locking
Simplify the barriers in generic remote function call interrupt
code.

Firstly, just unconditionally take the lock and check the list
in the generic_call_function_single_interrupt IPI handler. As
we've just taken an IPI here, the chances are fairly high that
there will be work on the list for us, so do the locking
unconditionally. This removes the tricky lockless list_empty
check and dubious barriers. The change looks bigger than it is
because it is just removing an outer loop.

Secondly, clarify architecture specific IPI locking rules.
Generic code has no tools to impose any sane ordering on IPIs if
they go outside normal cache coherency, ergo the arch code must
make them appear to obey cache coherency as a "memory operation"
to initiate an IPI, and a "memory operation" to receive one.
This way at least they can be reasoned about in generic code,
and smp_mb used to provide ordering.

The combination of these two changes means that explict barriers
can be taken out of queue handling for the single case -- shared
data is explicitly locked, and ipi ordering must conform to
that, so no barriers needed. An extra barrier is needed in the
many handler, so as to ensure we load the list element after the
IPI is received.

Does any architecture actually *need* these barriers? For the
initiator I could see it, but for the handler I would be
surprised. So the other thing we could do for simplicity is just
to require that, rather than just matching with cache coherency,
we just require a full barrier before generating an IPI, and
after receiving an IPI. In which case, the smp_mb()s can go
away. But just for now, we'll be on the safe side and use the
barriers (they're in the slow case anyway).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-arch@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 12:27:08 +01:00
Ingo Molnar
886b5b73d7 tracing: remove /debug/tracing/latency_trace
Impact: remove old debug/tracing API

/debug/tracing/latency_trace is an old legacy format we kept from
the old latency tracer. Remove the file for now. If there's any
useful bit missing then we'll propagate any useful output bits into
the /debug/tracing/trace output.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 11:05:34 +01:00
Ingo Molnar
2d542cf342 tracing/hw-branch-tracing: convert bts-tracer mutex to a spinlock
Impact: fix CPU hotplug lockup

bts_hotcpu_handler() is called with irqs disabled, so using mutex_lock()
is a no-no.

All the BTS codepaths here are atomic (they do not schedule), so using
a spinlock is the right solution.

Cc: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-25 09:16:01 +01:00
Steven Rostedt
1473e4417c tracing: make event directory structure
This patch adds the directory /debug/tracing/events/ that will contain
all the registered trace points.

 # ls /debug/tracing/events/
sched_kthread_stop      sched_process_fork  sched_switch
sched_kthread_stop_ret  sched_process_free  sched_wait_task
sched_migrate_task      sched_process_wait  sched_wakeup
sched_process_exit      sched_signal_send   sched_wakeup_new

 # ls /debug/tracing/events/sched_switch/
enable

 # cat /debug/tracing/events/sched_switch/enable
1

 # cat /debug/tracing/set_event
sched_switch

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-24 21:54:08 -05:00
Steven Rostedt
f3fe8e4a38 tracing: add schedule events to event trace
This patch changes the trace/sched.h to use the DECLARE_TRACE_FMT
such that they are automatically registered with the event tracer.

And it also adds the tracing sched headers to kernel/trace/events.c

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-24 21:54:07 -05:00
Steven Rostedt
b77e38aa24 tracing: add event trace infrastructure
This patch creates the event tracing infrastructure of ftrace.
It will create the files:

 /debug/tracing/available_events
 /debug/tracing/set_event

The available_events will list the trace points that have been
registered with the event tracer.

set_events will allow the user to enable or disable an event hook.

example:

 # echo sched_wakeup > /debug/tracing/set_event

Will enable the sched_wakeup event (if it is registered).

 # echo "!sched_wakeup" >> /debug/tracing/set_event

Will disable the sched_wakeup event (and only that event).

 # echo > /debug/tracing/set_event

Will disable all events (notice the '>')

 # cat /debug/tracing/available_events > /debug/tracing/set_event

Will enable all registered event hooks.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-24 21:54:05 -05:00
Ingo Molnar
0edcf8d692 Merge branch 'tj-percpu' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/percpu
Conflicts:
	arch/x86/include/asm/pgtable.h
2009-02-24 21:52:45 +01:00
Markus Metzger
5e01cb695d x86, ftrace: fix section mismatch in hw-branch-tracer
Fix an invalid memory reference problem when cpu hotplug support is
disabled and the hw-branch-tracer is set as current tracer.

Initializing the tracer calls bts_trace_init() which has already
been freed at this time.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-24 18:23:34 +01:00
Ingo Molnar
a7f4463e03 Merge branch 'tracing/ftrace'; commit 'v2.6.29-rc6' into tracing/core 2009-02-24 18:22:39 +01:00
David S. Miller
e70049b9e7 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2009-02-24 03:50:29 -08:00
Benjamin Herrenschmidt
35f88e6b06 Merge commit 'ftrace/function-graph' into next 2009-02-23 10:47:23 +11:00
Ingo Molnar
fc6fc7f1b1 Merge branch 'linus' into x86/apic
Conflicts:
	arch/x86/mach-default/setup.c

Semantic conflict resolution:
	arch/x86/kernel/setup.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-22 20:05:19 +01:00
Rafael J. Wysocki
770824bdc4 PM: Split up sysdev_[suspend|resume] from device_power_[down|up]
Move the sysdev_suspend/resume from the callee to the callers, with
no real change in semantics, so that we can rework the disabling of
interrupts during suspend/hibernation.

This is based on an earlier patch from Linus.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-22 10:33:44 -08:00
Ingo Molnar
c478f87869 Merge branch 'tip/x86/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace
Conflicts:
	include/linux/ftrace.h
	kernel/trace/ftrace.c
2009-02-22 18:12:01 +01:00
Linus Torvalds
adfafefd10 Merge branch 'hibernate'
* hibernate:
  PM: Fix suspend_console and resume_console to use only one semaphore
  PM: Wait for console in resume
  PM: Fix pm_notifiers during user mode hibernation
  swsusp: clean up shrink_all_zones()
  swsusp: dont fiddle with swappiness
  PM: fix build for CONFIG_PM unset
  PM/hibernate: fix "swap breaks after hibernation failures"
  PM/resume: wait for device probing to finish
  Consolidate driver_probe_done() loops into one place
2009-02-21 14:17:26 -08:00
Arve Hjønnevåg
403f307576 PM: Fix suspend_console and resume_console to use only one semaphore
This fixes a race where a thread acquires the console while the
console is suspended, and the console is resumed before this
thread releases it. In this case, the secondary console
semaphore would be left locked, and the primary semaphore would
be released twice. This in turn would cause the console switch
on suspend or resume to hang forever.

Note that suspend_console does not actually lock the console
for clients that use acquire_console_sem, it only locks it for
clients that use try_acquire_console_sem. If we change
suspend_console to fully lock the console, then the kernel
may deadlock on suspend. One client of try_acquire_console_sem
is acquire_console_semaphore_for_printk, which uses it to
prevent printk from using the console while it is suspended.

Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21 14:17:18 -08:00
Arve Hjønnevåg
b090f9fa53 PM: Wait for console in resume
Avoids later waking up to a blinking cursor if the device woke up and
returned to sleep before the console switch happened.

Signed-off-by: Brian Swetland <swetland@google.com>
Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21 14:17:17 -08:00
Andrey Borzenkov
ebae2604f2 PM: Fix pm_notifiers during user mode hibernation
Snapshot device is opened with O_RDONLY during suspend and O_WRONLY durig
resume.  Make sure we also call notifiers with correct parameter telling
them what we are really doing.

Signed-off-by: Andrey Borzenkov <arvidjaar@mail.ru>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21 14:17:17 -08:00
Rafael J. Wysocki
09664fda48 PM: fix build for CONFIG_PM unset
Compilation of kprobes.c with CONFIG_PM unset is broken due to some broken
config dependncies.  Fix that.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21 14:17:17 -08:00
Arjan van de Ven
eed3ee0829 PM/resume: wait for device probing to finish
the resume code does not currently wait for device probing to finish.
Even without async function calls this is dicey and not correct,
but with async function calls during the boot sequence this is going
to get hit more...

This patch adds the synchronization using the newly introduced helper.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Acked-by: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21 14:17:17 -08:00
Steven Rostedt
4377245aa9 ftrace: break out modify loop immediately on detection of error
Impact: added precaution on failure detection

Break out of the modifying loop as soon as a failure is detected.
This is just an added precaution found by code review and was not
found by any bug chasing.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-20 14:30:20 -05:00
Steven Rostedt
000ab69117 ftrace: allow archs to preform pre and post process for code modification
This patch creates the weak functions: ftrace_arch_code_modify_prepare
and ftrace_arch_code_modify_post_process that are called before and
after the stop machine is called to modify the kernel text.

If the arch needs to do pre or post processing, it only needs to define
these functions.

[ Update: Ingo Molnar suggested using the name ftrace_arch_code_modify_*
          over using ftrace_arch_modify_* ]

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-20 13:16:18 -05:00
Ingo Molnar
3b6f7b9beb Merge branch 'x86/urgent' into x86/core 2009-02-20 17:40:43 +01:00
Frederic Weisbecker
f9349a8f97 tracing/function-graph-tracer: make set_graph_function file support ftrace regex
Impact: trace only functions matching a pattern

The set_graph_function file let one to trace only one or several
chosen functions and follow all their code flow.

Currently, only a constant function name is allowed so this patch
allows the ftrace_regex functions:

- matches all functions that end with "name":
  echo *name > set_graph_function

- matches all functions that begin with "name":
  echo name* > set_graph_function

- matches all functions that contains "name":
  echo *name* > set_graph_function

Example:

echo mutex* > set_graph_function

 0)               |  mutex_lock_nested() {
 0)   0.563 us    |    __might_sleep();
 0)   2.072 us    |  }
 0)               |  mutex_unlock() {
 0)   1.036 us    |    __mutex_unlock_slowpath();
 0)   2.433 us    |  }
 0)               |  mutex_unlock() {
 0)   0.691 us    |    __mutex_unlock_slowpath();
 0)   1.787 us    |  }
 0)               |  mutex_lock_interruptible_nested() {
 0)   0.548 us    |    __might_sleep();
 0)   1.945 us    |  }

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-20 11:36:24 +01:00
Tejun Heo
fbf59bc9d7 percpu: implement new dynamic percpu allocator
Impact: new scalable dynamic percpu allocator which allows dynamic
        percpu areas to be accessed the same way as static ones

Implement scalable dynamic percpu allocator which can be used for both
static and dynamic percpu areas.  This will allow static and dynamic
areas to share faster direct access methods.  This feature is optional
and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
arch.  Please read comment on top of mm/percpu.c for details.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
2009-02-20 16:29:08 +09:00
Rusty Russell
b36128c830 alloc_percpu: change percpu_ptr to per_cpu_ptr
Impact: cleanup

There are two allocated per-cpu accessor macros with almost identical
spelling.  The original and far more popular is per_cpu_ptr (44
files), so change over the other 4 files.

tj: kill percpu_ptr() and update UP too

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: mingo@redhat.com
Cc: lenb@kernel.org
Cc: cpufreq@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20 16:29:08 +09:00
Tejun Heo
6b588c18f8 module: reorder module pcpu related functions
Impact: cleanup

Move percpu_modinit() upwards.  This is to ease further changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20 16:29:07 +09:00
Linus Torvalds
f54b2fe4ae Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing: limit the number of loops the ring buffer self test can make
  tracing: have function trace select kallsyms
  tracing: disable tracing while testing ring buffer
  tracing/function-graph-tracer: trace the idle tasks
2009-02-19 09:14:22 -08:00
Ingo Molnar
00a8bf8593 tracing/function-graph-tracer: fix merge
Merge artifact: pid got changed to ent->pid meanwhile.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-19 13:01:37 +01:00
Frederic Weisbecker
d1f9cbd788 tracing/function-graph-tracer: fix traces weirdness while absolute time printing
Impact: trace output cleanup/reordering

When an interrupt occurs and and the abstime option is selected:

  echo funcgraph-abstime > /debug/tracing/trace_options

then we observe broken traces:

30581.025422 |   0)   Xorg-4291    |   0.503 us    |      idle_cpu();
30581.025424 |   0)   Xorg-4291    |   2.576 us    |    }
30581.025424 |   0)   Xorg-4291    | + 75.771 us   |  }
 0)   Xorg-4291    |   <========== |
30581.025425 |   0)   Xorg-4291    |               |  schedule() {
30581.025426 |   0)   Xorg-4291    |               |    __schedule() {
30581.025426 |   0)   Xorg-4291    |   0.705 us    |      _spin_lock_irq();

With this patch, the interrupts output better adapts
to absolute time printing:

  414.856543 |   1)   Xorg-4279    |   8.816 us    |                        }
  414.856544 |   1)   Xorg-4279    |   0.525 us    |                        rcu_irq_exit();
  414.856545 |   1)   Xorg-4279    |   0.526 us    |                        idle_cpu();
  414.856546 |   1)   Xorg-4279    | + 12.157 us   |                      }
  414.856549 |   1)   Xorg-4279    | ! 104.114 us  |                    }
  414.856549 |   1)   Xorg-4279    |   <========== |
  414.856549 |   1)   Xorg-4279    | ! 107.944 us  |                  }
  414.856550 |   1)   Xorg-4279    | ! 137.010 us  |                }
  414.856551 |   1)   Xorg-4279    |   0.624 us    |                _read_unlock();
  414.856552 |   1)   Xorg-4279    | ! 140.930 us  |              }
  414.856552 |   1)   Xorg-4279    | ! 166.159 us  |            }

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-19 12:33:21 +01:00
Ingo Molnar
4cd0332db7 Merge branch 'mainline/function-graph' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/function-graph-tracer 2009-02-19 12:13:33 +01:00
Ingo Molnar
40999096e8 Merge branches 'tracing/blktrace', 'tracing/ftrace' and 'tracing/urgent' into tracing/core 2009-02-19 10:20:17 +01:00
john stultz
fdcedf7b75 time: apply NTP frequency/tick changes immediately
Since the GENERIC_TIME changes landed, the adjtimex behavior changed
for struct timex.tick and .freq changed. When the tick or freq value
is set, we adjust the tick_length_base in ntp_update_frequency().
However, this new value doesn't get applied to tick_length until the
next second (via second_overflow).

This means some applications that do quick time tweaking do not see the
requested change made as quickly as expected.

I've run a few tests with this change, and ntpd still functions fine.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-19 10:10:08 +01:00
Ingo Molnar
72c26c9a26 Merge branch 'linus' into tracing/blktrace
Conflicts:
	block/blktrace.c

Semantic merge:
	kernel/trace/blktrace.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-19 09:00:35 +01:00
Steven Rostedt
4b3e3d2284 tracing: limit the number of loops the ring buffer self test can make
Impact: prevent deadlock if ring buffer gets corrupted

This patch adds a paranoid check to make sure the ring buffer consumer
does not go into an infinite loop. Since the ring buffer has been set
to read only, the consumer should not loop for more than the ring buffer
size. A check is added to make sure the consumer does not loop more than
the ring buffer size.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-18 22:50:01 -05:00
Steven Rostedt
4d7a077c0c tracing: have function trace select kallsyms
Impact: fix output of function tracer to be useful

The function tracer is pretty useless if KALLSYMS is not configured.
Unless you are good at reading hex values, the function tracer should
select the KALLSYMS configuration.

Also, the dynamic function tracer will fail its self test if KALLSYMS
is not selected.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-18 22:06:18 -05:00
Steven Rostedt
0c5119c1e6 tracing: disable tracing while testing ring buffer
Impact: fix to prevent hard lockup on self tests

If one of the tracers are broken and is constantly filling the ring
buffer while the test of the ring buffer is running, it will hang
the box. The reason is that the test is a consumer that will not
stop till the ring buffer is empty. But if the tracer is broken and
is constantly producing input to the buffer, this test will never
end. The result is a lockup of the box.

This happened when KALLSYMS was not defined and the dynamic ftrace
test constantly filled the ring buffer, because the filter failed
and all functions were being traced. Something was being called
that constantly filled the buffer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-18 22:04:01 -05:00
Linus Torvalds
ba95fd47d1 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
  block: fix booting from partitioned md array
  block: revert part of 18ce3751cc
  cciss: PCI power management reset for kexec
  paride/pg.c: xs(): &&/|| confusion
  fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
  block: fix bad definition of BIO_RW_SYNC
  bsg: Fix sense buffer bug in SG_IO
2009-02-18 18:33:04 -08:00
Rafael J. Wysocki
42f5e039c3 pm: fix build for CONFIG_PM unset
Compilation of kprobes.c with CONFIG_PM unset is broken due to some broken
config dependncies.  Fix that.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:54 -08:00
Li Zefan
67e055d144 cgroups: fix possible use after free
In cgroup_kill_sb(), root is freed before sb is detached from the list, so
another sget() may find this sb and call cgroup_test_super(), which will
access the root that has been freed.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:54 -08:00
Steven Rostedt
712406a6bf tracing/function-graph-tracer: make arch generic push pop functions
There is nothing really arch specific of the push and pop functions
used by the function graph tracer. This patch moves them to generic
code.

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-18 13:43:04 -05:00
Ingo Molnar
74019224ac timers: add mod_timer_pending()
Impact: new timer API

Based on an idea from Martin Josefsson with the help of
Patrick McHardy and Stephen Hemminger:

introduce the mod_timer_pending() API which is a mod_timer()
offspring that is an invariant on already removed timers.

(regular mod_timer() re-activates non-pending timers.)

This is useful for the networking code in that it can
allow unserialized mod_timer_pending() timer-forwarding
calls, but a single del_timer*() will stop the timer
from being reactivated again.

Also while at it:

- optimize the regular mod_timer() path some more, the
  timer-stat and a debug check was needlessly duplicated
  in __mod_timer().

- make the exports come straight after the function, as
  most other exports in timer.c already did.

- eliminate __mod_timer() as an external API, change the
  users to mod_timer().

The regular mod_timer() code path is not impacted
significantly, due to inlining optimizations and due to
the simplifications.

Based-on-patch-from: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-18 19:26:33 +01:00
Jens Axboe
93dbb39350 block: fix bad definition of BIO_RW_SYNC
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:00 +01:00
Frederic Weisbecker
fa7c7f6e11 tracing/core: remove unused parameter in tracing_fill_pipe_page()
Impact: cleanup

The struct page *pages parameter is unused.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-18 01:40:20 +01:00
Frederic Weisbecker
6eaaa5d57e tracing/core: use appropriate waiting on trace_pipe
Impact: api and pipe waiting change

Currently, the waiting used in tracing_read_pipe() is done through a
100 msecs schedule_timeout() loop which periodically check if there
are traces on the buffer.

This can cause small latencies for programs which are reading the incoming
events.

This patch makes the reader waiting for the trace_wait waitqueue except
for few tracers such as the sched and functions tracers which might be
already hold the runqueue lock while waking up the reader.

This is performed through a new callback wait_pipe() on struct tracer.
If none is implemented on a specific tracer, the default waiting for
trace_wait queue is attached.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-18 01:40:20 +01:00
Ingo Molnar
ac07bcaa82 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-18 01:09:07 +01:00
Ingo Molnar
37bd824a35 Merge branches 'tracing/ftrace' and 'tracing/urgent' into tracing/core 2009-02-18 01:08:13 +01:00
Linus Torvalds
e78ac4b9de Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: cpu hotplug fix
2009-02-17 14:30:06 -08:00
Linus Torvalds
29a0c5ce5d Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  timers: more consistently use clock vs timer
2009-02-17 14:29:42 -08:00
Linus Torvalds
f8effd1a4a Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  doc: mmiotrace.txt, buffer size control change
  trace: mmiotrace to the tracer menu in Kconfig
  mmiotrace: count events lost due to not recording
2009-02-17 14:29:15 -08:00
Linus Torvalds
8ce9a75a30 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  iommu: fix Intel IOMMU write-buffer flushing
  futex: fix reference leak

Trivial conflicts fixed manually in drivers/pci/intel-iommu.c
2009-02-17 14:26:35 -08:00
Ingo Molnar
f17c75453b irq: name 'p' variables a bit better
'p' stands for pointer - make it clear in setup_irq() and free_irq()
what kind of pointer it is.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-17 20:44:47 +01:00
Ingo Molnar
8316e38100 irq: further clean up the free_irq() code flow
Linus noticed that the 'pp' variable can be eliminated
altogether, and the loop can be cleaned up further.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-17 20:28:29 +01:00
Frederic Weisbecker
5b058bcde9 tracing/function-graph-tracer: trace the idle tasks
When the function graph tracer is activated, it iterates over the task_list
to allocate a stack to store the return addresses.

But the per cpu idle tasks are not iterated by using
do_each_thread / while_each_thread.

So we have to iterate on them manually.

This fixes somes weirdness in the traces and many losses of traces.
Examples on two cpus:

 0)   Xorg-4287    |   2.906 us    |              }
 0)   Xorg-4287    |   3.965 us    |            }
 0)   Xorg-4287    |   5.302 us    |          }
 ------------------------------------------
 0)   Xorg-4287    =>    <idle>-0
 ------------------------------------------

 0)    <idle>-0    |   2.861 us    |                        }
 0)    <idle>-0    |   0.526 us    |                        set_normalized_timespec();
 0)    <idle>-0    |   7.201 us    |                      }
 0)    <idle>-0    |   8.214 us    |                    }
 0)    <idle>-0    |               |                    clockevents_program_event() {
 0)    <idle>-0    |               |                      lapic_next_event() {
 0)    <idle>-0    |   0.510 us    |                        native_apic_mem_write();
 0)    <idle>-0    |   1.546 us    |                      }
 0)    <idle>-0    |   2.583 us    |                    }
 0)    <idle>-0    | + 12.435 us   |                  }
 0)    <idle>-0    | + 13.470 us   |                }
 0)    <idle>-0    |   0.608 us    |                _spin_unlock_irqrestore();
 0)    <idle>-0    | + 23.270 us   |              }
 0)    <idle>-0    | + 24.336 us   |            }
 0)    <idle>-0    | + 25.417 us   |          }
 0)    <idle>-0    |   0.593 us    |          _spin_unlock();
 0)    <idle>-0    | + 41.869 us   |        }
 0)    <idle>-0    | + 42.906 us   |      }
 0)    <idle>-0    | + 95.035 us   |    }
 0)    <idle>-0    |   0.540 us    |    menu_reflect();
 0)    <idle>-0    | ! 100.404 us  |  }
 0)    <idle>-0    |   0.564 us    |  mce_idle_callback();
 0)    <idle>-0    |               |  enter_idle() {
 0)    <idle>-0    |   0.526 us    |    mce_idle_callback();
 0)    <idle>-0    |   1.757 us    |  }
 0)    <idle>-0    |               |  cpuidle_idle_call() {
 0)    <idle>-0    |               |    menu_select() {
 0)    <idle>-0    |   0.525 us    |      pm_qos_requirement();
 0)    <idle>-0    |   0.518 us    |      tick_nohz_get_sleep_length();
 0)    <idle>-0    |   2.621 us    |    }
[...]
 1)    <idle>-0    |   0.518 us    |              touch_softlockup_watchdog();
 1)    <idle>-0    | + 14.355 us   |            }
 1)    <idle>-0    | + 22.840 us   |          }
 1)    <idle>-0    | + 25.949 us   |        }
 1)    <idle>-0    |               |        handle_irq() {
 1)    <idle>-0    |   0.511 us    |          irq_to_desc();
 1)    <idle>-0    |               |          handle_edge_irq() {
 1)    <idle>-0    |   0.638 us    |            _spin_lock();
 1)    <idle>-0    |               |            ack_apic_edge() {
 1)    <idle>-0    |   0.510 us    |              irq_to_desc();
 1)    <idle>-0    |               |              move_native_irq() {
 1)    <idle>-0    |   0.510 us    |                irq_to_desc();
 1)    <idle>-0    |   1.532 us    |              }
 1)    <idle>-0    |   0.511 us    |              native_apic_mem_write();
 ------------------------------------------
 1)    <idle>-0    =>    cat-5073
 ------------------------------------------

 1)    cat-5073    |   3.731 us    |                    }
 1)    cat-5073    |               |                    run_local_timers() {
 1)    cat-5073    |   0.533 us    |                      hrtimer_run_queues();
 1)    cat-5073    |               |                      raise_softirq() {
 1)    cat-5073    |               |                        __raise_softirq_irqoff() {
 1)    cat-5073    |               |                          /* nr: 1 */
 1)    cat-5073    |   2.718 us    |                        }
 1)    cat-5073    |   3.814 us    |                      }

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-17 19:20:17 +01:00
Steven Rostedt
35ebf1caa4 ftrace: show unlimited when traceon or traceoff has no counter
Impact: clean up

The traceon and traceoff function probes are confusing to developers
to what happens when a counter is not specified. This should help
clear things up.

 # echo "*:traceoff" > set_ftrace_filter
 # cat /debug/tracing/set_ftrace_filter

  #### all functions enabled ####
  do_fork:traceoff:unlimited

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-17 13:12:12 -05:00
Wenji Huang
73d8b8bc4f tracing: fix typing mistake in hint message and comments
Impact: cleanup

Fix incorrect hint message in code and typos in comments.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-17 12:38:24 -05:00
Wenji Huang
d2ef7c2f0f tracing: fix the return value of trace selftest
This patch is to fix the return value of trace_selftest_startup_sysprof
and trace_selftest_startup_branch on failure.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-17 12:38:24 -05:00
Wenji Huang
af51309845 tracing: use the more proper parameter
Pass tsk to tracing_record_cmdline instead of current.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-17 12:38:23 -05:00
Steven Rostedt
b6887d7916 ftrace: rename _hook to _probe
Impact: clean up

Ingo Molnar did not like the _hook naming convention used by the
select function tracer. Luis Claudio R. Goncalves suggested using
the "_probe" extension. This patch implements the change of
calling the functions and variables "_hook" and replacing them
with "_probe".

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-17 12:32:04 -05:00
Steven Rostedt
6a24a244cd ftrace: clean up coding style
Ingo Molnar pointed out some coding style issues with the recent ftrace
updates. This patch cleans them up.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-17 11:20:26 -05:00
Ingo Molnar
494df596f9 Merge branches 'x86/acpi', 'x86/apic', 'x86/cpudetect', 'x86/headers', 'x86/paravirt', 'x86/urgent' and 'x86/xen'; commit 'v2.6.29-rc5' into x86/core 2009-02-17 12:07:00 +01:00
Ingo Molnar
73d3fd96e7 ftrace: fix !CONFIG_DYNAMIC_FTRACE ftrace_swapper_pid definition
Impact: build fix

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-17 11:50:42 +01:00
Ingo Molnar
f492d3f838 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-17 11:31:01 +01:00
Ingo Molnar
c4e2b432d5 Merge branches 'tracing/hw-branch-tracing' and 'tracing/power-tracer' into tracing/core 2009-02-17 11:29:53 +01:00
Steven Rostedt
e110e3d1ea ftrace: add pretty print function for traceon and traceoff hooks
This patch adds a pretty print version of traceon and traceoff
output for set_ftrace_filter.

  # echo 'sys_open:traceon:4' > set_ftrace_filter
  # cat set_ftrace_filter

 #### all functions enabled ####
 sys_open:traceon:count=4

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 23:38:13 -05:00
Steven Rostedt
809dcf29ce ftrace: add pretty print to selected fuction traces
This patch adds a call back for the tracers that have hooks to
selected functions. This allows the tracer to show better output
in the set_ftrace_filter file.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 23:06:01 -05:00
Steven Rostedt
8fc0c701c5 ftrace: show selected functions in set_ftrace_filter
This patch adds output to show what functions have tracer hooks
attached to them.

  # echo 'sys_open:traceon:4' > /debug/tracing/set_ftrace_filter
  # cat set_ftrace_filter

 #### all functions enabled ####
 sys_open:ftrace_traceon:0000000000000004

  # echo 'do_fork:traceoff:' > set_ftrace_filter
  # cat set_ftrace_filter

 #### all functions enabled ####
 sys_open:ftrace_traceon:0000000000000002
 do_fork:ftrace_traceoff:ffffffffffffffff

Note the 4 changed to a 2. This is because The code was executed twice
since the traceoff was added. If a cat is done again:

 #### all functions enabled ####
 sys_open:ftrace_traceon
 do_fork:ftrace_traceoff:ffffffffffffffff

The number disappears. That is because it will not print a NULL.

Callbacks to allow the tracer to pretty print will be implemented soon.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 22:50:06 -05:00
Steven Rostedt
23b4ff3aa4 ftrace: add traceon traceoff commands to enable/disable the buffers
This patch adds the new function selection commands traceon and
traceoff. traceon sets the function to enable the ring buffers
while traceoff disables the ring buffers.  You can pass in the
number of times you want the command to be executed when the function
is hit. It will only execute if the state of the buffers are not
already in that state.

Example:

 # echo do_fork:traceon:4

Will enable the ring buffers if they are disabled every time it
hits do_fork, up to 4 times.

 # echo sys_close:traceoff

This will disable the ring buffers every time (unlimited) when
sys_close is called.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 22:50:04 -05:00
Steven Rostedt
988ae9d6b2 ring-buffer: add tracing_is_on to test if ring buffer is enabled
This patch adds the tracing_is_on() interface to tell if the ring
buffer is turned on or not.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 22:50:01 -05:00
Steven Rostedt
59df055f19 ftrace: trace different functions with a different tracer
Impact: new feature

Currently, the function tracer only gives you an ability to hook
a tracer to all functions being traced. The dynamic function trace
allows you to pick and choose which of those functions will be
traced, but all functions being traced will call all tracers that
registered with the function tracer.

This patch adds a new feature that allows a tracer to hook to specific
functions, even when all functions are being traced. It allows for
different functions to call different tracer hooks.

The way this is accomplished is by a special function that will hook
to the function tracer and will set up a hash table knowing which
tracer hook to call with which function. This is the most general
and easiest method to accomplish this. Later, an arch may choose
to supply their own method in changing the mcount call of a function
to call a different tracer. But that will be an exercise for the
future.

To register a function:

 struct ftrace_hook_ops {
	void			(*func)(unsigned long ip,
					unsigned long parent_ip,
					void **data);
	int			(*callback)(unsigned long ip, void **data);
	void			(*free)(void **data);
 };

 int register_ftrace_function_hook(char *glob, struct ftrace_hook_ops *ops,
				  void *data);

glob is a simple glob to search for the functions to hook.
ops is a pointer to the operations (listed below)
data is the default data to be passed to the hook functions when traced

ops:
 func is the hook function to call when the functions are traced
 callback is a callback function that is called when setting up the hash.
   That is, if the tracer needs to do something special for each
   function, that is being traced, and wants to give each function
   its own data. The address of the entry data is passed to this
   callback, so that the callback may wish to update the entry to
   whatever it would like.
 free is a callback for when the entry is freed. In case the tracer
   allocated any data, it is give the chance to free it.

To unregister we have three functions:

  void
  unregister_ftrace_function_hook(char *glob, struct ftrace_hook_ops *ops,
				void *data)

This will unregister all hooks that match glob, point to ops, and
have its data matching data. (note, if glob is NULL, blank or '*',
all functions will be tested).

  void
  unregister_ftrace_function_hook_func(char *glob,
				 struct ftrace_hook_ops *ops)

This will unregister all functions matching glob that has an entry
pointing to ops.

  void unregister_ftrace_function_hook_all(char *glob)

This simply unregisters all funcs.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 22:44:09 -05:00
Steven Rostedt
e6ea44e9b4 ftrace: consolidate mutexes
Impact: clean up

Now that ftrace_lock is a mutex, there is no reason to have three
different mutexes protecting similar data. All the mutex paths
are not in hot paths, so having a mutex to cover more data is
not a problem.

This patch removes the ftrace_sysctl_lock and ftrace_start_lock
and uses the ftrace_lock to protect the locations that were protected
by these locks. By doing so, this change also removes some of
the lock nesting that was taking place.

There are still more mutexes in ftrace.c that can probably be
consolidated, but they can be dealt with later. We need to be careful
about the way the locks are nested, and by consolidating, we can cause
a recursive deadlock.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 18:15:31 -05:00
Steven Rostedt
52baf11922 ftrace: convert ftrace_lock from a spinlock to mutex
Impact: clean up

The older versions of ftrace required doing the ftrace list
search under atomic context. Now all the calls are in non-atomic
context. There is no reason to keep the ftrace_lock as a spinlock.

This patch converts it to a mutex.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 17:33:14 -05:00
Steven Rostedt
f6180773d9 ftrace: add command interface for function selection
Allow for other tracers to add their own commands for function
selection. This interface gives a trace the ability to name a
command for function selection. Right now it is pretty limited
in what it offers, but this is a building step for more features.

The :mod: command is converted to this interface and also serves
as a template for other implementations.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 17:06:02 -05:00
Steven Rostedt
e68746a271 ftrace: enable filtering only when a function is filtered on
Impact: fix to prevent empty set_ftrace_filter and no ftrace output

The function filter is used to only trace a given set of functions.
The filter is enabled when a function name is echoed into the
set_ftrace_filter file. But if the name has a typo and the function
is not found, the filter is enabled, but no function is listed.

This makes a confusing situation where set_ftrace_filter is empty
but no functions ever get enabled for tracing.

For example:

 # cat /debug/tracing/set_ftrace_filter

  #### all functions enabled ####

 # echo bad_name > set_ftrace_filter
 # cat /debug/tracing/set_ftrace_filter

 # echo function > current_tracer
 # cat trace

  # tracer: nop
  #
  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
  #              | |       |          |         |

This patch changes that to only enable filtering if a function
is set to be filtered on. Now, the filter is not enabled if
a bad name is echoed into set_ftrace_filter.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 17:03:49 -05:00
Steven Rostedt
64e7c44061 ftrace: add module command function filter selection
This patch adds a "command" syntax to the function filtering files:

  /debugfs/tracing/set_ftrace_filter
  /debugfs/tracing/set_ftrace_notrace

Of the format:  <function>:<command>:<parameter>

The command is optional, and dependent on the command, so are
the parameters.

 echo do_fork > set_ftrace_filter

Will only trace 'do_fork'.

 echo 'sched_*' > set_ftrace_filter

Will only trace functions starting with the letters 'sched_'.

 echo '*:mod:ext3' > set_ftrace_filter

Will trace only the ext3 module functions.

 echo '*write*:mod:ext3' > set_ftrace_notrace

Will prevent the ext3 functions with the letters 'write' in
the name from being traced.

 echo '!*_allocate:mod:ext3' > set_ftrace_filter

Will remove the functions in ext3 that end with the letters
'_allocate' from the ftrace filter.

Although this patch implements the 'command' format, only the
'mod' command is supported. More commands to follow.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 16:55:50 -05:00
Steven Rostedt
9f4801e30a ftrace: break up ftrace_match_records into smaller components
Impact: clean up

ftrace_match_records does a lot of things that other features
can use. This patch breaks up ftrace_match_records and pulls
out ftrace_setup_glob and ftrace_match_record.

ftrace_setup_glob prepares a simple glob expression for use with
ftrace_match_record. ftrace_match_record compares a single record
with a glob type.

Breaking this up will allow for more features to run on individual
records.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 16:49:57 -05:00
Steven Rostedt
7f24b31b01 ftrace: rename ftrace_match to ftrace_match_records
Impact: clean up

ftrace_match is too generic of a name. What it really does is
search all records and matches the records with the given string,
and either sets or unsets the functions to be traced depending
on if the parameter 'enable' is set or not.

This allows us to make another function called ftrace_match that
can be used to test a single record.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 16:33:15 -05:00
Steven Rostedt
265c831cb0 ftrace: add do_for_each_ftrace_rec and while_for_each_ftrace_rec
Impact: clean up

To iterate over all the functions that dynamic trace knows about
it requires two for loops. One to iterate over the pages and the
other to iterate over the records within the page.

There are several duplications of these loops in ftrace.c. This
patch creates the macros do_for_each_ftrace_rec and
while_for_each_ftrace_rec to handle this logic, and removes the
duplicate code.

While making this change, I also discovered and fixed a small
bug that one of the iterations should exit the loop after it found the
record it was searching for. This used a break when it should have
used a goto, since there were two loops it needed to break out
from.  No real harm was done by this bug since it would only continue
to search the other records, and the code was in a slow path anyway.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 16:25:12 -05:00
Steven Rostedt
0c75a3ed63 ftrace: state that all functions are enabled in set_ftrace_filter
Impact: clean up, make set_ftrace_filter less confusing

The set_ftrace_filter shows only the functions that will be traced.
But when it is empty, it will trace all functions. This can be a bit
confusing.

This patch makes set_ftrace_filter show:

  #### all functions enabled ####

When all functions will be traced, and we do not filter only a select
few.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-16 16:21:35 -05:00
Américo Wang
2b8f836fb1 sched: use TASK_NICE for task_struct
#define TASK_NICE(p)            PRIO_TO_NICE((p)->static_prio)

So it's better to use TASK_NICE here.

Signed-off-by: WANG Cong <wangcong@zeuux.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-16 13:45:52 +01:00
Patrick Ohly
a75244c3d5 timecompare: generic infrastructure to map between two time bases
Mapping from a struct timecounter to a time returned by functions like
ktime_get_real() is implemented. This is sufficient to use this code
in a network device driver which wants to support hardware time
stamping and transformation of hardware time stamps to system time.

The interface could have been made more versatile by not depending on
a time counter, but this wasn't done to avoid writing glue code
elsewhere.

The method implemented here is the one used and analyzed under the name
"assisted PTP" in the LCI PTP paper:
http://www.linuxclustersinstitute.org/conferences/archive/2008/PDF/Ohly_92221.pdf

Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:32 -08:00
Patrick Ohly
a038a353c3 clocksource: allow usage independent of timekeeping.c
So far struct clocksource acted as the interface between time/timekeeping.c
and hardware. This patch generalizes the concept so that a similar
interface can also be used in other contexts. For that it introduces
new structures and related functions *without* touching the existing
struct clocksource.

The reasons for adding these new structures to clocksource.[ch] are
* the APIs are clearly related
* struct clocksource could be cleaned up to use the new structs
* avoids proliferation of files with similar names (timesource.h?
  timecounter.h?)

As outlined in the discussion with John Stultz, this patch adds
* struct cyclecounter: stateless API to hardware which counts clock cycles
* struct timecounter: stateful utility code built on a cyclecounter which
  provides a nanosecond counter
* only the function to read the nanosecond counter; deltas are used internally
  and not exposed to users of timecounter

The code does no locking of the shared state. It must be called at least
as often as the cycle counter wraps around to detect these wrap arounds.
Both is the responsibility of the timecounter user.

Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-15 22:43:31 -08:00
Henrik Austad
a0a522ce3d sched: idle_at_tick is only used when CONFIG_SMP is set
Impact: struct rq size optimization

The idle_at_tick in struct rq is only used in SMP settings
and it does not make sense to have this in the rq in an UP setup.

Signed-off-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 21:16:10 +01:00
Ingo Molnar
5274f8354d Merge branch 'sched/urgent'; commit 'v2.6.29-rc5' into sched/core 2009-02-15 21:15:16 +01:00
Ingo Molnar
72b623c736 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/power-tracer 2009-02-15 20:43:03 +01:00
Rakib Mullick
a234aa9ecd tracing: fix section mismatch in trace_hw_branches.c
The function bts_trace_init() references a variable
bts_hotcpu_notifier which is marked
as __cpuinitdata. Thus causes section mismatch. This patch fixes it.

   LD      kernel/trace/built-in.o
 WARNING: kernel/trace/built-in.o(.text+0xc90c): Section mismatch in
 reference from the function bts_trace_init() to the variable
 .cpuinit.data:bts_hotcpu_notifier
 The function bts_trace_init() references
 the variable __cpuinitdata bts_hotcpu_notifier.
 This is often because bts_trace_init lacks a __cpuinitdata
 annotation or the annotation of bts_hotcpu_notifier is wrong.

 WARNING: kernel/trace/built-in.o(.text+0xc92a): Section mismatch in
 reference from the function bts_trace_reset() to the variable
 .cpuinit.data:bts_hotcpu_notifier
 The function bts_trace_reset() references
 the variable __cpuinitdata bts_hotcpu_notifier.
 This is often because bts_trace_reset lacks a __cpuinitdata
 annotation or the annotation of bts_hotcpu_notifier is wrong.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Cc: markus.t.metzger@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 20:41:08 +01:00
Ingo Molnar
9abd603048 Merge branches 'tracing/ftrace' and 'tracing/urgent' into tracing/core
Conflicts:
	kernel/trace/trace_mmiotrace.c
2009-02-15 20:08:55 +01:00
Pekka Paalanen
6bc5c366b1 trace: mmiotrace to the tracer menu in Kconfig
Impact: cosmetic change in Kconfig menu layout

This patch was originally suggested by Peter Zijlstra, but seems it
was forgotten.

CONFIG_MMIOTRACE and CONFIG_MMIOTRACE_TEST were selectable
directly under the Kernel hacking / debugging menu in the kernel
configuration system. They were present only for x86 and x86_64.

Other tracers that use the ftrace tracing framework are in their own
sub-menu. This patch moves the mmiotrace configuration options there.
Since the Kconfig file, where the tracer menu is, is not architecture
specific, HAVE_MMIOTRACE_SUPPORT is introduced and provided only by
x86/x86_64. CONFIG_MMIOTRACE now depends on it.

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 20:03:28 +01:00
Pekka Paalanen
391b170f10 mmiotrace: count events lost due to not recording
Impact: enhances lost events counting in mmiotrace

The tracing framework, or the ring buffer facility it uses, has a switch
to stop recording data. When recording is off, the trace events will be
lost. The framework does not count these, so mmiotrace has to count them
itself.

Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 20:02:42 +01:00
Ingo Molnar
ae88a23b32 irq: refactor and clean up the free_irq() code flow
Impact: cleanup

- separate out the loop from the actual freeing logic, this wins us
  two indentation levels allowing a number of followup prettifications

- turn the WARN_ON() into a more informative WARN().

- clean up the comments and the code flow some more

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 11:36:49 +01:00
Ingo Molnar
327ec5699c irq: clean up manage.c
- make printk message git-greppable
- fix a few style details

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 11:21:37 +01:00
Ingo Molnar
b69bc39674 Merge commit 'v2.6.29-rc5' into x86/apic 2009-02-15 09:00:18 +01:00
David S. Miller
5e30589521 Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn.c
	drivers/net/wireless/iwlwifi/iwl3945-base.c
2009-02-14 23:12:00 -08:00
Peter Zijlstra
868a23a804 lockdep: build fix for !PROVE_LOCKING
The __GFP_FS annotations fail to build with CONFIG_LOCKDEP=y,
CONFIG_PROVE_LOCKING=n, ammend that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-15 00:30:26 +01:00
Peter Zijlstra
9833f8cb95 lockstat: warn about disabled lock debugging
Avoid confusion and clearly state lock debugging got disabled.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:28 +01:00
Peter Zijlstra
b4b136f44b lockdep: use stringify.h
Arnd pointed out we have the stringify macro magic already in-kernel.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:26 +01:00
Peter Zijlstra
4f367d8adc lockdep: simplify check_prev_add_irq()
Remove the manual state iteration thingy.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:24 +01:00
Peter Zijlstra
f510b233cf lockdep: get_user_chars() redo
Generic, states independent, get_user_chars().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:22 +01:00
Peter Zijlstra
3ff176ca47 lockdep: simplify get_user_chars()
there's too much repetition of code..

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:21 +01:00
Peter Zijlstra
38aa271438 lockdep: add comments to mark_lock_irq()
re-add some of the comments that got lost in the refactoring.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:19 +01:00
Peter Zijlstra
cf2ad4d13c lockdep: remove macro usage from mark_held_locks()
Now that we have nice numerical relations for the states, remove the macro
magics.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:17 +01:00
Peter Zijlstra
9d3651a23d lockdep: fully reduce mark_lock_irq()
Now what its only two functions, they again look rather similar.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:15 +01:00
Peter Zijlstra
42c50d544e lockdep: merge the !_READ mark_lock_irq() helpers
These two are also remakably similar

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:13 +01:00
Peter Zijlstra
780e820b2d lockdep: merge the _READ mark_lock_irq() helpers
The _READ helpers show remarkable similarity, merge them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:12 +01:00
Peter Zijlstra
cd95302d25 lockdep: simplify mark_lock_irq() helpers #3
Kill another argument

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:10 +01:00
Peter Zijlstra
f989209e2f lockdep: further simplify mark_lock_irq() helpers
take away another parameter

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:08 +01:00
Peter Zijlstra
604de3b5b6 lockdep: simplify the mark_lock_irq() helpers
In order to unify them, take some arguments away

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:06 +01:00
Peter Zijlstra
6a6904d347 lockdep: split up mark_lock_irq()
split mark_lock_irq() into 4 simple helper functions

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:04 +01:00
Peter Zijlstra
fabe9c42c6 lockdep: generate usage strings
generate the usage strings

XXX capital invasion :-(

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:02 +01:00
Peter Zijlstra
d7b1b02134 lockdep: generate the state bit definitions
Generate the state bit definitions from the lockdep_states.h file.

Also, move LOCK_USED to last, so that the

 USED_IN
 USED_IN_READ
 ENABLED
 ENABLED_READ

states are nicely bit aligned -- we're going to use that property

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:28:01 +01:00
Peter Zijlstra
9851673bc3 lockdep: move state bit definitions around
For convenience later.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:27:59 +01:00
Peter Zijlstra
5346417e17 lockdep: simplify mark_lock()
remove the state iteration

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:27:57 +01:00
Peter Zijlstra
36bfb9bb03 lockdep: simplify mark_held_locks
remove the explicit state iteration

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:27:56 +01:00
Peter Zijlstra
9fe51abf7a lockdep: lockdep_states.h
Introduce a header file to generate all the states from.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:27:54 +01:00
Peter Zijlstra
a652d7081b lockdep: sanitize reclaim bit names
s/HELD_OVER/ENABLED/g

so that its similar to the hard and soft-irq names.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:27:52 +01:00
Peter Zijlstra
4fc95e867f lockdep: sanitize bit names
s/\(LOCKF\?_ENABLED_[^ ]*\)S\(_READ\)\?\>/\1\2/g

So that the USED_IN and ENABLED have the same names.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:27:51 +01:00
Nick Piggin
cf40bd16fd lockdep: annotate reclaim context (__GFP_NOFS)
Here is another version, with the incremental patch rolled up, and
added reclaim context annotation to kswapd, and allocation tracing
to slab allocators (which may only ever reach the page allocator
in rare cases, so it is good to put annotations here too).

Haven't tested this version as such, but it should be getting closer
to merge worthy ;)

--
After noticing some code in mm/filemap.c accidentally perform a __GFP_FS
allocation when it should not have been, I thought it might be a good idea to
try to catch this kind of thing with lockdep.

I coded up a little idea that seems to work. Unfortunately the system has to
actually be in __GFP_FS page reclaim, then take the lock, before it will mark
it. But at least that might still be some orders of magnitude more common
(and more debuggable) than an actual deadlock condition, so we have some
improvement I hope (the concept is no less complete than discovery of a lock's
interrupt contexts).

I guess we could even do the same thing with __GFP_IO (normal reclaim), and
even GFP_NOIO locks too... but filesystems will have the most locks and fiddly
code paths, so let's start there and see how it goes.

It *seems* to work. I did a quick test.

=================================
[ INFO: inconsistent lock state ]
2.6.28-rc6-00007-ged31348-dirty #26
---------------------------------
inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage.
modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]
{in-reclaim-W} state was registered at:
  [<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60
  [<ffffffff80268f71>] lock_acquire+0x91/0xc0
  [<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310
  [<ffffffffa002002b>] brd_init+0x2b/0x216 [brd]
  [<ffffffff8020903b>] _stext+0x3b/0x170
  [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
  [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b
  [<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 3929
hardirqs last  enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310
hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310
softirqs last  enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0
softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0

other info that might help us debug this:
1 lock held by modprobe/8526:
 #0:  (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]

stack backtrace:
Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged31348-dirty #26
Call Trace:
 [<ffffffff80265483>] print_usage_bug+0x193/0x1d0
 [<ffffffff80266530>] mark_lock+0xaf0/0xca0
 [<ffffffff80266735>] mark_held_locks+0x55/0xc0
 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
 [<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60
 [<ffffffff80285005>] __alloc_pages_internal+0x475/0x580
 [<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310
 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
 [<ffffffffa002006a>] brd_init+0x6a/0x216 [brd]
 [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
 [<ffffffff8020903b>] _stext+0x3b/0x170
 [<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10
 [<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180
 [<ffffffff802669ec>] ? trace_hardirqs_on_caller+0x12c/0x190
 [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
 [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-14 23:27:49 +01:00
Johannes Berg
6f2b9b9a9d timer: implement lockdep deadlock detection
This modifies the timer code in a way to allow lockdep to detect
deadlocks resulting from a lock being taken in the timer function
as well as around the del_timer_sync() call.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
2009-02-14 23:25:52 +01:00
Serge E. Hallyn
fb5ae64fdd User namespaces: Only put the userns when we unhash the uid
uids in namespaces other than init don't get a sysfs entry.

For those in the init namespace, while we're waiting to remove
the sysfs entry for the uid the uid is still hashed, and
alloc_uid() may re-grab that uid without getting a new
reference to the user_ns, which we've already put in free_user
before scheduling remove_user_sysfs_dir().

Reported-and-tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-13 08:07:40 -08:00
Jason Baron
b5f9fd0f8a tracing: convert c/p state power tracer to use tracepoints
Convert the c/p state "power" tracer to use tracepoints. Avoids a
function call when the tracer is disabled.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-13 09:06:18 -05:00
Peter Zijlstra
3997ad317f timers: more consistently use clock vs timer
While reviewing the manpages, I noticed I'd missed some clock vs timer sites.

Make sure that all timer functions call cpu_timer_sample_group() and not
cpu_clock_sample_group(). This ensures that we enable the process wide timer
in time, and therefore pay the O(n) thread group cost from the syscall.

Not doing it here, will result in the first jiffy tick after setting the timer
doing this, resulting in a very expensive tick (but only once) and a delay in
actually starting the timer.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-13 13:04:05 +01:00
Ingo Molnar
8f8573ae9f Merge branches 'irq/genirq', 'irq/sparseirq' and 'irq/urgent' into irq/core 2009-02-13 11:57:18 +01:00
Johannes Weiner
0e43785c57 irq: use GFP_KERNEL for action allocation in request_irq()
request_irq() calls into proc code via __setup_irq() which is not safe
in an atomic context, so request_irq() can itself use the more
reliable GFP_KERNEL allocation for the action descriptor.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-13 10:52:07 +01:00
Ingo Molnar
d351c8db95 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-13 10:26:45 +01:00
Ingo Molnar
1c511f740f Merge branches 'tracing/ftrace', 'tracing/ring-buffer', 'tracing/sysprof', 'tracing/urgent' and 'linus' into tracing/core 2009-02-13 10:25:18 +01:00
Ingo Molnar
ab639f3593 Merge branch 'core/percpu' into x86/core 2009-02-13 09:45:09 +01:00
Ingo Molnar
f8a6b2b9ce Merge branch 'linus' into x86/apic
Conflicts:
	arch/x86/kernel/acpi/boot.c
	arch/x86/mm/fault.c
2009-02-13 09:44:22 +01:00
Steven Rostedt
45141d4667 ring-buffer: rename label out_unlock to out_reset
Impact: clean up

While reviewing the ring buffer code, I thougth I saw a bug with

	if (!__raw_spin_trylock(&cpu_buffer->lock))
		goto out_unlock;

But I forgot that we use a variable "lock_taken" that is set if
the spinlock is taken, and only unlock it if that variable is set.

To avoid further confusion from other reviewers, this patch
renames the label out_unlock with out_reset, which is the more
appropriate name.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-12 13:39:46 -05:00
Ingo Molnar
9f8d979f08 softlockup: move 'one' to the softlockup section in sysctl.c
CONFIG_SOFTLOCKUP=y || CONFIG_DETECT_HUNG_TASKS=y is now the only user
of the 'one' constant in kernel/sysctl.c. Move it to the softlockup
block of constants.

This fixes a GCC warning.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-12 17:08:15 +01:00
Ingo Molnar
871cafcc96 Merge branch 'linus' into core/softlockup 2009-02-12 13:08:57 +01:00
Ingo Molnar
a0490fa35d sched: cpu hotplug fix
rq_attach_root() does a kfree() with the runqueue lock held.

That's not a very wise move, fix it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-12 11:57:36 +01:00
Li Zefan
cfebe563bd cgroups: fix lockdep subclasses overflow
I enabled all cgroup subsystems when compiling kernel, and then:
 # mount -t cgroup -o net_cls xxx /mnt
 # mkdir /mnt/0

This showed up immediately:
 BUG: MAX_LOCKDEP_SUBCLASSES too low!
 turning off the locking correctness validator.

It's caused by the cgroup hierarchy lock:
	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
		struct cgroup_subsys *ss = subsys[i];
		if (ss->root == root)
			mutex_lock_nested(&ss->hierarchy_mutex, i);
	}

Now we have 9 cgroup subsystems, and the above 'i' for net_cls is 8, but
MAX_LOCKDEP_SUBCLASSES is 8.

This patch uses different lockdep keys for different subsystems.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:36 -08:00
Sven Wegener
fc3501d411 mm: fix dirty_bytes/dirty_background_bytes sysctls on 64bit arches
We need to pass an unsigned long as the minimum, because it gets casted
to an unsigned long in the sysctl handler. If we pass an int, we'll
access four more bytes on 64bit arches, resulting in a random minimum
value.

[rientjes@google.com: fix type of `old_bytes']
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Peter Zijlstra
2fff78c784 futex: fix reference leak
Catalin noticed that (38d47c1b70: futex: rely on get_user_pages() for
shared futexes) caused an mm_struct leak.

Some tracing with the function graph tracer quickly pointed out that
futex_wait() has exit paths with unbalanced reference counts.

This regression was discovered by kmemleak.

Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 18:24:08 +01:00
Linus Torvalds
6c6f1f0f4d Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: revert recent sync wakeup changes
2009-02-11 08:25:06 -08:00
Linus Torvalds
94dba89533 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  timers: fix TIMER_ABSTIME for process wide cpu timers
  timers: split process wide cpu clocks/timers, fix
  x86: clean up hpet timer reinit
  timers: split process wide cpu clocks/timers, remove spurious warning
  timers: split process wide cpu clocks/timers
  signal: re-add dead task accumulation stats.
  x86: fix hpet timer reinit for x86_64
  sched: fix nohz load balancer on cpu offline
2009-02-11 08:24:32 -08:00
Linus Torvalds
9ce04f9238 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  ptrace, x86: fix the usage of ptrace_fork()
  i8327: fix outb() parameter order
  x86: fix math_emu register frame access
  x86: math_emu info cleanup
  x86: include correct %gs in a.out core dump
  x86, vmi: put a missing paravirt_release_pmd in pgd_dtor
  x86: find nr_irqs_gsi with mp_ioapic_routing
  x86: add clflush before monitor for Intel 7400 series
  x86: disable intel_iommu support by default
  x86: don't apply __supported_pte_mask to non-present ptes
  x86: fix grammar in user-visible BIOS warning
  x86/Kconfig.cpu: make Kconfig help readable in the console
  x86, 64-bit: print DMI info in the oops trace
2009-02-11 08:23:22 -08:00
Peter Zijlstra
fc631c82e1 sched: revert recent sync wakeup changes
Intel reported a 10% regression (mysql+sysbench) on a 16-way machine
with these patches:

  1596e29: sched: symmetric sync vs avg_overlap
  d942fb6: sched: fix sync wakeups

Revert them.

Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Bisected-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 14:43:35 +01:00
Peter Zijlstra
4da94d49b2 timers: fix TIMER_ABSTIME for process wide cpu timers
The POSIX timer interface allows for absolute time expiry values through the
TIMER_ABSTIME flag, therefore we have to synchronize the timer to the clock
every time we start it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 14:04:21 +01:00
Peter Zijlstra
3fccfd67df timers: split process wide cpu clocks/timers, fix
To decrease the chance of a missed enable, always enable the timer when we
sample it, we'll always disable it when we find that there are no active timers
in the jiffy tick.

This fixes a flood of warnings reported by Mike Galbraith.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 14:04:19 +01:00
Arnaldo Carvalho de Melo
00f62f614b ring_buffer: pahole struct ring_buffer
While fixing some bugs in pahole (built-in.o files were not being
processed due to relocation problems) I found out about these packable
structures:

$ pahole --packable kernel/trace/ring_buffer.o  | grep ring
ring_buffer	72	64	8
ring_buffer_per_cpu	112	104	8

If we take a look at the current layout of struct ring_buffer we can see
that we have two 4 bytes holes.

$ pahole -C ring_buffer kernel/trace/ring_buffer.o
struct ring_buffer {
	unsigned int               pages;           /*     0     4 */
	unsigned int               flags;           /*     4     4 */
	int                        cpus;            /*     8     4 */

	/* XXX 4 bytes hole, try to pack */

	cpumask_var_t              cpumask;         /*    16     8 */
	atomic_t                   record_disabled; /*    24     4 */

	/* XXX 4 bytes hole, try to pack */

	struct mutex               mutex;           /*    32    32 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct ring_buffer_per_cpu * * buffers;     /*    64     8 */

	/* size: 72, cachelines: 2, members: 7 */
	/* sum members: 64, holes: 2, sum holes: 8 */
	/* last cacheline: 8 bytes */
};

So, if I ask pahole to reorganize it:

$ pahole -C ring_buffer --reorganize kernel/trace/ring_buffer.o

struct ring_buffer {
	unsigned int               pages;           /*     0     4 */
	unsigned int               flags;           /*     4     4 */
	int                        cpus;            /*     8     4 */
	atomic_t                   record_disabled; /*    12     4 */
	cpumask_var_t              cpumask;         /*    16     8 */
	struct mutex               mutex;           /*    24    32 */
	struct ring_buffer_per_cpu * * buffers;     /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */

	/* size: 64, cachelines: 1, members: 7 */
};   /* saved 8 bytes and 1 cacheline! */

We get it using just one 64 bytes cacheline.

To see what it did:

$ pahole -C ring_buffer --reorganize --show_reorg_steps \
	kernel/trace/ring_buffer.o | grep \/
/* Moving 'record_disabled' from after 'cpumask' to after 'cpus' */

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 13:21:40 +01:00
Frederic Weisbecker
b22f485812 tracing/sysprof: add missing tracing_{start,stop}_record_cmdline()
Add the missing pair tracing_{start,stop}_record_cmdline() to record well
the cmdline associated with pid.

Changes in v2:

- fix a build error, the sched_switch tracer is needed to record the
  cmdline.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 12:55:19 +01:00
Frederic Weisbecker
cf2592f59c softlockup: ensure the task has been switched out once
When we check if a task has been switched out since the last scan, we might
have a race condition on the following scenario:

- the task is freshly created and scheduled

- it puts its state to TASK_UNINTERRUPTIBLE and is not yet switched out

- check_hung_task() scans this task and will report a false positive because
  t->nvcsw + t->nivcsw == t->last_switch_count == 0

Add a check for such cases.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 11:04:16 +01:00
Oleg Nesterov
06eb23b1ba ptrace, x86: fix the usage of ptrace_fork()
I noticed by pure accident we have ptrace_fork() and friends. This was
added by "x86, bts: add fork and exit handling", commit
bf53de907d.

I can't test this, ds_request_bts() returns -EOPNOTSUPP, but I strongly
believe this needs the fix. I think something like this program

	int main(void)
	{
		int pid = fork();

		if (!pid) {
			ptrace(PTRACE_TRACEME, 0, NULL, NULL);
			kill(getpid(), SIGSTOP);
			fork();
		} else {
			struct ptrace_bts_config bts = {
				.flags = PTRACE_BTS_O_ALLOC,
				.size  = 4 * 4096,
			};

			wait(NULL);

			ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACEFORK);
			ptrace(PTRACE_BTS_CONFIG, pid, &bts, sizeof(bts));
			ptrace(PTRACE_CONT, pid, NULL, NULL);

			sleep(1);
		}

		return 0;
	}

should crash the kernel.

If the task is traced by its natural parent ptrace_reparented() returns 0
but we should clear ->btsxxx anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 10:32:46 +01:00
Arjan van de Ven
ad0b0fd554 sched, latencytop: incorporate review feedback from Andrew Morton
Andrew had some suggestions for the latencytop file; this patch takes care
of most of these:

* Add documentation
* Turn account_scheduler_latency into an inline function
* Don't report negative values to userspace
* Make the file operations struct const
* Fix a few checkpatch.pl warnings

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 10:18:04 +01:00
Ingo Molnar
f437e8b53e Merge commit 'v2.6.29-rc4' into sched/core 2009-02-11 10:17:42 +01:00
Hannes Eder
e7669b8e32 tracing: fix sparse warning: attribute function with __acquires/__releases
Fix this sparse warning:

  kernel/trace/trace.c:458:9: warning: context imbalance in 'register_tracer' - unexpected unlock

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 10:15:42 +01:00
Hannes Eder
5e39841c45 tracing: fix sparse warnings: fix (un-)signedness
Fix these sparse warnings:

  kernel/trace/ring_buffer.c:70:37: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/ring_buffer.c:84:39: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/ring_buffer.c:96:43: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/ring_buffer.c:2500:40: warning: incorrect type in argument 3 (different signedness)
  kernel/trace/ring_buffer.c:2505:44: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/ring_buffer.c:2507:46: warning: incorrect type in argument 2 (different signedness)
  kernel/trace/trace.c:2130:40: warning: incorrect type in argument 3 (different signedness)
  kernel/trace/trace.c:2280:40: warning: incorrect type in argument 3 (different signedness)

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 10:15:42 +01:00
Hannes Eder
4fd2735881 tracing: fix sparse warnings: make symbols static
Impact: make global variables and a global function static

The function '__trace_userstack' does not seem to have a caller, so it
is commented out.

Fix this sparse warnings:
  kernel/trace/trace.c:82:5: warning: symbol 'tracing_disabled' was not declared. Should it be static?
  kernel/trace/trace.c:600:10: warning: symbol 'trace_record_cmdline_disabled' was not declared. Should it be static?
  kernel/trace/trace.c:957:6: warning: symbol '__trace_userstack' was not declared. Should it be static?
  kernel/trace/trace.c:1694:5: warning: symbol 'tracing_release' was not declared. Should it be static?

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 10:15:41 +01:00
Ingo Molnar
4040068dce Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-11 10:03:53 +01:00
Wenji Huang
c3706f005c tracing: fix typos in comments
Impact: clean up.

Fix typos in the comments.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-10 12:32:35 -05:00
Wenji Huang
810dc73265 tracing: provide correct return value after outputting the event
This patch is to make the function return early on failure, and give
correct return value on success.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-10 12:32:33 -05:00
Wenji Huang
f54fc98aa6 tracing: remove unneeded variable
Impact: clean up.

Remove the unnecessary variable ret.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-10 12:32:18 -05:00
Tobias Klauser
4543ae7ce1 tracing: storage class should be before const qualifier
The C99 specification states in section 6.11.5:

The placement of a storage-class specifier other than at the beginning
of the declaration specifiers in a declaration is an obsolescent
feature.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-10 11:58:45 -05:00
Lai Jiangshan
667d241258 ring_buffer: fix ring_buffer_read_page()
Impact: change API and init bpage when copy

ring_buffer_read_page()/rb_remove_entries() may be called for
a partially consumed page.

Add a parameter for rb_remove_entries() and make it update
cpu_buffer->entries correctly for partially consumed pages.

ring_buffer_read_page() now returns the offset to the next event.

Init the bpage's time_stamp when return value is 0.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-10 09:17:37 -05:00
Lai Jiangshan
b85fa01ed9 ring_buffer: fix typing mistake
Impact: Fix bug

I found several very very curious line.
It's so curious that it may be brought by typing mistake.

When (cpu_buffer->reader_page == cpu_buffer->commit_page):

1) We haven't copied it for bpage is changed:
   bpage = cpu_buffer->reader_page->page;
   memcpy(bpage->data, cpu_buffer->reader_page->page->data + read ... )
2) We need update cpu_buffer->reader_page->read, but
   "cpu_buffer->reader_page += read;" is not right.

[
  This bug was a typo. The commit->reader_page is a page pointer
  and not an index into the page. The line should have been
  commit->reader_page->read += read.  The other changes
  by Lai are nice clean ups to the code.  - SDR
]

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-10 09:17:19 -05:00
Ingo Molnar
ae216dd239 Merge branch 'tip/tracing/ftrace' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace 2009-02-10 13:58:28 +01:00
Ingo Molnar
f9915bfef3 Merge branches 'tracing/ftrace' and 'tracing/urgent' into tracing/core 2009-02-10 13:25:42 +01:00
Hugh Dickins
acd895795d profiling: fix broken profiling regression
Impact: fix broken /proc/profile on UP machines

Commit c309b917ca "cpumask: convert
kernel/profile.c" broke profiling.  prof_cpu_mask was previously
initialized to CPU_MASK_ALL, but left uninitialized in that commit.
We need to copy cpu_possible_mask (cpu_online_mask is not enough).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-10 00:50:37 +01:00
Tejun Heo
5d707e9c8e stackprotector: update make rules
Impact: no default -fno-stack-protector if stackp is enabled, cleanup

Stackprotector make rules had the following problems.

* cc support test and warning are scattered across makefile and
  kernel/panic.c.

* -fno-stack-protector was always added regardless of configuration.

Update such that cc support test and warning are contained in makefile
and -fno-stack-protector is added iff stackp is turned off.  While at
it, prepare for 32bit support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-10 00:41:54 +01:00
Tejun Heo
6cd61c0baa elf: add ELF_CORE_COPY_KERNEL_REGS()
ELF core dump is used for both user land core dump and kernel crash
dump.  Depending on architecture, register might need to be accessed
differently for userland and kernel.  Allow architectures to define
ELF_CORE_COPY_KERNEL_REGS() and use different operation for kernel
register dump.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-10 00:41:26 +01:00
Steven Rostedt
34cd4998d3 tracing: clean up splice code
Ingo Molnar suggested a series of clean ups for the splice code.
This patch implements those suggestions.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-09 12:24:58 -05:00
Eduard - Gabriel Munteanu
ff98781bab tracing: Move pipe waiting code out of tracing_read_pipe().
This moves the pipe waiting code from tracing_read_pipe() into
tracing_wait_pipe(), which is useful to implement other fops, like
splice_read.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-09 12:24:51 -05:00
Eduard - Gabriel Munteanu
3c56819b14 tracing: splice support for tracing_pipe
Added and implemented tracing_pipe_fops->splice_read(). This allows
userspace programs to get tracing data more efficiently.

Signed-off-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-09 12:24:34 -05:00
Ingo Molnar
249d51b53a Merge commit 'v2.6.29-rc4' into core/percpu
Conflicts:
	arch/x86/mach-voyager/voyager_smp.c
	arch/x86/mm/fault.c
2009-02-09 14:58:11 +01:00
Frederic Weisbecker
b91facc367 tracing/function-graph-tracer: handle the leaf functions from trace_pipe
When one cats the trace file, the leaf functions are printed without brackets:

 function();

whereas in the trace_pipe file we'll see the following:

 function() {
 }

This is because the ring_buffer handling is not the same between those two files.
On the trace file, when an entry is printed, the iterator advanced and then we can
check the next entry.

There is no iterator with trace_pipe, the current entry to print has been peeked
and not consumed. So checking the next entry will still return the current one while
we don't consume it.

This patch introduces a new value for the output callbacks to ask the tracing
core to not consume the current entry after printing it.

We need it because we will have to consume the current entry ourself to check
the next one.

Now the trace_pipe is able to handle well the leaf functions.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 12:37:27 +01:00
Ingo Molnar
1dfba05d0f tracing/blktrace: move the tracing file to kernel/trace, fix
Impact: build fix

The BLK_DEV_IO_TRACE entry used to be in block/Kconfig - which
file itself was dependent on CONFIG_BLOCK. But now the entry is
in kernel/trace/Kconfig - which is present even on !CONFIG_BLOCK.

So add a 'depends on BLOCK' to BLK_DEV_IO_TRACE.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 12:07:28 +01:00
Mandeep Singh Baines
17406b82d6 softlockup: remove timestamp checking from hung_task
Impact: saves sizeof(long) bytes per task_struct

By guaranteeing that sysctl_hung_task_timeout_secs have elapsed between
tasklist scans we can avoid using timestamps.

Signed-off-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 11:03:49 +01:00
Arnaldo Carvalho de Melo
b5db03c435 tracing: handle unregistering the current tracer
Impact: simplification

Instead of requiring that plugins have the sequence:

  my_tracer_stop(my_trace_array);
  unregister_tracer(my_tracer);

it should be possible just do a:

  unregister_tracer(my_tracer);

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 10:56:53 +01:00
Frederic Weisbecker
3861a17bcc tracing/function-graph-tracer: drop the kernel_text_address check
When the function graph tracer picks a return address, it ensures this address
is really a kernel text one by calling __kernel_text_address()

Actually this path has never been taken.Its role was more likely to debug the tracer
on the beginning of its development but this function is wasteful since it is called
for every traced function.

The fault check is already sufficient.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 10:51:38 +01:00
Frederic Weisbecker
1292211058 tracing/power: move the power trace headers to a dedicated file
Impact: cleanup

Move the power tracer headers to trace/power.h to keep ftrace.h and power bits
more easy to maintain as separated topics.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 10:51:38 +01:00
Frederic Weisbecker
7447dce96f tracing/function-graph-tracer: provide a selftest for the function graph tracer
Making it more easy to do a basic regression test for this tracer.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 10:51:37 +01:00
Frederic Weisbecker
2db270a80b tracing/blktrace: move the tracing file to kernel/trace
Impact: cleanup

Move blktrace.c to kernel/trace, also move its config entry.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 10:51:02 +01:00
Ingo Molnar
44b0635481 Merge branch 'tip/tracing/core/devel' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/ftrace
Conflicts:
	kernel/trace/trace_hw_branches.c
2009-02-09 10:35:12 +01:00
Ingo Molnar
4ad476e11f Merge commit 'v2.6.29-rc4' into tracing/core 2009-02-09 10:32:48 +01:00
Hannes Eder
548c893380 kernel/irq: fix sparse warning: make symbol static
While being at it make every occurrence of 'do_irq_select_affinity'
have the same signature in terms of signedness of the first argument.

Fix this sparse warning:
  kernel/irq/manage.c:112:5: warning: symbol 'do_irq_select_affinity' was not declared. Should it be static?

Also rename do_irq_select_affinity() to setup_affinity() - shorter name
and clearer naming.

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Acked-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 10:14:05 +01:00
Yinghai Lu
005bf0e6fa irq: optimize init_kstat_irqs/init_copy_kstat_irqs
Simplify and make init_kstat_irqs etc more type proof, suggested by
Andrew.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 09:02:34 +01:00
Yinghai Lu
0f3c2a89c1 irq: clear kstat_irqs
Impact: get correct kstat_irqs [/proc/interrupts] for msi/msi-x etc

need to call clear_kstat_irqs(), so when we reuse that irq_desc,
we get correct kstat in /proc/interrupts.

This makes /proc/interrupts not have <NULL> entries.

Don't need to worry about arch that doesn't support genirq, because they
will not call dynamic_irq_cleanup().

v2: simplify and make clear_kstat_irqs more robust

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-09 08:55:08 +01:00
Ingo Molnar
5ba1ae92b6 Merge branches 'timers/clockevents', 'timers/hpet', 'timers/hrtimers' and 'timers/urgent' into timers/core 2009-02-08 20:14:11 +01:00
Ingo Molnar
140573d33b Merge branches 'sched/rt' and 'sched/urgent' into sched/core 2009-02-08 20:12:46 +01:00
Stefan Richter
f7de7621f0 async: use list_move_tail
list.h provides a dedicated primitive for
"list_del followed by list_add_tail"... list_move_tail.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
2009-02-08 10:00:26 -08:00
Cornelia Huck
766ccb9ed4 async: Rename _special -> _domain for clarity.
Rename the async_*_special() functions to async_*_domain(), which
describes the purpose of these functions much better.
[Broke up long lines to silence checkpatch]

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08 09:56:11 -08:00
Cornelia Huck
f30d5b307c async: Add some documentation.
Add some kerneldoc to the async interface.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08 09:56:11 -08:00
Cornelia Huck
86532d8b16 async: Handle kthread_run() return codes.
If we fail to create the manager thread, fall back to non-fastboot.
If we fail to create an async thread, try again after waiting for
a bit.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08 09:56:10 -08:00
Cornelia Huck
7a89bbc749 async: Fix running list handling.
async_schedule() should pass in async_running as the running
list, and run_one_entry() should put the entry to be run on
the provided running list instead of always on the generic one.

Reported-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
2009-02-08 09:56:10 -08:00
Wenji Huang
57794a9d48 trace: trivial fixes in comment typos.
Impact: clean up

Fixed several typos in the comments.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-07 20:03:36 -05:00
Steven Rostedt
a81bd80a0b ring-buffer: use generic version of in_nmi
Impact: clean up

Now that a generic in_nmi is available, this patch removes the
special code in the ring_buffer and implements the in_nmi generic
version instead.

With this change, I was also able to rename the "arch_ftrace_nmi_enter"
back to "ftrace_nmi_enter" and remove the code from the ring buffer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-07 20:03:33 -05:00
Steven Rostedt
78d904b46a ring-buffer: add NMI protection for spinlocks
Impact: prevent deadlock in NMI

The ring buffers are not yet totally lockless with writing to
the buffer. When a writer crosses a page, it grabs a per cpu spinlock
to protect against a reader. The spinlocks taken by a writer are not
to protect against other writers, since a writer can only write to
its own per cpu buffer. The spinlocks protect against readers that
can touch any cpu buffer. The writers are made to be reentrant
with the spinlocks disabling interrupts.

The problem arises when an NMI writes to the buffer, and that write
crosses a page boundary. If it grabs a spinlock, it can be racing
with another writer (since disabling interrupts does not protect
against NMIs) or with a reader on the same CPU. Luckily, most of the
users are not reentrant and protects against this issue. But if a
user of the ring buffer becomes reentrant (which is what the ring
buffers do allow), if the NMI also writes to the ring buffer then
we risk the chance of a deadlock.

This patch moves the ftrace_nmi_enter called by nmi_enter() to the
ring buffer code. It replaces the current ftrace_nmi_enter that is
used by arch specific code to arch_ftrace_nmi_enter and updates
the Kconfig to handle it.

When an NMI is called, it will set a per cpu variable in the ring buffer
code and will clear it when the NMI exits. If a write to the ring buffer
crosses page boundaries inside an NMI, a trylock is used on the spin
lock instead. If the spinlock fails to be acquired, then the entry
is discarded.

This bug appeared in the ftrace work in the RT tree, where event tracing
is reentrant. This workaround solved the deadlocks that appeared there.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-07 20:00:17 -05:00
Steven Rostedt
1830b52d0d trace: remove deprecated entry->cpu
Impact: fix to prevent developers from using entry->cpu

With the new ring buffer infrastructure, the cpu for the entry is
implicit with which CPU buffer it is on.

The original code use to record the current cpu into the generic
entry header, which can be retrieved by entry->cpu. When the
ring buffer was introduced, the users were convert to use the
the cpu number of which cpu ring buffer was in use (this was passed
to the tracers by the iterator: iter->cpu).

Unfortunately, the cpu item in the entry structure was never removed.
This allowed for developers to use it instead of the proper iter->cpu,
unknowingly, using an uninitialized variable. This was not the fault
of the developers, since it would seem like the logical place to
retrieve the cpu identifier.

This patch removes the cpu item from the entry structure and fixes
all the users that should have been using iter->cpu.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
2009-02-07 19:38:43 -05:00
Ingo Molnar
673f820591 Merge branch 'linus' into core/locking
Conflicts:
	fs/btrfs/locking.c
2009-02-07 18:31:54 +01:00
Len Brown
2d29c6a075 Merge branches 'release', 'asus', 'bugzilla-12450', 'cpuidle', 'debug', 'ec', 'misc', 'printk' and 'processor' into release 2009-02-07 01:34:56 -05:00
Li Zefan
04ec93fe9b fork.c: fix NULL pointer dereference when nr_threads == threads-max
I happened to forked lots of processes, and hit NULL pointer dereference.
It is because in copy_process() after checking max_threads, 0 is returned
but not -EAGAIN.

The bug is introduced by "CRED: Detach the credentials from task_struct"
(commit f1752eec61).

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-06 08:43:11 -08:00
Arnaldo Carvalho de Melo
b6f11df26f trace: Call tracing_reset_online_cpus before tracer->init()
Impact: cleanup

To make it easy for ftrace plugin writers, as this was open coded in
the existing plugins

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-06 01:01:41 +01:00
Arnaldo Carvalho de Melo
51a763dd84 tracing: Introduce trace_buffer_{lock_reserve,unlock_commit}
Impact: new API

These new functions do what previously was being open coded, reducing
the number of details ftrace plugin writers have to worry about.

It also standardizes the handling of stacktrace, userstacktrace and
other trace options we may introduce in the future.

With this patch, for instance, the blk tracer (and some others already
in the tree) can use the "userstacktrace" /d/tracing/trace_options
facility.

$ codiff /tmp/vmlinux.before /tmp/vmlinux.after
linux-2.6-tip/kernel/trace/trace.c:
  trace_vprintk              |   -5
  trace_graph_return         |  -22
  trace_graph_entry          |  -26
  trace_function             |  -45
  __ftrace_trace_stack       |  -27
  ftrace_trace_userstack     |  -29
  tracing_sched_switch_trace |  -66
  tracing_stop               |   +1
  trace_seq_to_user          |   -1
  ftrace_trace_special       |  -63
  ftrace_special             |   +1
  tracing_sched_wakeup_trace |  -70
  tracing_reset_online_cpus  |   -1
 13 functions changed, 2 bytes added, 355 bytes removed, diff: -353

linux-2.6-tip/block/blktrace.c:
  __blk_add_trace |  -58
 1 function changed, 58 bytes removed, diff: -58

linux-2.6-tip/kernel/trace/trace.c:
  trace_buffer_lock_reserve  |  +88
  trace_buffer_unlock_commit |  +86
 2 functions changed, 174 bytes added, diff: +174

/tmp/vmlinux.after:
 16 functions changed, 176 bytes added, 413 bytes removed, diff: -237

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-06 01:01:41 +01:00
Arnaldo Carvalho de Melo
0a9877514c ring_buffer: remove unused flags parameter
Impact: API change, cleanup

>From ring_buffer_{lock_reserve,unlock_commit}.

$ codiff /tmp/vmlinux.before /tmp/vmlinux.after
linux-2.6-tip/kernel/trace/trace.c:
  trace_vprintk              |  -14
  trace_graph_return         |  -14
  trace_graph_entry          |  -10
  trace_function             |   -8
  __ftrace_trace_stack       |   -8
  ftrace_trace_userstack     |   -8
  tracing_sched_switch_trace |   -8
  ftrace_trace_special       |  -12
  tracing_sched_wakeup_trace |   -8
 9 functions changed, 90 bytes removed, diff: -90

linux-2.6-tip/block/blktrace.c:
  __blk_add_trace |   -1
 1 function changed, 1 bytes removed, diff: -1

/tmp/vmlinux.after:
 10 functions changed, 91 bytes removed, diff: -91

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-06 01:01:40 +01:00
Ingo Molnar
9d45cf9e36 Merge branch 'x86/urgent' into x86/apic
Conflicts:
	arch/x86/mach-default/setup.c

Semantic merge:
	arch/x86/kernel/irqinit_32.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 22:30:01 +01:00
Johannes Weiner
777c6c5f1f wait: prevent exclusive waiter starvation
With exclusive waiters, every process woken up through the wait queue must
ensure that the next waiter down the line is woken when it has finished.

Interruptible waiters don't do that when aborting due to a signal.  And if
an aborting waiter is concurrently woken up through the waitqueue, noone
will ever wake up the next waiter.

This has been observed with __wait_on_bit_lock() used by
lock_page_killable(): the first contender on the queue was aborting when
the actual lock holder woke it up concurrently.  The aborted contender
didn't acquire the lock and therefor never did an unlock followed by
waking up the next waiter.

Add abort_exclusive_wait() which removes the process' wait descriptor from
the waitqueue, iff still queued, or wakes up the next waiter otherwise.
It does so under the waitqueue lock.  Racing with a wake up means the
aborting process is either already woken (removed from the queue) and will
wake up the next waiter, or it will remove itself from the queue and the
concurrent wake up will apply to the next waiter after it.

Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
__wait_on_bit_lock() when they were interrupted by other means than a wake
up through the queue.

[akpm@linux-foundation.org: coding-style fixes]
Reported-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Mentored-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chuck Lever <cel@citi.umich.edu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>		["after some testing"]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 12:56:48 -08:00
Andrew Morton
60fd760fb9 revert "rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY"
Revert commit 0c2d64fb6c because it causes
(arguably poorly designed) existing userspace to spend interminable
periods closing billions of not-open file descriptors.

We could bring this back, with some sort of opt-in tunable in /proc, which
defaults to "off".

Peter's alanysis follows:

: I spent several hours trying to get to the bottom of a serious
: performance issue that appeared on one of our servers after upgrading to
: 2.6.28.  In the end it's what could be considered a userspace bug that
: was triggered by a change in 2.6.28.  Since this might also affect other
: people I figured I'd at least document what I found here, and maybe we
: can even do something about it:
:
:
: So, I upgraded some of debian.org's machines to 2.6.28.1 and immediately
: the team maintaining our ftp archive complained that one of their
: scripts that previously ran in a few minutes still hadn't even come
: close to being done after an hour or so.  Downgrading to 2.6.27 fixed
: that.
:
: Turns out that script is forking a lot and something in it or python or
: whereever closes all the file descriptors it doesn't want to pass on.
: That is, it starts at zero and goes up to ulimit -n/RLIMIT_NOFILE and
: closes them all with a few exceptions.
:
: Turns out that takes a long time when your limit -n is now 2^20 (1048576).
:
: With 2.6.27.* the ulimit -n was the standard 1024, but with 2.6.28 it is
: now a thousand times that.
:
: 2.6.28 included a patch titled "rlimit: permit setting RLIMIT_NOFILE to
: RLIM_INFINITY" (0c2d64fb6c)[1] that
: allows, as the title implies, to set the limit for number of files to
: infinity.
:
: Closer investigation showed that the broken default ulimit did not apply
: to "system" processes (like stuff started from init).  In the end I
: could establish that all processes that passed through pam_limit at one
: point had the bad resource limit.
:
: Apparently the pam library in Debian etch (4.0) initializes the limits
: to some default values when it doesn't have any settings in limit.conf
: to override them.  Turns out that for nofiles this is RLIM_INFINITY.
: Commenting out "case RLIMIT_NOFILE" in pam_limit.c:267 of our pam
: package version 0.79-5 fixes that - tho I'm not sure what side effects
: that has.
:
: Debian lenny (the upcoming 5.0 version) doesn't have this issue as it
: uses a different pam (version).

Reported-by: Peter Palfrader <weasel@debian.org>
Cc: Adam Tkac <vonsch@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <stable@kernel.org>		[2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 12:56:47 -08:00
Andrew Morton
58763a2974 kernel/async.c: fix printk warnings
alpha:

kernel/async.c: In function 'run_one_entry':
kernel/async.c:141: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
kernel/async.c:149: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
kernel/async.c:149: warning: format '%lld' expects type 'long long int', but argument 4 has type 's64'
kernel/async.c: In function 'async_synchronize_cookie_special':
kernel/async.c:250: warning: format '%lli' expects type 'long long int', but argument 3 has type 's64'

Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 12:56:46 -08:00
Mandeep Singh Baines
94be52dc07 softlockup: convert read_lock in hung_task to rcu_read_lock
Since the tasklist is protected by rcu list operations, it is safe
to convert the read_lock()s to rcu_read_lock().

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 19:55:31 +01:00
Mandeep Singh Baines
ce9dbe244b softlockup: check all tasks in hung_task
Impact: extend the scope of hung-task checks

Changed the default value of hung_task_check_count to PID_MAX_LIMIT.
hung_task_batch_count added to put an upper bound on the critical
section. Every hung_task_batch_count checks, the rcu lock is never
held for a too long time.

Keeping the critical section small minimizes time preemption is disabled
and keeps rcu grace periods small.

To prevent following a stale pointer, get_task_struct is called on g and t.
To verify that g and t have not been unhashed while outside the critical
section, the task states are checked.

The design was proposed by Frédéric Weisbecker.

Signed-off-by: Mandeep Singh Baines <msb@google.com>
Suggested-by: Frédéric Weisbecker <fweisbec@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 19:54:57 +01:00
Steven Rostedt
dac7494028 trace: code style clean up
Ingo Molnar suggested using goto logic to keep the indentation
down and to be able to remove the nasty line breaks. This actually
makes the code a bit more readable.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 14:35:47 +01:00
Arnaldo Carvalho de Melo
7be421510b trace: Remove unused trace_array_cpu parameter
Impact: cleanup

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 14:35:47 +01:00
Arnaldo Carvalho de Melo
97e5b191ae trace_branch: Remove unused function
Impact: cleanup

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 14:35:46 +01:00
Arnaldo Carvalho de Melo
268ccda0cb trace: assign defaults at register_ftrace_event
Impact: simplification of tracers

As all tracers are doing this we might as well do it in
register_ftrace_event and save one branch each time we call these
callbacks.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 14:35:46 +01:00
Peter Zijlstra
4cd4c1b40d timers: split process wide cpu clocks/timers
Change the process wide cpu timers/clocks so that we:

 1) don't mess up the kernel with too many threads,
 2) don't have a per-cpu allocation for each process,
 3) have no impact when not used.

In order to accomplish this we're going to split it into two parts:

 - clocks; which can take all the time they want since they run
           from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)

 - timers; which need constant time sampling but since they're
           explicity used, the user can pay the overhead.

The clock readout will go back to a full sum of the thread group, while the
timers will run of a global 'clock' that only runs when needed, so only
programs that make use of the facility pay the price.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 13:04:33 +01:00
Peter Zijlstra
32bd671d6c signal: re-add dead task accumulation stats.
We're going to split the process wide cpu accounting into two parts:

 - clocks; which can take all the time they want since they run
           from user context.

 - timers; which need constant time tracing but can affort the overhead
           because they're default off -- and rare.

The clock readout will go back to a full sum of the thread group, for this
we need to re-add the exit stats that were removed in the initial itimer
rework (f06febc9: timers: fix itimer/many thread hang).

Furthermore, since that full sum can be rather slow for large thread groups
and we have the complete dead task stats, revert the do_notify_parent time
computation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-05 13:04:33 +01:00
Linus Torvalds
647802d6db Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: APIC: enable workaround on AMD Fam10h CPUs
  xen: disable interrupts before saving in percpu
  x86: add x86@kernel.org to MAINTAINERS
  x86: push old stack address on irqstack for unwinder
  irq, x86: fix lock status with numa_migrate_irq_desc
  x86: add cache descriptors for Intel Core i7
  x86/Voyager: make it build and boot
2009-02-04 13:58:50 -08:00
Suresh Siddha
483b4ee60e sched: fix nohz load balancer on cpu offline
Christian Borntraeger reports:

> After a logical cpu offline, even on a complete idle system, there
> is one cpu with full ticks. It turns out that nohz.cpu_mask has the
> the offlined cpu still set.
>
> In select_nohz_load_balancer() we check if the system is completely
> idle to turn of load balancing. We compare cpu_online_map with
> nohz.cpu_mask.  Since cpu_online_map is updated on cpu unplug,
> but nohz.cpu_mask is not, the check fails and the scheduler believes
> that we need an "idle load balancer" even on a fully idle system.
> Since the ilb cpu does not deactivate the timer tick this breaks NOHZ.

Fix the select_nohz_load_balancer() to not set the nohz.cpu_mask
while a cpu is going offline.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-04 22:31:19 +01:00
Arnaldo Carvalho de Melo
ae7462b4f1 trace: make the trace_event callbacks return enum print_line_t
As they actually all return these enumerators.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-04 20:48:39 +01:00
Arnaldo Carvalho de Melo
d9793bd801 trace: judicious error checking of trace_seq results
Impact: bugfix and cleanup

Some callsites were returning either TRACE_ITER_PARTIAL_LINE if the
trace_seq routines (trace_seq_printf, etc) returned 0 meaning its buffer
was full, or zero otherwise.

But...

/* Return values for print_line callback */
enum print_line_t {
        TRACE_TYPE_PARTIAL_LINE = 0,    /* Retry after flushing the seq */
        TRACE_TYPE_HANDLED      = 1,
        TRACE_TYPE_UNHANDLED    = 2     /* Relay to other output functions */
};

In other cases the return value was not being relayed at all.

Most of the time it didn't hurt because the page wasn't get filled, but
for correctness sake, handle the return values everywhere.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-04 20:48:30 +01:00
Ingo Molnar
ce70a0b472 Merge branches 'tracing/blktrace', 'tracing/ftrace', 'tracing/urgent' and 'linus' into tracing/core 2009-02-04 20:45:41 +01:00
Ingo Molnar
bb960a1e42 Merge branch 'core/xen' into x86/urgent 2009-02-04 14:54:56 +01:00
Oleg Nesterov
229c4ef8ae ftrace: do_each_pid_task() needs rcu lock
"ftrace: use struct pid" commit 978f3a45d9
converted ftrace_pid_trace to "struct pid*".

But we can't use do_each_pid_task() without rcu_read_lock() even if
we know the pid itself can't go away (it was pinned in ftrace_pid_write).
The exiting task can detach itself from this pid at any moment.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-03 22:50:58 +01:00
Arnaldo Carvalho de Melo
2c9b238eb3 trace: Change struct trace_event callbacks parameter list
Impact: API change

The trace_seq and trace_entry are in trace_iterator, where there are
more fields that may be needed by tracers, so just pass the
tracer_iterator as is already the case for struct tracer->print_line.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-03 14:03:52 +01:00
Frederic Weisbecker
c4a8e8be2d trace: better manage the context info for events
Impact: make trace_event more convenient for tracers

All tracers (for the moment) that use the struct trace_event want to
have the context info printed before their own output: the pid/cmdline,
cpu, and timestamp.

But some other tracers that want to implement their trace_event
callbacks will not necessary need these information or they may want to
format them as they want.

This patch adds a new default-enabled trace option:
TRACE_ITER_CONTEXT_INFO When disabled through:

echo nocontext-info > /debugfs/tracing/trace_options

The pid, cpu and timestamps headers will not be printed.

IE with the sched_switch tracer with context-info (default):

     bash-2935 [001] 100.356561: 2935:120:S ==> [001]  0:140:R <idle>
   <idle>-0    [000] 100.412804:    0:140:R   + [000] 11:115:S events/0
   <idle>-0    [000] 100.412816:    0:140:R ==> [000] 11:115:R events/0
 events/0-11   [000] 100.412829:   11:115:S ==> [000]  0:140:R <idle>

Without context-info:

 2935:120:S ==> [001]  0:140:R <idle>
    0:140:R   + [000] 11:115:S events/0
    0:140:R ==> [000] 11:115:R events/0
   11:115:S ==> [000]  0:140:R <idle>

A tracer can disable it at runtime by clearing the bit
TRACE_ITER_CONTEXT_INFO in trace_flags.

The print routines were renamed to trace_print_context and
trace_print_lat_context, so that they can be used by tracers if they
want to use them for one of the trace_event callbacks.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-03 14:03:52 +01:00
David S. Miller
b3ff29d2cc Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/Kconfig
2009-02-03 00:15:35 -08:00
Steven Rostedt
79fb0768fb trace: let boot trace be chosen by command line
Now that we have a working ftrace=<tracer> function, make the boot
tracer get activated by it. This way we can turn it on or off without
recompiling the kernel, as well as keeping the selftests on. The
selftests are disabled whenever a default tracer starts running.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-03 06:26:12 +01:00
Steven Rostedt
b2821ae68b trace: fix default boot up tracer
Peter Zijlstra started the functionality to start up a default
tracing at bootup. This patch finishes the work.

Now if you add 'ftrace=<tracer>' to the command line, when that tracer
is registered on bootup, that tracer is selected and starts tracing.

Note, all selftests for tracers that are registered after this tracer
is disabled. This prevents the selftests from disturbing the running
tracer, or the running tracer from disturbing the selftest.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-03 06:26:12 +01:00
Ingo Molnar
dc573f9b20 Merge branches 'tracing/ftrace', 'tracing/kmemtrace' and 'linus' into tracing/core 2009-02-03 06:25:38 +01:00
Linus Torvalds
31c952dcf8 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched_rt: don't use first_cpu on cpumask created with cpumask_and
  sched: fix buddie group latency
  sched: clear buddies more aggressively
  sched: symmetric sync vs avg_overlap
  sched: fix sync wakeups
  cpuset: fix possible deadlock in async_rebuild_sched_domains
2009-02-02 19:26:29 -08:00
Eric Dumazet
720eba31f4 modules: Use a better scheme for refcounting
Current refcounting for modules (done if CONFIG_MODULE_UNLOAD=y) is
using a lot of memory.

Each 'struct module' contains an [NR_CPUS] array of full cache lines.

This patch uses existing infrastructure (percpu_modalloc() &
percpu_modfree()) to allocate percpu space for the refcount storage.

Instead of wasting NR_CPUS*128 bytes (on i386), we now use
nr_cpu_ids*sizeof(local_t) bytes.

On a typical distro, where NR_CPUS=8, shiping 2000 modules, we reduce
size of module files by about 2 Mbytes. (1Kb per module)

Instead of having all refcounters in the same memory node - with TLB misses
because of vmalloc() - this new implementation permits to have better
NUMA properties, since each  CPU will use storage on its preferred node,
thanks to percpu storage.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-02 19:17:55 -08:00
Yinghai Lu
10b888d6ce irq, x86: fix lock status with numa_migrate_irq_desc
Eric Paris reported:

> I have an hp dl785g5 which is unable to successfully run
> 2.6.29-0.66.rc3.fc11.x86_64 or 2.6.29-rc2-next-20090126.  During bootup
> (early in userspace daemons starting) I get the below BUG, which quickly
> renders the machine dead.  I assume it is because sparse_irq_lock never
> gets released when the BUG kills that task.

Adjust lock sequence when migrating a descriptor with
CONFIG_NUMA_MIGRATE_IRQ_DESC enabled.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01 11:36:31 +01:00
Rusty Russell
3d398703ef sched_rt: don't use first_cpu on cpumask created with cpumask_and
cpumask_and() only initializes nr_cpu_ids bits, so the (deprecated)
first_cpu() might find one of those uninitialized bits if nr_cpu_ids
is less than NR_CPUS (as it can be for CONFIG_CPUMASK_OFFSTACK).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01 10:49:52 +01:00
Peter Zijlstra
a571bbeafb sched: fix buddie group latency
Similar to the previous patch, by not clearing buddies we can select entities
past their run quota, which can increase latency. This means we have to clear
group buddies as well.

Do not use the group clear for pick_next_task(), otherwise that'll get O(n^2).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01 10:49:51 +01:00
Mike Galbraith
a9f3e2b549 sched: clear buddies more aggressively
It was noticed that a task could get re-elected past its run quota due to buddy
affinities. This could increase latency a little. Cure it by more aggresively
clearing buddy state.

We do so in two situations:
 - when we force preempt
 - when we select a buddy to run

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01 10:49:50 +01:00
Peter Zijlstra
1596e29773 sched: symmetric sync vs avg_overlap
Reinstate the weakening of the sync hint if set. This yields a more
symmetric usage of avg_overlap.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01 10:49:49 +01:00
Peter Zijlstra
d942fb6c7d sched: fix sync wakeups
Pawel Dziekonski reported that the openssl benchmark and his
quantum chemistry application both show slowdowns due to the
scheduler under-parallelizing execution.

The reason are pipe wakeups still doing 'sync' wakeups which
overrides the normal buddy wakeup logic - even if waker and
wakee are loosely coupled.

Fix an inversion of logic in the buddy wakeup code.

Reported-by: Pawel Dziekonski <dzieko@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-01 10:49:06 +01:00
Stephen Hemminger
eefef1cf76 net: add ARP notify option for devices
This adds another inet device option to enable gratuitous ARP
when device is brought up or address change. This is handy for
clusters or virtualization.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-02-01 01:04:33 -08:00
Linus Torvalds
1347e965f5 Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  generic-ipi: use per cpu data for single cpu ipi calls
  cpumask: convert lib/smp_processor_id to new cpumask ops
  signals, debug: fix BUG: using smp_processor_id() in preemptible code in print_fatal_signal()
2009-01-31 15:55:05 -08:00
Linus Torvalds
ac56b94f80 Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  irq: export __set_irq_handler() and handle_level_irq()
2009-01-31 15:54:30 -08:00
Linus Torvalds
5b2d3e6d54 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimer: prevent negative expiry value after clock_was_set()
  hrtimers: allow the hot-unplugging of all cpus
  hrtimers: increase clock min delta threshold while interrupt hanging
2009-01-31 15:54:06 -08:00
Linus Torvalds
f6490438fc Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, ds, bts: cleanup/fix DS configuration
  ring-buffer: reset timestamps when ring buffer is reset
  trace: set max latency variable to zero on default
  trace: stop all recording to ring buffer on ftrace_dump
  trace: print ftrace_dump at KERN_EMERG log level
  ring_buffer: reset write when reserve buffer fail
  tracing/function-graph-tracer: fix a regression while suspend to disk
  ring-buffer: fix alignment problem
2009-01-31 15:53:30 -08:00
Thomas Gleixner
b0a9b5111a hrtimer: prevent negative expiry value after clock_was_set()
Impact: prevent false positive WARN_ON() in clockevents_program_event()

clock_was_set() changes the base->offset of CLOCK_REALTIME and
enforces the reprogramming of the clockevent device to expire timers
which are based on CLOCK_REALTIME. If the clock change is large enough
then the subtraction of the timer expiry value and base->offset can
become negative which triggers the warning in
clockevents_program_event().

Check the subtraction result and set a negative value to 0.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-01-30 22:35:34 +01:00
Sebastien Dugue
94df7de028 hrtimers: allow the hot-unplugging of all cpus
Impact: fix CPU hotplug hang on Power6 testbox

On architectures that support offlining all cpus (at least powerpc/pseries),
hot-unpluging the tick_do_timer_cpu can result in a system hang.

This comes from the fact that if the cpu going down happens to be the
cpu doing the tick, then as the tick_do_timer_cpu handover happens after the
cpu is dead (via the CPU_DEAD notification), we're left without ticks,
jiffies are frozen and any task relying on timers (msleep, ...) is stuck.
That's particularly the case for the cpu looping in __cpu_die() waiting
for the dying cpu to be dead.

This patch addresses this by having the tick_do_timer_cpu handover happen
earlier during the CPU_DYING notification. For this, a new clockevent
notification type is introduced (CLOCK_EVT_NOTIFY_CPU_DYING) which is triggered
in hrtimer_cpu_notify().

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-30 22:35:29 +01:00
Frederic Weisbecker
7f22391cbe hrtimers: increase clock min delta threshold while interrupt hanging
Impact: avoid timer IRQ hanging slow systems

While using the function graph tracer on a virtualized system, the
hrtimer_interrupt can hang the system on an infinite loop.

This can be caused in several situations:

 - the hardware is very slow and HZ is set too high

 - something intrusive is slowing the system down (tracing under emulation)

... and the next clock events to program are always before the current time.

This patch implements a reasonable compromise: if such a situation is
detected, we share the CPUs time in 1/4 to process the hrtimer interrupts.
This is enough to let the system running without serious starvation.

It has been successfully tested under VirtualBox with 1000 HZ and 100 HZ
with function graph tracer launched. On both cases, the clock events were
increased until about 25 ms periodic ticks, which means 40 HZ.

So we change a hard to debug hang into a warning message and a system that
still manages to limp along.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-30 22:35:10 +01:00
Steven Rostedt
d7240b9880 generic-ipi: use per cpu data for single cpu ipi calls
The smp_call_function can be passed a wait parameter telling it to
wait for all the functions running on other CPUs to complete before
returning, or to return without waiting. Unfortunately, this is
currently just a suggestion and not manditory. That is, the
smp_call_function can decide not to return and wait instead.

The reason for this is because it uses kmalloc to allocate storage
to send to the called CPU and that CPU will free it when it is done.
But if we fail to allocate the storage, the stack is used instead.
This means we must wait for the called CPU to finish before
continuing.

Unfortunatly, some callers do no abide by this hint and act as if
the non-wait option is mandatory. The MTRR code for instance will
deadlock if the smp_call_function is set to wait. This is because
the smp_call_function will wait for the other CPUs to finish their
called functions, but those functions are waiting on the caller to
continue.

This patch changes the generic smp_call_function code to use per cpu
variables if the allocation of the data fails for a single CPU call. The
smp_call_function_many will fall back to the smp_call_function_single
if it fails its alloc. The smp_call_function_single is modified
to not force the wait state.

Since we now are using a single data per cpu we must synchronize the
callers to prevent a second caller modifying the data before the
first called IPI functions complete. To do so, I added a flag to
the call_single_data called CSD_FLAG_LOCK. When the single CPU is
called (which can be called when a many call fails an alloc), we
set the LOCK bit on this per cpu data. When the caller finishes
it clears the LOCK bit.

The caller must wait till the LOCK bit is cleared before setting
it. When it is cleared, there is no IPI function using it.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-30 18:31:08 +01:00
Ingo Molnar
c43e0e46ad Merge branch 'linus' into core/percpu
Conflicts:
	kernel/irq/handle.c
2009-01-30 18:23:30 +01:00
Randy Dunlap
ecf441b593 kmemtrace: fix printk formats, fix
Geert Uytterhoeven wrote:

> %4zu?

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-30 16:12:33 +01:00
Paul Menage
839ec5452e cgroup: fix root_count when mount fails due to busy subsystem
root_count was being incremented in cgroup_get_sb() after all error
checking was complete, but decremented in cgroup_kill_sb(), which can be
called on a superblock that we gave up on due to an error.  This patch
changes cgroup_kill_sb() to only decrement root_count if the root was
previously linked into the list of roots.

Signed-off-by: Paul Menage <menage@google.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:45 -08:00
Paul Menage
804b3c28a4 cgroups: add cpu_relax() calls in css_tryget() and cgroup_clear_css_refs()
css_tryget() and cgroup_clear_css_refs() contain polling loops; these
loops should have cpu_relax calls in them to reduce cross-cache traffic.

Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:45 -08:00
Li Zefan
1404f06565 cgroups: fix lock inconsistency in cgroup_clone()
I fixed a bug in cgroup_clone() in Linus' tree in commit 7b574b7
("cgroups: fix a race between cgroup_clone and umount") without noticing
there was a cleanup patch in -mm tree that should be rebased (now commit
104cbd5, "cgroups: use task_lock() for access tsk->cgroups safe in
cgroup_clone()"), thus resulted in lock inconsistency.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:45 -08:00
KAMEZAWA Hiroyuki
baef99a08a cgroups: use hierarchy mutex in creation failure path
Now, cgrp->sibling is handled under hierarchy mutex.
error route should do so, too.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:43 -08:00
Arnaldo Carvalho de Melo
b3a8c34886 trace_sched_wakeup: Remove unused variable
Impact: cleanup

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-29 14:31:03 +01:00
Arnaldo Carvalho de Melo
f04109bf1b trace: Use tracing_reset_online_cpus in more places
Impact: cleanup

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-29 14:28:31 +01:00
Ingo Molnar
6a385db5ce Merge branch 'core/percpu' into x86/core
Conflicts:
	kernel/irq/handle.c
2009-01-28 23:12:55 +01:00
David Daney
97179fd46d cpumask fallout: Initialize irq_default_affinity earlier
Move the initialization of irq_default_affinity to early_irq_init as
core_initcall is too late.

irq_default_affinity can be used in init_IRQ and potentially timer and
SMP init as well.  All of these happen before core_initcall.  Moving
the initialization to early_irq_init ensures that it is initialized
before it is used.

Signed-off-by: David Daney <ddaney@caviumnetworks.com>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-27 16:06:55 -08:00
David Daney
1267a8df20 Make irq_*_affinity depend on CONFIG_GENERIC_HARDIRQS too.
In interrupt.h these functions are declared only if
CONFIG_GENERIC_HARDIRQS is set.  We should define them under identical
conditions.

Signed-off-by: David Daney <ddaney@caviumnetworks.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-27 16:06:49 -08:00
Linus Torvalds
490a8d70cd Merge branch 'hibern_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'hibern_fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
  SATA PIIX: Blacklist system that spins off disks during ACPI power off
  SATA Sil: Blacklist system that spins off disks during ACPI power off
  SATA AHCI: Blacklist system that spins off disks during ACPI power off
  SATA: Blacklisting of systems that spin off disks during ACPI power off
  DMI: Introduce dmi_first_match to make the interface more flexible
  Hibernation: Introduce system_entering_hibernation
2009-01-27 07:50:41 -08:00
Ingo Molnar
4a66a82be7 Merge branches 'tracing/blktrace', 'tracing/kmemtrace' and 'tracing/urgent' into tracing/core 2009-01-27 14:30:57 +01:00
Ingo Molnar
3ddeb51d9c Merge branch 'linus' into core/percpu
Conflicts:
	arch/x86/kernel/setup_percpu.c
2009-01-27 12:01:51 +01:00
Rafael J. Wysocki
abfe2d7b91 Hibernation: Introduce system_entering_hibernation
Introduce boolean function system_entering_hibernation() returning
'true' during the last phase of hibernation, in which devices are
being put into low power states and the sleep state (for example,
ACPI S4) is finally entered.

Some device drivers need such a function to check if the system is
in the final phase of hibernation.  In particular, some SATA drivers
are going to use it for blacklisting systems in which the disks
should not be spun down during the last phase of hibernation (the
BIOS will do that anyway).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
2009-01-27 02:15:45 -05:00
Ed Swierk
3a9f84d354 signals, debug: fix BUG: using smp_processor_id() in preemptible code in print_fatal_signal()
With print-fatal-signals=1 on a kernel with CONFIG_PREEMPT=y, sending an
unexpected signal to a process causes a BUG: using smp_processor_id() in
preemptible code.

get_signal_to_deliver() releases the siglock before calling
print_fatal_signal(), which calls show_regs(), which calls
smp_processor_id(), which is not supposed to be called from a
preemptible thread.

Make sure show_regs() runs with preemption disabled.

Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-27 00:36:19 +01:00
Linus Torvalds
2034563ca3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild-fixes:
  kbuild: fix kbuild.txt typos
  kbuild: print usage with no arguments in scripts/config
  Revert "kbuild: strip generated symbols from *.ko"
2009-01-26 15:10:37 -08:00
Linus Torvalds
37f5fed555 Merge branch 'sh/for-2.6.29' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6
* 'sh/for-2.6.29' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (22 commits)
  dma-coherent: Restore dma_alloc_from_coherent() large alloc fall back policy.
  dma-coherent: per-device coherent area is in pages, not bytes.
  sh: fix unaligned and nonexistent address handling
  nommu: Stub in vm_map_ram()/vm_unmap_ram()/vm_unmap_aliases().
  sh: fix sh-sci / early printk build on sh7723
  sh: export the sh7343 JPU to user space
  sh: update defconfigs.
  serial: sh-sci: Fix up SH7720/SH7721 SCI build.
  sh: Kill off obsolete busses from arch/sh/Kconfig.
  sh: sh7785lcr/highlander/hp6xx need linux/irq.h.
  sh: Migo-R MMC support using spi_gpio and mmc_spi.
  sh: ap325rxa MMC support using spi_gpio and mmc_spi
  sh: mach-x3proto: needs linux/irq.h.
  sh: Drop the BKL from sys_execve() on SH-5.
  sh: convert rsk7203 to use smsc911x.
  sh: convert magicpanelr2 platform to use smsc911x.
  sh: convert ap325rxa platform to use smsc911x.
  sh: mach-migor: Add tw9910 support.
  sh: mach-migor: Delete soc_camera_platform setup.
  sh: mach-migor: Add ov772x support.
  ...
2009-01-26 10:12:08 -08:00
Linus Torvalds
3386c05bdb Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  debugobjects: add and use INIT_WORK_ON_STACK
  rcu: remove duplicate CONFIG_RCU_CPU_STALL_DETECTOR
  relay: fix lock imbalance in relay_late_setup_files
  oprofile: fix uninitialized use of struct op_entry
  rcu: move Kconfig menu
  softlock: fix false panic which can occur if softlockup_thresh is reduced
  rcu: add __cpuinit to rcu_init_percpu_data()
2009-01-26 09:47:56 -08:00
Linus Torvalds
1e70c7f7a9 Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  hrtimers: fix inconsistent lock state on resume in hres_timers_resume
  time-sched.c: tick_nohz_update_jiffies should be static
  locking, hpet: annotate false positive warning
  kernel/fork.c: unused variable 'ret'
  itimers: remove the per-cpu-ish-ness
2009-01-26 09:47:43 -08:00
Linus Torvalds
810ee58de2 Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
  xen: unitialised return value in xenbus_write_transaction
  x86: fix section mismatch warning
  x86: unmask CPUID levels on Intel CPUs, fix
  x86: work around PAGE_KERNEL_WC not getting WC in iomap_atomic_prot_pfn.
  x86: use standard PIT frequency
  xen: handle highmem pages correctly when shrinking a domain
  x86, mm: fix pte_free()
  xen: actually release memory when shrinking domain
  x86: unmask CPUID levels on Intel CPUs
  x86: add MSR_IA32_MISC_ENABLE bits to <asm/msr-index.h>
  x86: fix PTE corruption issue while mapping RAM using /dev/mem
  x86: mtrr fix debug boot parameter
  x86: fix page attribute corruption with cpa()
  Revert "x86: signal: change type of paramter for sys_rt_sigreturn()"
  x86: use early clobbers in usercopy*.c
  x86: remove kernel_physical_mapping_init() from init section
  fix: crash: IP: __bitmap_intersects+0x48/0x73
  cpufreq: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
  work_on_cpu: Use our own workqueue.
  work_on_cpu: don't try to get_online_cpus() in work_on_cpu.
  ...
2009-01-26 09:47:28 -08:00
Arnaldo Carvalho de Melo
c71a896154 blktrace: add ftrace plugin
Impact: New way of using the blktrace infrastructure

This drops the requirement of userspace utilities to use the blktrace
facility.

Configuration is done thru sysfs, adding a "trace" directory to the
partition directory where blktrace can be enabled for the associated
request_queue.

The same filters present in the IOCTL interface are present as sysfs
device attributes.

The /sys/block/sdX/sdXN/trace/enable file allows tracing without any
filters.

The other files in this directory: pid, act_mask, start_lba and end_lba
can be used with the same meaning as with the IOCTL interface.

Using the sysfs interface will only setup the request_queue->blk_trace
fields, tracing will only take place when the "blk" tracer is selected
via the ftrace interface, as in the following example:

To see the trace, one can use the /d/tracing/trace file or the
/d/tracign/trace_pipe file, with semantics defined in the ftrace
documentation in Documentation/ftrace.txt.

[root@f10-1 ~]# cat /t/trace
       kjournald-305   [000]  3046.491224:   8,1    A WBS 6367 + 8 <- (8,1) 6304
       kjournald-305   [000]  3046.491227:   8,1    Q   R 6367 + 8 [kjournald]
       kjournald-305   [000]  3046.491236:   8,1    G  RB 6367 + 8 [kjournald]
       kjournald-305   [000]  3046.491239:   8,1    P  NS [kjournald]
       kjournald-305   [000]  3046.491242:   8,1    I RBS 6367 + 8 [kjournald]
       kjournald-305   [000]  3046.491251:   8,1    D  WB 6367 + 8 [kjournald]
       kjournald-305   [000]  3046.491610:   8,1    U  WS [kjournald] 1
          <idle>-0     [000]  3046.511914:   8,1    C  RS 6367 + 8 [6367]
[root@f10-1 ~]#

The default line context (prefix) format is the one described in the ftrace
documentation, with the blktrace specific bits using its existing format,
described in blkparse(8).

If one wants to have the classic blktrace formatting, this is possible by
using:

[root@f10-1 ~]# echo blk_classic > /t/trace_options
[root@f10-1 ~]# cat /t/trace
  8,1    0  3046.491224   305  A WBS 6367 + 8 <- (8,1) 6304
  8,1    0  3046.491227   305  Q   R 6367 + 8 [kjournald]
  8,1    0  3046.491236   305  G  RB 6367 + 8 [kjournald]
  8,1    0  3046.491239   305  P  NS [kjournald]
  8,1    0  3046.491242   305  I RBS 6367 + 8 [kjournald]
  8,1    0  3046.491251   305  D  WB 6367 + 8 [kjournald]
  8,1    0  3046.491610   305  U  WS [kjournald] 1
  8,1    0  3046.511914     0  C  RS 6367 + 8 [6367]
[root@f10-1 ~]#

Using the ftrace standard format allows more flexibility, such
as the ability of asking for backtraces via trace_options:

[root@f10-1 ~]# echo noblk_classic > /t/trace_options
[root@f10-1 ~]# echo stacktrace > /t/trace_options

[root@f10-1 ~]# cat /t/trace
       kjournald-305   [000]  3318.826779:   8,1    A WBS 6375 + 8 <- (8,1) 6312
       kjournald-305   [000]  3318.826782:
 <= submit_bio
 <= submit_bh
 <= sync_dirty_buffer
 <= journal_commit_transaction
 <= kjournald
 <= kthread
 <= child_rip
       kjournald-305   [000]  3318.826836:   8,1    Q   R 6375 + 8 [kjournald]
       kjournald-305   [000]  3318.826837:
 <= generic_make_request
 <= submit_bio
 <= submit_bh
 <= sync_dirty_buffer
 <= journal_commit_transaction
 <= kjournald
 <= kthread

Please read the ftrace documentation to use aditional, standardized
tracing filters such as /d/tracing/trace_cpumask, etc.

See also /d/tracing/trace_mark to add comments in the trace stream,
that is equivalent to the /d/block/sdaN/msg interface.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-26 14:40:53 +01:00
Arnaldo Carvalho de Melo
9011262a37 ftrace: add ftrace_vprintk
Impact: new helper function

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-26 14:40:53 +01:00
Randy Dunlap
cc2f6d90e9 kmemtrace: fix printk format warnings
Fix kmemtrace printk warnings:

  kernel/trace/kmemtrace.c:142: warning: format '%4ld' expects type 'long int', but argument 3 has type 'size_t'
  kernel/trace/kmemtrace.c:147: warning: format '%4ld' expects type 'long int', but argument 3 has type 'size_t'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-26 14:03:51 +01:00
Ingo Molnar
5ce1b1ed27 Merge branches 'tracing/ftrace' and 'tracing/function-graph-tracer' into tracing/core 2009-01-26 14:01:52 +01:00
Thomas Gleixner
6626bff245 hrtimer: prevent negative expiry value after clock_was_set()
Impact: prevent false positive WARN_ON() in clockevents_program_event()

clock_was_set() changes the base->offset of CLOCK_REALTIME and
enforces the reprogramming of the clockevent device to expire timers
which are based on CLOCK_REALTIME. If the clock change is large enough
then the subtraction of the timer expiry value and base->offset can
become negative which triggers the warning in
clockevents_program_event(). 

Check the subtraction result and set a negative value to 0.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-01-25 11:54:57 +01:00
Thomas Gleixner
5b74f9e0e0 Merge branch 'linus' into timers/hrtimers 2009-01-25 11:54:33 +01:00
Frederic Weisbecker
9005f3ebeb tracing/function-graph-tracer: various fixes and features
This patch brings various bugfixes:

- Drop the first irrelevant task switch on the very beginning of a trace.

- Drop the OVERHEAD word from the headers, the DURATION word is sufficient
  and will not overlap other columns.

- Make the headers fit well their respective columns whatever the
  selected options.

Ie, default options:

 # tracer: function_graph
 #
 # CPU  DURATION                  FUNCTION CALLS
 # |     |   |                     |   |   |   |

  1)   0.646 us    |                    }
  1)               |                    mem_cgroup_del_lru_list() {
  1)   0.624 us    |                      lookup_page_cgroup();
  1)   1.970 us    |                    }

 echo funcgraph-proc > trace_options

 # tracer: function_graph
 #
 # CPU  TASK/PID        DURATION                  FUNCTION CALLS
 # |    |    |           |   |                     |   |   |   |

  0)   bash-2937    |   0.895 us    |                }
  0)   bash-2937    |   0.888 us    |                __rcu_read_unlock();
  0)   bash-2937    |   0.864 us    |                conv_uni_to_pc();
  0)   bash-2937    |   1.015 us    |                __rcu_read_lock();

 echo nofuncgraph-cpu > trace_options
 echo nofuncgraph-proc > trace_options

 # tracer: function_graph
 #
 #   DURATION                  FUNCTION CALLS
 #    |   |                     |   |   |   |

   3.752 us    |                  native_pud_val();
   0.616 us    |                  native_pud_val();
   0.624 us    |                  native_pmd_val();

About features, one can now disable the duration (this will hide the
overhead too for convenient reasons and because on  doesn't need
overhead if it hasn't the duration):

 echo nofuncgraph-duration > trace_options

 # tracer: function_graph
 #
 #                FUNCTION CALLS
 #                |   |   |   |

           cap_vm_enough_memory() {
             __vm_enough_memory() {
               vm_acct_memory();
             }
           }
         }

And at last, an option to print the absolute time:

 //Restart from default options
 echo funcgraph-abstime > trace_options

 # tracer: function_graph
 #
 #      TIME       CPU  DURATION                  FUNCTION CALLS
 #       |         |     |   |                     |   |   |   |

   261.339774 |   1) + 42.823 us   |    }
   261.339775 |   1)   1.045 us    |    _spin_lock_irq();
   261.339777 |   1)   0.940 us    |    _spin_lock_irqsave();
   261.339778 |   1)   0.752 us    |    _spin_unlock_irqrestore();
   261.339780 |   1)   0.857 us    |    _spin_unlock_irq();
   261.339782 |   1)               |    flush_to_ldisc() {
   261.339783 |   1)               |      tty_ldisc_ref() {
   261.339783 |   1)               |        tty_ldisc_try() {
   261.339784 |   1)   1.075 us    |          _spin_lock_irqsave();
   261.339786 |   1)   0.842 us    |          _spin_unlock_irqrestore();
   261.339788 |   1)   4.211 us    |        }
   261.339788 |   1)   5.662 us    |      }

The format is seconds.usecs.

I guess no one needs the nanosec precision here, the main goal is to have
an overview about the general timings of events, and to see the place when
the trace switches from one cpu to another.

ie:

   274.874760 |   1)   0.676 us    |      _spin_unlock();
   274.874762 |   1)   0.609 us    |      native_load_sp0();
   274.874763 |   1)   0.602 us    |      native_load_tls();
   274.878739 |   0)   0.722 us    |                  }
   274.878740 |   0)   0.714 us    |                  native_pmd_val();
   274.878741 |   0)   0.730 us    |                  native_pmd_val();

Here there is a 4000 usecs difference when we switch the cpu.

Changes in V2:

- Completely fix the first pointless task switch.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-23 11:18:08 +01:00
Steven Rostedt
7e49fcce1b trace, lockdep: manual preempt count adding for local_bh_disable
Impact: fix to preempt trace triggering lockdep check_flag failure

In local_bh_disable, the use of add_preempt_count causes the
preempt tracer to start recording the time preemption is off.
But because it already modified the preempt_count to show
softirqs disabled, and before it called the lockdep code to
handle this, it causes a state that lockdep can not handle.

The preempt tracer will reset the ring buffer on start of a trace,
and the ring buffer reset code does a spin_lock_irqsave. This
calls into lockdep and lockdep will fail when it detects the
invalid state of having softirqs disabled but the internal
current->softirqs_enabled is still set.

The fix is to manually add the SOFTIRQ_OFFSET to preempt count
and call the preempt tracer code outside the lockdep critical
area.

Thanks to Peter Zijlstra for suggesting this solution.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-23 11:10:57 +01:00
Steven Rostedt
b06a830183 trace: fix logic to start/stop counting
The logic in the tracing_start/stop code prevents the WARN_ON
from ever detecting if a start/stop pair was mismatched.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-23 11:10:45 +01:00
Steven Rostedt
94523e818f trace: remove internal irqsoff disabling for trace output
Impact: cleanup of duplicate features

The trace output disables the ring buffer and prevents tracing to
occur. The code in irqsoff to do the same thing is no longer needed.
This patch removes it.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-23 11:10:36 +01:00
Steven Rostedt
91a8d07d82 ring-buffer: reset timestamps when ring buffer is reset
Impact: fix bad times of recent resets

The ring buffer needs to reset its timestamps when reseting of the
buffer, otherwise the timestamps are stale and might be used to
calculate times in the buffer causing funny timestamps to appear.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-22 10:31:58 +01:00
Steven Rostedt
69507c0653 ring-buffer: reset timestamps when ring buffer is reset
Impact: fix bad times of recent resets

The ring buffer needs to reset its timestamps when reseting of the
buffer, otherwise the timestamps are stale and might be used to
calculate times in the buffer causing funny timestamps to appear.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-22 10:27:54 +01:00
Steven Rostedt
f8ec1062f5 wakeup-tracer: show scheduling data in output
Impact: better data for wakeup tracer

This patch adds the wakeup and schedule calls that are used by
the scheduler tracer to make the wakeup tracer more readable.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-22 10:27:39 +01:00
Steven Rostedt
3244351c31 trace: separate out rt tasks from wakeup tracer
Impact: add option to trace all tasks or just RT tasks

The current wakeup tracer only traces RT task wakeups. This is
fine for those interested in wake up timings of RT tasks, but
it is useless for those that are interested in the causes
of long wakeups for non RT tasks.

This patch creates a "wakeup_rt" to implement the tracing of just
RT tasks (as the current "wakeup" does). And makes "wakeup" now
trace all tasks as an average developer would expect.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-22 10:27:22 +01:00
Steven Rostedt
97b17efe45 ring-buffer: do not swap if recording is disabled
If the ring buffer recording has been disabled. Do not let
swapping of ring buffers occur. Simply return -EAGAIN.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-22 10:27:16 +01:00
Steven Rostedt
5bc4564b22 trace: do not disable wake up tracer on output of trace
Impact: fix to erased trace output

To try not to have the outputing of a trace interfere with the wakeup
tracer, it would disable tracing while the output was printing. But
if a trace had started when it was disabled, it can show a partial
trace. To try to solve this, on closing of the tracer, it would
clear the trace buffer.

The latency tracers (wakeup and irqsoff) have two buffers. One for
recording and one for holding the max trace that is printed. The
clearing of the trace above should only affect the recording buffer.
But for some reason it would move the erased trace to the print
buffer. Probably due to a race with the closing of the trace and
the saving ofhe max race.

The above is all pretty useless, and if the user does not want the
printing of the trace to be traced itself, then the user can manual
disable tracing. This patch removes all the code that tries to keep
the output of the tracer from modifying the trace.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-22 10:26:50 +01:00
Thomas Gleixner
6552ebae25 Merge branch 'core/debugobjects' into core/urgent 2009-01-22 10:03:02 +01:00
Steven Rostedt
1092307d58 trace: set max latency variable to zero on default
Impact: trace max latencies on start of latency tracing

This patch sets the max latency to zero whenever one of the
irq variant tracers or the wakeup tracer is set to current tracer.

Most developers expect to see output when starting up a latency
tracer. But since the max_latency is already set to max, and
it takes a latency greater than max_latency to be recorded, there
is no trace. This is not the expected behavior and has even confused
myself.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-21 15:21:30 +01:00
Steven Rostedt
a442e5e0a2 trace: stop all recording to ring buffer on ftrace_dump
Impact: limit ftrace dump output

Currently ftrace_dump only calls ftrace_kill that is a fast way
to prevent the function tracer functions from being called (just sets
a flag and clears the function to call, nothing else). It is better
to also turn off any recording to the ring buffers as well.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-21 15:21:30 +01:00
Steven Rostedt
faf6861ebd trace: print ftrace_dump at KERN_EMERG log level
Impact: fix to print out ftrace_dump when expected

I was debugging a hard race condition to only find out that
after I hit the race, my log level was not at level to show
KERN_INFO. The time it took to trigger the race was wasted because
I did not capture the trace.

Since ftrace_dump is only called from kernel oops (and only when
it is set in the kernel command line to do so), or when a
developer adds it to their own local tree, the log level of
the print should be at KERN_EMERG to make sure the print appears.

ftrace_dump is not called by a normal user setup, and will not
add extra unwanted print out to the console. There is no reason
it should be at KERN_INFO.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-21 15:21:30 +01:00
Lai Jiangshan
551b4048b3 ring_buffer: reset write when reserve buffer fail
Impact: reset struct buffer_page.write when interrupt storm

if struct buffer_page.write is not reset, any succedent committing
will corrupted ring_buffer:

static inline void
rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
{
	......
		cpu_buffer->commit_page->commit =
			cpu_buffer->commit_page->write;
	......
}

when "if (RB_WARN_ON(cpu_buffer, next_page == reader_page))", ring_buffer
is disabled, but some reserved buffers may haven't been committed.
we need reset struct buffer_page.write.

when "if (unlikely(next_page == cpu_buffer->commit_page))", ring_buffer
is still available, we should not corrupt it.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-21 15:21:30 +01:00
Frederic Weisbecker
00f57f545a tracing/function-graph-tracer: fix a regression while suspend to disk
Impact: fix a crash while kernel image restore

When the function graph tracer is running and while suspend to disk, some racy
and dangerous things happen against this tracer.

The current task will save its registers including the stack pointer which
contains the return address hooked by the tracer. But the current task will
continue to enter other functions after that to save the memory, and then
it will store other return addresses, and finally loose the old depth which
matches the return address saved in the old stack (during the registers saving).

So on image restore, the code will return to wrong addresses.
And there are other things: on restore, the task will have it's "current"
pointer overwritten during registers restoring....switching from one task to
another... That would be insane to try to trace function graphs at these
stages.

This patch makes the function graph tracer listening on power events, making
it's tracing disabled for the current task (the one that performs the
hibernation work) while suspend/resume to disk, making the tracing safe
during hibernation.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-21 15:21:30 +01:00
Paul Mundt
0609697eab dma-coherent: Restore dma_alloc_from_coherent() large alloc fall back policy.
When doing large allocations (larger than the per-device coherent area)
the generic memory allocators are silently fallen back on regardless of
consideration for the per-device constraints.

In the DMA_MEMORY_EXCLUSIVE case falling back on generic memory is not
an option, as it tends not to be addressable by the DMA hardware in
question. This issue showed up with the 8139too breakage on the
Dreamcast, where non-addressable buffers were silently allocated due to
the size mismatch calculation -- while it should have simply errored out
upon being unable to satisfy the allocation with the given device
constraints.

This restores fall back behaviour to what it was before the oversized
request change caused multiple regressions.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2009-01-21 18:51:53 +09:00
Adrian McMenamin
cdf57cab27 dma-coherent: per-device coherent area is in pages, not bytes.
Commit 58c6d3dfe4 ("dma-coherent: catch
oversized requests to dma_alloc_from_coherent()") attempted to add a
sanity check to bail out on allocations larger than the coherent area.

Unfortunately when this was implemented, the fact the coherent area
is tracked in pages rather than bytes was overlooked, which subsequently
broke every single dma_alloc_from_coherent() user, forcing the allocation
silently through generic memory instead.

Signed-off-by: Adrian McMenamin <adrian@mcmen.demon.co.uk >
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2009-01-21 18:47:38 +09:00
Ingo Molnar
198030782c Merge branch 'x86/mm' into core/percpu
Conflicts:
	arch/x86/mm/fault.c
2009-01-21 10:39:51 +01:00
Ingo Molnar
3eb3963fd1 Merge branch 'cpus4096' into core/percpu
Conflicts:
	arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
	arch/x86/kernel/tlb_32.c

Merge it here because both the cpumask changes and the ongoing percpu
work is touching the TLB code. The percpu changes take precedence, as
they eliminate tlb_32.c altogether.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-21 10:14:17 +01:00
Steven Rostedt
082605de5f ring-buffer: fix alignment problem
Impact: fix to allow some archs to use the ring buffer

Commits in the ring buffer are checked by pointer arithmetic.
If the calculation is incorrect, then the commits will never take
place and the buffer will simply fill up and report an error.

Each page in the ring buffer has a small header:

struct buffer_data_page {
	u64		time_stamp;
	local_t		commit;
	unsigned char	data[];
};

Unfortuntely, some of the calculations used sizeof(struct buffer_data_page)
to know the size of the header. But this is incorrect on some archs,
where sizeof(struct buffer_data_page) does not equal
offsetof(struct buffer_data_page, data), and on those archs, the commits
are never processed.

This patch replaces the sizeof with offsetof.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-20 13:09:06 +01:00
Lai Jiangshan
3690b5e6fd trace_workqueue: use percpu data for workqueue stat
Impact: use percpu data instead of a global structure

Use:

   static DEFINE_PER_CPU(struct workqueue_global_stats, all_workqueue_stat);

instead of allocating a global structure.

percpu data also works well on NUMA.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-20 13:06:59 +01:00
Markus Metzger
11edda0628 x86, ftrace, hw-branch-tracer: change trace format
Change the hw-branch-tracer format to be more readable.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-20 13:04:16 +01:00
Markus Metzger
e23b8ad834 x86, ftrace, hw-branch-tracer: reset trace buffer on close
Reset the ftrace buffer on close. Since we use cyclic buffers, the
trace is not contiguous, anyway.

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-20 13:03:55 +01:00
Markus Metzger
b1818748b0 x86, ftrace, hw-branch-tracer: dump trace on oops
Dump the branch trace on an oops (based on ftrace_dump_on_oops).

Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-20 13:03:48 +01:00
Markus Metzger
5c5317de14 x86, ftrace, hw-branch-tracer: support hotplug cpus
Support hotplug cpus.

Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-20 13:03:38 +01:00
Rusty Russell
8ccad40df8 work_on_cpu: Use our own workqueue.
Impact: remove potential clashes with generic kevent workqueue

Annoyingly, some places we want to use work_on_cpu are already in
workqueues.  As per Ingo's suggestion, we create a different workqueue
for work_on_cpu.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-19 22:36:07 +01:00
Rusty Russell
31ad908120 work_on_cpu: don't try to get_online_cpus() in work_on_cpu.
Impact: remove potential circular lock dependency with cpu hotplug lock

This has caused more problems than it solved, with a pile of cpu
hotplug locking issues.

Followup patches will get_online_cpus() in callers that need it, but
if they don't do it they're no worse than before when they were using
set_cpus_allowed without locking.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-19 22:36:02 +01:00
Miao Xie
f90d4118ba cpuset: fix possible deadlock in async_rebuild_sched_domains
Lockdep reported some possible circular locking info when we tested cpuset on
NUMA/fake NUMA box.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.29-rc1-00224-ga652504 #111
-------------------------------------------------------
bash/2968 is trying to acquire lock:
 (events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8

but task is already holding lock:
 (cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29

which lock already depends on the new lock.
......
-------------------------------------------------------

Steps to reproduce:
# mkdir /dev/cpuset
# mount -t cpuset xxx /dev/cpuset
# mkdir /dev/cpuset/0
# echo 0 > /dev/cpuset/0/cpus
# echo 0 > /dev/cpuset/0/mems
# echo 1 > /dev/cpuset/0/memory_migrate
# cat /dev/zero > /dev/null &
# echo $! > /dev/cpuset/0/tasks

This is because async_rebuild_sched_domains has the following lock sequence:
run_workqueue(async_rebuild_sched_domains)
	-> do_rebuild_sched_domains -> cgroup_lock

But, attaching tasks when memory_migrate is set has following:
cgroup_lock_live_group(cgroup_tasks_write)
	-> do_migrate_pages -> flush_work

This patch fixes it by using a separate workqueue thread.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-19 02:44:00 +01:00
Peter Zijlstra
1d4a7f1c4f hrtimers: fix inconsistent lock state on resume in hres_timers_resume
Andrey Borzenkov reported this lockdep assert:

> [17854.688347] =================================
> [17854.688347] [ INFO: inconsistent lock state ]
> [17854.688347] 2.6.29-rc2-1avb #1
> [17854.688347] ---------------------------------
> [17854.688347] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
> [17854.688347] pm-suspend/18240 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [17854.688347]  (&cpu_base->lock){++..}, at: [<c0136fcc>] retrigger_next_event+0x5c/0xa0
> [17854.688347] {in-hardirq-W} state was registered at:
> [17854.688347]   [<c01443cd>] __lock_acquire+0x79d/0x1930
> [17854.688347]   [<c01455bc>] lock_acquire+0x5c/0x80
> [17854.688347]   [<c03092e5>] _spin_lock+0x35/0x70
> [17854.688347]   [<c0136e61>] hrtimer_run_queues+0x31/0x140
> [17854.688347]   [<c0128d98>] run_local_timers+0x8/0x20
> [17854.688347]   [<c0128dd3>] update_process_times+0x23/0x60
> [17854.688347]   [<c013e274>] tick_periodic+0x24/0x80
> [17854.688347]   [<c013e2e2>] tick_handle_periodic+0x12/0x70
> [17854.688347]   [<c0104e24>] timer_interrupt+0x14/0x20
> [17854.688347]   [<c01607b9>] handle_IRQ_event+0x29/0x60
> [17854.688347]   [<c0161c59>] handle_level_irq+0x69/0xe0
> [17854.688347]   [<ffffffff>] 0xffffffff
> [17854.688347] irq event stamp: 55771
> [17854.688347] hardirqs last  enabled at (55771): [<c0309125>] _spin_unlock_irqrestore+0x35/0x60
> [17854.688347] hardirqs last disabled at (55770): [<c0309419>] _spin_lock_irqsave+0x19/0x80
> [17854.688347] softirqs last  enabled at (54836): [<c0124f54>] __do_softirq+0xc4/0x110
> [17854.688347] softirqs last disabled at (54831): [<c01049ae>] do_softirq+0x8e/0xe0
> [17854.688347]
> [17854.688347] other info that might help us debug this:
> [17854.688347] 3 locks held by pm-suspend/18240:
> [17854.688347]  #0:  (&buffer->mutex){--..}, at: [<c01dd4c5>] sysfs_write_file+0x25/0x100
> [17854.688347]  #1:  (pm_mutex){--..}, at: [<c015056f>] enter_state+0x4f/0x140
> [17854.688347]  #2:  (dpm_list_mtx){--..}, at: [<c027880f>] device_pm_lock+0xf/0x20
> [17854.688347]
> [17854.688347] stack backtrace:
> [17854.688347] Pid: 18240, comm: pm-suspend Not tainted 2.6.29-rc2-1avb #1
> [17854.688347] Call Trace:
> [17854.688347]  [<c0306248>] ? printk+0x18/0x20
> [17854.688347]  [<c0141fac>] print_usage_bug+0x16c/0x1d0
> [17854.688347]  [<c0142bcf>] mark_lock+0x8bf/0xc90
> [17854.688347]  [<c0106b8f>] ? pit_next_event+0x2f/0x40
> [17854.688347]  [<c01441b0>] __lock_acquire+0x580/0x1930
> [17854.688347]  [<c030916d>] ? _spin_unlock+0x1d/0x20
> [17854.688347]  [<c0106b8f>] ? pit_next_event+0x2f/0x40
> [17854.688347]  [<c013dd38>] ? clockevents_program_event+0x98/0x160
> [17854.688347]  [<c0142fe8>] ? mark_held_locks+0x48/0x90
> [17854.688347]  [<c0309125>] ? _spin_unlock_irqrestore+0x35/0x60
> [17854.688347]  [<c0143229>] ? trace_hardirqs_on_caller+0x139/0x190
> [17854.688347]  [<c014328b>] ? trace_hardirqs_on+0xb/0x10
> [17854.688347]  [<c01455bc>] lock_acquire+0x5c/0x80
> [17854.688347]  [<c0136fcc>] ? retrigger_next_event+0x5c/0xa0
> [17854.688347]  [<c03092e5>] _spin_lock+0x35/0x70
> [17854.688347]  [<c0136fcc>] ? retrigger_next_event+0x5c/0xa0
> [17854.688347]  [<c0136fcc>] retrigger_next_event+0x5c/0xa0
> [17854.688347]  [<c013711a>] hres_timers_resume+0xa/0x10
> [17854.688347]  [<c013aa8e>] timekeeping_resume+0xee/0x150
> [17854.688347]  [<c0273384>] __sysdev_resume+0x14/0x50
> [17854.688347]  [<c0273407>] sysdev_resume+0x47/0x80
> [17854.688347]  [<c02791ab>] device_power_up+0xb/0x20
> [17854.688347]  [<c015043f>] suspend_devices_and_enter+0xcf/0x150
> [17854.688347]  [<c0150c2f>] ? freeze_processes+0x3f/0x90
> [17854.688347]  [<c0150614>] enter_state+0xf4/0x140
> [17854.688347]  [<c01506dd>] state_store+0x7d/0xc0
> [17854.688347]  [<c0150660>] ? state_store+0x0/0xc0
> [17854.688347]  [<c0202da4>] kobj_attr_store+0x24/0x30
> [17854.688347]  [<c01dd53c>] sysfs_write_file+0x9c/0x100
> [17854.688347]  [<c019916c>] vfs_write+0x9c/0x160
> [17854.688347]  [<c0103494>] ? restore_nocheck_notrace+0x0/0xe
> [17854.688347]  [<c01dd4a0>] ? sysfs_write_file+0x0/0x100
> [17854.688347]  [<c01992ed>] sys_write+0x3d/0x70
> [17854.688347]  [<c0103371>] sysenter_do_call+0x12/0x31

Andrey's analysis:

> timekeeping_resume() is called via class ->resume
> method; and according to comments in sysdev_resume() and
> device_power_up(), they are called with interrupts disabled.
>
> Looking at suspend_enter, irqs *are* disabled at this point.
>
> So it actually looks like something (may be some driver)
> unconditionally enabled irqs in resume path.

Add a debug check to test this theory. If it triggers then it
triggers because the resume code calls it with irqs enabled,
which is a no-no not just for timekeeping_resume(), but also
bad for a number of other resume handlers.

Reported-by: Andrey Borzenkov <arvidjaar@mail.ru>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-18 21:31:37 +01:00
Jiri Slaby
b786c6a98e relay: fix lock imbalance in relay_late_setup_files
One fail path in relay_late_setup_files() omits
mutex_unlock(&relay_channels_mutex);
Add it.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-18 20:29:35 +01:00
Ingo Molnar
4092762aeb Merge branch 'tracing/ftrace'; commit 'v2.6.29-rc2' into tracing/core 2009-01-18 20:15:05 +01:00
Mandeep Singh Baines
603a148f43 softlockup: fix potential race in hung_task when resetting timeout
Impact: fix potential false panic

A potential race exists if sysctl_hung_task_timeout_secs is reset to 0
while inside check_hung_uniterruptible_tasks(). If check_task() is
entered, a comparison with 0 will result in a false hung_task being
detected.

If sysctl_hung_task_panic is set, the system will panic.

Signed-off-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-18 19:20:17 +01:00
Ingo Molnar
b2b062b816 Merge branch 'core/percpu' into stackprotector
Conflicts:
	arch/x86/include/asm/pda.h
	arch/x86/include/asm/system.h

Also, moved include/asm-x86/stackprotector.h to arch/x86/include/asm.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-18 18:37:14 +01:00
Rusty Russell
e1d9ec6246 work_on_cpu: Use our own workqueue.
Impact: remove potential clashes with generic kevent workqueue

Annoyingly, some places we want to use work_on_cpu are already in
workqueues.  As per Ingo's suggestion, we create a different workqueue
for work_on_cpu.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-16 15:31:15 -08:00
Rusty Russell
68564a4697 work_on_cpu: don't try to get_online_cpus() in work_on_cpu.
Impact: remove potential circular lock dependency with cpu hotplug lock

This has caused more problems than it solved, with a pile of cpu
hotplug locking issues.

Followup patches will get_online_cpus() in callers that need it, but
if they don't do it they're no worse than before when they were using
set_cpus_allowed without locking.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-16 15:31:15 -08:00
Rafael J. Wysocki
091d71e023 PM: Fix compilation warning in kernel/power/main.c
Reorder the code in kernel/power/main.c to fix compilation warning
triggered by unsetting CONFIG_SUSPEND.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2009-01-16 18:13:41 -05:00
Len Brown
88d998c264 Merge branch 'misc' into release 2009-01-16 14:45:34 -05:00
Masami Hiramatsu
5a4ccaf37f kprobes: check CONFIG_FREEZER instead of CONFIG_PM
Check CONFIG_FREEZER instead of CONFIG_PM because kprobe booster
depends on freeze_processes() and thaw_processes() when CONFIG_PREEMPT=y.

This fixes a linkage error which occurs when CONFIG_PREEMPT=y, CONFIG_PM=y
and CONFIG_FREEZER=n.

Reported-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Len Brown <len.brown@intel.com>
2009-01-16 14:32:17 -05:00
Rafael J. Wysocki
33f1d7ecc6 PM: Fix freezer compilation if PM_SLEEP is unset
Freezer fails to compile if with the following configuration
settings:

CONFIG_CGROUPS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_MODULES=y
CONFIG_FREEZER=y
CONFIG_PM=y
CONFIG_PM_SLEEP=n

Fix this by making process.o compilation depend on CONFIG_FREEZER.

Reported-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2009-01-16 14:32:17 -05:00
Mandeep Singh Baines
af432eb1cc softlockup: fix to allow compiling with !DETECT_HUNG_TASK
Fixes the following compile error:

 kernel/fork.c:1049: error: 'struct task_struct' has no member named 'last_switch_count'
 kernel/fork.c:1050: error: 'struct task_struct' has no member named 'last_switch_timestamp'

Signed-off-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 18:17:34 +01:00
Ingo Molnar
74296a8ed6 irq: provide debug_poll_all_shared_irqs() method under CONFIG_DEBUG_SHIRQ
Provide a shared interrupt debug facility under CONFIG_DEBUG_SHIRQ:
it uses the existing irqpoll facilities to iterate through all
registered interrupt handlers and call those which can handle shared
IRQ lines.

This can be handy for suspend/resume debugging: if we call this function
early during resume we can trigger crashes in those drivers which have
incorrect assumptions about when exactly their ISRs will be called
during suspend/resume.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 17:46:49 +01:00
Ingo Molnar
5a2dd72abd Merge branch 'linus' into irq/genirq 2009-01-16 17:46:22 +01:00
Peter Zijlstra
ceacc2c1c8 sched: make plist a library facility
Ingo Molnar wrote:

> here's a new build failure with tip/sched/rt:
>
>   LD      .tmp_vmlinux1
> kernel/built-in.o: In function `set_curr_task_rt':
> sched.c:(.text+0x3675): undefined reference to `plist_del'
> kernel/built-in.o: In function `pick_next_task_rt':
> sched.c:(.text+0x37ce): undefined reference to `plist_del'
> kernel/built-in.o: In function `enqueue_pushable_task':
> sched.c:(.text+0x381c): undefined reference to `plist_del'

Eliminate the plist library kconfig and make it available
unconditionally.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 15:01:31 +01:00
Mandeep Singh Baines
e162b39a36 softlockup: decouple hung tasks check from softlockup detection
Decoupling allows:

* hung tasks check to happen at very low priority

* hung tasks check and softlockup to be enabled/disabled independently
  at compile and/or run-time

* individual panic settings to be enabled disabled independently
  at compile and/or run-time

* softlockup threshold to be reduced without increasing hung tasks
  poll frequency (hung task check is expensive relative to softlock watchdog)

* hung task check to be zero over-head when disabled at run-time

Signed-off-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 14:06:04 +01:00
Magnus Damm
2d68259db2 clockevents: let set_mode() setup delta information
Allow the set_mode() clockevent callback to decide and fill in delta
details such as shift, mult, max_delta_ns and min_delta_ns.

With this change the clockevent can be registered without delta details
which allows us to keep the parent clock disabled until the clockevent
gets setup using set_mode().

Letting set_mode() fill in or update delta details allows us to save
power by disabling the parent clock while the clockevent is unused.
This may however make the parent clock rate change, so next time the
clockevent gets enabled we need let set_mode() to update the detla
details accordingly. Doing it at registration time is not enough.

Furthermore, the delta details seem unused in the case of periodic-only
clockevent drivers, so this change also allows registration of such
drivers without the delta details filled in.

Signed-off-by: Magnus Damm <damm@igel.co.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-01-16 12:27:39 +01:00
Steven Rostedt
745b1626dd trace: set max latency variable to zero on default
Impact: trace max latencies on start of latency tracing

This patch sets the max latency to zero whenever one of the
irq variant tracers or the wakeup tracer is set to current tracer.

Most developers expect to see output when starting up a latency
tracer. But since the max_latency is already set to max, and
it takes a latency greater than max_latency to be recorded, there
is no trace. This is not the expected behavior and has even confused
myself.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 12:18:09 +01:00
Steven Rostedt
a225cdd263 ftrace: remove static from function tracer functions
Impact: clean up

After reorganizing the functions in trace.c and trace_function.c,
they no longer need to be in global context. This patch makes the
functions and one variable into static.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 12:17:58 +01:00
Steven Rostedt
3eb36aa053 ftrace: combine stack trace in function call
Impact: less likely to interleave function and stack traces

This patch does replaces the separate stack trace on function with
a record function and stack trace together. This will switch between
the function only recording to a function and stack recording.

Also some whitespace fix ups as well.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 12:17:46 +01:00
Steven Rostedt
5e4abc9839 trace: clean up format errors in calls to trace_seq_printf
After adding the printf format checking for trace_seq_printf, several
warnings now show up. This patch cleans them up.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 12:17:38 +01:00
Steven Rostedt
c37abc5515 trace: add gcc printf check to trace_seq_printf
Andrew Morton suggested adding a printf checker to trace_seq_printf
since there are a number of users that have improper format arguments.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 12:17:29 +01:00
Steven Rostedt
bb3c3c95f3 ftrace: move function tracer functions out of trace.c
Impact: clean up of trace.c

The function tracer functions were put in trace.c because it needed
to share static variables that were in trace.c.  Since then, those
variables have become global for various reasons. This patch moves
the function tracer functions into trace_function.c where they belong.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 12:17:10 +01:00
Steven Rostedt
5361499101 ftrace: add stack trace to function tracer
Impact: new feature to stack trace any function

Chris Mason asked about being able to pick and choose a function
and get a stack trace from it. This feature enables his request.

 # echo io_schedule > /debug/tracing/set_ftrace_filter
 # echo function > /debug/tracing/current_tracer
 # echo func_stack_trace > /debug/tracing/trace_options

Produces the following in /debug/tracing/trace:

       kjournald-702   [001]   135.673060: io_schedule <-sync_buffer
       kjournald-702   [002]   135.673671:
 <= sync_buffer
 <= __wait_on_bit
 <= out_of_line_wait_on_bit
 <= __wait_on_buffer
 <= sync_dirty_buffer
 <= journal_commit_transaction
 <= kjournald

Note, be careful about turning this on without filtering the functions.
You may find that you have a 10 second lag between typing and seeing
what you typed. This is why the stack trace for the function tracer
does not use the same stack_trace flag as the other tracers use.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 12:15:32 +01:00
Lai Jiangshan
6c1a99afbd ftrace: fix trace_output
Impact: fix bug for handling partial line

trace_seq_printf(), seq_print_userip_objs(), ... return
0          -- partial line was written
other(>0)  -- success

duplicate output is also removed in trace_print_raw().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-16 12:15:20 +01:00
Linus Torvalds
7cb36b6ccd Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: sched_slice() fixlet
  sched: fix update_min_vruntime
  sched: SCHED_OTHER vs SCHED_IDLE isolation
  sched: SCHED_IDLE weight change
  sched: fix bandwidth validation for UID grouping
  Revert "sched: improve preempt debugging"
2009-01-15 16:55:00 -08:00
Randy Dunlap
6ae301e85c resources: fix parameter name and kernel-doc
Fix __request_region() parameter kernel-doc notation and parameter name:

Warning(linux-2.6.28-git10//kernel/resource.c:627): No description found for parameter 'flags'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-15 16:39:38 -08:00
Li Zefan
45ce80fb6b cgroups: consolidate cgroup documents
Move Documentation/cpusets.txt and Documentation/controllers/* to
Documentation/cgroups/

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-15 16:39:37 -08:00
Lin Ming
6272d68cc6 sched: sched_slice() fixlet
Mike's change: 0a582440f "sched: fix sched_slice())" broke group
scheduling by forgetting to reload cfs_rq on each loop.

This patch fixes aim7 regression and specjbb2005 regression becomes
less than 1.5% on 8-core stokley.

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Jayson King <dev@jaysonking.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 21:07:57 +01:00
Doug Chapman
88fc241f54 [IA64] dump stack on kernel unaligned warnings
Often the cause of kernel unaligned access warnings is not
obvious from just the ip displayed in the warning.  This adds
the option via proc to dump the stack in addition to the warning.
The default is off (just display the 1 line warning).  To enable
the stack to be shown: echo 1 > /proc/sys/kernel/unaligned-dump-stack

Signed-off-by: Doug Chapman <doug.chapman@hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2009-01-15 10:38:56 -08:00
Ingo Molnar
49a93bc978 Merge branch 'linus' into cpus4096 2009-01-15 15:45:31 +01:00
Peter Zijlstra
e17036dac1 sched: fix update_min_vruntime
Impact: fix SCHED_IDLE latency problems

OK, so we have 1 running task A (which is obviously curr and the tree is
equally obviously empty).

'A' nicely chugs along, doing its thing, carrying min_vruntime along as it
goes.

Then some whacko speed freak SCHED_IDLE task gets inserted due to SMP
balancing, which is very likely far right, in that case

update_curr
  update_min_vruntime
    cfs_rq->rb_leftmost := true (the crazy task sitting in a tree)
      vruntime = se->vruntime

and voila, min_vruntime is waaay right of where it ought to be.

OK, so why did I write it like that to begin with...

Aah, yes.

Say we've just dequeued current

schedule
  deactivate_task(prev)
    dequeue_entity
      update_min_vruntime

Then we'll set

  vruntime = cfs_rq->min_vruntime;

we find !cfs_rq->curr, but do find someone in the tree. Then we _must_
do vruntime = se->vruntime, because

 vruntime = min_vruntime(vruntime := cfs_rq->min_vruntime, se->vruntime)

will not advance vruntime, and cause lags the other way around (which we
fixed with that initial patch: 1af5f730fc
(sched: more accurate min_vruntime accounting).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Mike Galbraith <efault@gmx.de>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 15:12:19 +01:00
Peter Zijlstra
6bc912b71b sched: SCHED_OTHER vs SCHED_IDLE isolation
Stronger SCHED_IDLE isolation:

 - no SCHED_IDLE buddies
 - never let SCHED_IDLE preempt on wakeup
 - always preempt SCHED_IDLE on wakeup
 - limit SLEEPER fairness for SCHED_IDLE.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 15:07:29 +01:00
Peter Zijlstra
cce7ade803 sched: SCHED_IDLE weight change
Increase the SCHED_IDLE weight from 2 to 3, this gives much more stable
vruntime numbers.

time advanced in 100ms:

 weight=2

 64765.988352
 67012.881408
 88501.412352

 weight=3

 35496.181411
 34130.971298
 35497.411573

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 15:07:28 +01:00
Peter Zijlstra
98a4826b99 sched: fix bandwidth validation for UID grouping
Impact: make rt-limit tunables work again

Mark Glines reported:

> I've got an issue on x86-64 where I can't configure the system to allow
> RT tasks for a non-root user.
>
> In 2.6.26.5, I was able to do the following to set things up nicely:
> echo 450000 >/sys/kernel/uids/0/cpu_rt_runtime
> echo 450000 >/sys/kernel/uids/1000/cpu_rt_runtime
>
> Seems like every value I try to echo into the /sys files returns EINVAL.

For UID grouping we initialize the root group with infinite bandwidth
which by default is actually more than the global limit, therefore the
bandwidth check always fails.

Because the root group is a phantom group (for UID grouping) we cannot
runtime adjust it, therefore we let it reflect the global bandwidth
settings.

Reported-by: Mark Glines <mark@glines.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 15:07:27 +01:00
Ingo Molnar
7f268f4352 Merge branches 'cpus4096', 'x86/cleanups' and 'x86/urgent' into x86/percpu 2009-01-15 13:18:57 +01:00
Jaswinder Singh Rajput
934d96eafa time-sched.c: tick_nohz_update_jiffies should be static
Impact: cleanup, reduce kernel size a bit, avoid sparse warning

Fixes sparse warning:

 kernel/time/tick-sched.c:137:6: warning: symbol 'tick_nohz_update_jiffies' was not declared. Should it be static?

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 12:06:56 +01:00
Peter Zijlstra
e52fb7c097 sched: prefer wakers
Prefer tasks that wake other tasks to preempt quickly. This improves
performance because more work is available sooner.

The workload that prompted this patch was a kernel build over NFS4 (for some
curious and not understood reason we had to revert commit:
18de973530 to make any progress at all)

Without this patch a make -j8 bzImage (of x86-64 defconfig) would take
3m30-ish, with this patch we're down to 2m50-ish.

psql-sysbench/mysql-sysbench show a slight improvement in peak performance as
well, tbench and vmark seemed to not care.

It is possible to improve upon the build time (to 2m20-ish) but that seriously
destroys other benchmarks (just shows that there's more room for tinkering).

Much thanks to Mike who put in a lot of effort to benchmark things and proved
a worthy opponent with a competing patch.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 12:00:09 +01:00
Peter Zijlstra
831451ac4e sched: introduce avg_wakeup
Introduce a new avg_wakeup statistic.

avg_wakeup is a measure of how frequently a task wakes up other tasks, it
represents the average time between wakeups, with a limit of avg_runtime
for when it doesn't wake up anybody.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 12:00:08 +01:00
Ingo Molnar
55922173f1 tracing: trace_stat.c cleanup
Impact: cleanup

- whitespace / code alignment cleanups
- avoid unnecessary forward prototype by reordering functions

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 11:31:21 +01:00
Li Zefan
42fab4b2cd tracing/ftrace: add missing unlock in register_stat_tracer()
We should unlock all_stat_sessions_mutex before returning failure.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 11:28:32 +01:00
Frederic Weisbecker
4a2b8dda3f tracing/function-graph-tracer: fix a regression while suspend to disk
Impact: fix a crash while kernel image restore

When the function graph tracer is running and while suspend to disk, some racy
and dangerous things happen against this tracer.

The current task will save its registers including the stack pointer which
contains the return address hooked by the tracer. But the current task will
continue to enter other functions after that to save the memory, and then
it will store other return addresses, and finally loose the old depth which
matches the return address saved in the old stack (during the registers saving).

So on image restore, the code will return to wrong addresses.
And there are other things: on restore, the task will have it's "current"
pointer overwritten during registers restoring....switching from one task to
another... That would be insane to try to trace function graphs at these
stages.

This patch makes the function graph tracer listening on power events, making
it's tracing disabled for the current task (the one that performs the
hibernation work) while suspend/resume to disk, making the tracing safe
during hibernation.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 09:47:40 +01:00
Steven Rostedt
0ee6b6cf5b trace: stop all recording to ring buffer on ftrace_dump
Impact: limit ftrace dump output

Currently ftrace_dump only calls ftrace_kill that is a fast way
to prevent the function tracer functions from being called (just sets
a flag and clears the function to call, nothing else). It is better
to also turn off any recording to the ring buffers as well.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 09:47:13 +01:00
Lai Jiangshan
6f3b34402e ring_buffer: reset write when reserve buffer fail
Impact: reset struct buffer_page.write when interrupt storm

if struct buffer_page.write is not reset, any succedent committing
will corrupted ring_buffer:

static inline void
rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
{
	......
		cpu_buffer->commit_page->commit =
			cpu_buffer->commit_page->write;
	......
}

when "if (RB_WARN_ON(cpu_buffer, next_page == reader_page))", ring_buffer
is disabled, but some reserved buffers may haven't been committed.
we need reset struct buffer_page.write.

when "if (unlikely(next_page == cpu_buffer->commit_page))", ring_buffer
is still available, we should not corrupt it.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 09:46:40 +01:00
Steven Rostedt
428aee1460 trace: print ftrace_dump at KERN_EMERG log level
Impact: fix to print out ftrace_dump when expected

I was debugging a hard race condition to only find out that
after I hit the race, my log level was not at level to show
KERN_INFO. The time it took to trigger the race was wasted because
I did not capture the trace.

Since ftrace_dump is only called from kernel oops (and only when
it is set in the kernel command line to do so), or when a
developer adds it to their own local tree, the log level of
the print should be at KERN_EMERG to make sure the print appears.

ftrace_dump is not called by a normal user setup, and will not
add extra unwanted print out to the console. There is no reason
it should be at KERN_INFO.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-15 09:45:09 +01:00
Linus Torvalds
bca268565f Merge branch 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'syscalls' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (44 commits)
  [CVE-2009-0029] s390 specific system call wrappers
  [CVE-2009-0029] System call wrappers part 33
  [CVE-2009-0029] System call wrappers part 32
  [CVE-2009-0029] System call wrappers part 31
  [CVE-2009-0029] System call wrappers part 30
  [CVE-2009-0029] System call wrappers part 29
  [CVE-2009-0029] System call wrappers part 28
  [CVE-2009-0029] System call wrappers part 27
  [CVE-2009-0029] System call wrappers part 26
  [CVE-2009-0029] System call wrappers part 25
  [CVE-2009-0029] System call wrappers part 24
  [CVE-2009-0029] System call wrappers part 23
  [CVE-2009-0029] System call wrappers part 22
  [CVE-2009-0029] System call wrappers part 21
  [CVE-2009-0029] System call wrappers part 20
  [CVE-2009-0029] System call wrappers part 19
  [CVE-2009-0029] System call wrappers part 18
  [CVE-2009-0029] System call wrappers part 17
  [CVE-2009-0029] System call wrappers part 16
  [CVE-2009-0029] System call wrappers part 15
  ...
2009-01-14 19:58:40 -08:00
Sam Ravnborg
2ea038917b Revert "kbuild: strip generated symbols from *.ko"
This reverts commit ad7a953c52.

And commit: ("allow stripping of generated symbols under CONFIG_KALLSYMS_ALL")
            9bb482476c

These stripping patches has caused a set of issues:

1) People have reported compatibility issues with binutils due to
   lack of support for `--strip-unneeded-symbols' with objcopy 2.15.92.0.2
   Reported by: Wenji
2) ccache and distcc no longer works as expeced
   Reported by: Ted, Roland, + others
3) The installed modules increased a lot in size
   Reported by: Ted, Davej + others

Reported-by: Wenji Huang <wenji.huang@oracle.com>
Reported-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2009-01-14 21:38:20 +01:00
Chris Mason
ac6e60ee40 mutex: adaptive spinnning, performance tweaks
Spin more agressively. This is less fair but also markedly faster.

The numbers:

 * dbench 50 (higher is better):
  spin        1282MB/s
  v10         548MB/s
  v10 no wait 1868MB/s

 * 4k creates (numbers in files/second higher is better):
  spin        avg 200.60 median 193.20 std 19.71 high 305.93 low 186.82
  v10         avg 180.94 median 175.28 std 13.91 high 229.31 low 168.73
  v10 no wait avg 232.18 median 222.38 std 22.91 high 314.66 low 209.12

 * File stats (numbers in seconds, lower is better):
  spin        2.27s
  v10         5.1s
  v10 no wait 1.6s

( The source changes are smaller than they look, I just moved the
  need_resched checks in __mutex_lock_common after the cmpxchg. )

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 19:03:54 +01:00
Andrew Morton
9316fcacb8 kernel/up.c: omit it if SMP=y, USE_GENERIC_SMP_HELPERS=n
Fix the sparc build - we were including `up.o' on SMP builds, when
CONFIG_USE_GENERIC_SMP_HELPERS=n.

Tested-by: Robert Reif <reif@earthlink.net>
Fixed-by: Robert Reif <reif@earthlink.net>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-14 09:42:11 -08:00
Peter Zijlstra
0d66bf6d35 mutex: implement adaptive spinning
Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally
implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
gave a 345% boost for VFS scalability on my testbox:

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:               296604

 # ./test-mutex-shm V 16 10 | grep "^avg ops"
 avg ops/sec:               85870

The key criteria for the busy wait is that the lock owner has to be running on
a (different) cpu. The idea is that as long as the owner is running, there is a
fair chance it'll release the lock soon, and thus we'll be better off spinning
instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new code,
there is means to disable this behaviour runtime (without having to reboot
the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
by issuing the following command:

 # echo NO_OWNER_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo OWNER_SPIN > /debug/sched_features

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 18:09:02 +01:00
Peter Zijlstra
41719b0309 mutex: preemption fixes
The problem is that dropping the spinlock right before schedule is a voluntary
preemption point and can cause a schedule, right after which we schedule again.

Fix this inefficiency by keeping preemption disabled until we schedule, do this
by explicity disabling preemption and providing a schedule() variant that
assumes preemption is already disabled.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 18:09:00 +01:00
Peter Zijlstra
93d81d1aca mutex: small cleanup
Remove a local variable by combining an assingment and test in one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 18:08:58 +01:00
Gregory Haskins
398a153b16 sched: fix build error in kernel/sched_rt.c when RT_GROUP_SCHED && !SMP
Ingo found a build error in the scheduler when RT_GROUP_SCHED was
enabled, but SMP was not.  This patch rearranges the code such
that it is a little more streamlined and compiles under all permutations
of SMP, UP and RT_GROUP_SCHED.  It was boot tested on my 4-way x86_64
and it still passes preempt-test.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
2009-01-14 09:10:04 -05:00
Gregory Haskins
b07430ac37 sched: de CPP-ify the scheduler code
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
2009-01-14 08:55:39 -05:00
Heiko Carstens
d4e82042c4 [CVE-2009-0029] System call wrappers part 32
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:31 +01:00
Heiko Carstens
836f92adf1 [CVE-2009-0029] System call wrappers part 31
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:31 +01:00
Heiko Carstens
6559eed8ca [CVE-2009-0029] System call wrappers part 30
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:30 +01:00
Heiko Carstens
1e7bfb2134 [CVE-2009-0029] System call wrappers part 27
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:29 +01:00
Heiko Carstens
c4ea37c26a [CVE-2009-0029] System call wrappers part 26
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:29 +01:00
Heiko Carstens
e48fbb699f [CVE-2009-0029] System call wrappers part 24
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:28 +01:00
Heiko Carstens
5a8a82b1d3 [CVE-2009-0029] System call wrappers part 23
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:28 +01:00
Heiko Carstens
003d7ab479 [CVE-2009-0029] System call wrappers part 19
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:26 +01:00
Heiko Carstens
a6b42e83f2 [CVE-2009-0029] System call wrappers part 18
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:25 +01:00
Heiko Carstens
ca013e945b [CVE-2009-0029] System call wrappers part 17
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:25 +01:00
Heiko Carstens
a5f8fa9e9b [CVE-2009-0029] System call wrappers part 09
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:21 +01:00
Heiko Carstens
17da2bd90a [CVE-2009-0029] System call wrappers part 08
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:21 +01:00
Heiko Carstens
754fe8d297 [CVE-2009-0029] System call wrappers part 07
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:20 +01:00
Heiko Carstens
5add95d4f7 [CVE-2009-0029] System call wrappers part 06
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:20 +01:00
Heiko Carstens
362e9c07c7 [CVE-2009-0029] System call wrappers part 05
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:20 +01:00
Heiko Carstens
b290ebe2c4 [CVE-2009-0029] System call wrappers part 04
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:19 +01:00
Heiko Carstens
ae1251ab78 [CVE-2009-0029] System call wrappers part 03
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:19 +01:00
Heiko Carstens
dbf040d9d1 [CVE-2009-0029] System call wrappers part 02
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:19 +01:00
Heiko Carstens
58fd3aa288 [CVE-2009-0029] System call wrappers part 01
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:18 +01:00
Heiko Carstens
f627a741d2 [CVE-2009-0029] Make sys_syslog a conditional system call
Remove the -ENOSYS implementation for !CONFIG_PRINTK and use
the cond_syscall infrastructure instead.

Acked-by: Kyle McMartin <kyle@redhat.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:16 +01:00
Heiko Carstens
2ed7c03ec1 [CVE-2009-0029] Convert all system calls to return a long
Convert all system calls to return a long. This should be a NOP since all
converted types should have the same size anyway.
With the exception of sys_exit_group which returned void. But that doesn't
matter since the system call doesn't return.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2009-01-14 14:15:14 +01:00
Ingo Molnar
14819ea1e0 irq: export __set_irq_handler() and handle_level_irq()
Impact: build fix

ARM updates broke x86 allmodconfig builds:

 ERROR: "__set_irq_handler" [drivers/mfd/pcf50633-core.ko] undefined!
 ERROR: "handle_level_irq" [drivers/mfd/pcf50633-core.ko] undefined!

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 12:34:21 +01:00
Uwe Kleine-Koenig
32632920a7 ftrace, trivial: fix typo "resgister" -> "register"
Signed-off-by: Uwe Kleine-Koenig <ukleinek@strlen.de>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 12:11:49 +01:00
Frederic Weisbecker
e1d8aa9f1d tracing: add a new workqueue tracer
Impact: new tracer

The workqueue tracer provides some statistical informations
about each cpu workqueue thread such as the number of the
works inserted and executed since their creation. It can help
to evaluate the amount of work each of them have to perform.
For example it can help a developer to decide whether he should
choose a per cpu workqueue instead of a singlethreaded one.

It only traces statistical informations for now but it will probably later
provide event tracing too.

Such a tracer could help too, and be improved, to help rt priority sorted
workqueue development.

To have a snapshot of the workqueues state at any time, just do

cat /debugfs/tracing/trace_stat/workqueues

Ie:

  1    125        125       reiserfs/1
  1      0          0       scsi_tgtd/1
  1      0          0       aio/1
  1      0          0       ata/1
  1    114        114       kblockd/1
  1      0          0       kintegrityd/1
  1   2147       2147       events/1

  0      0          0       kpsmoused
  0    105        105       reiserfs/0
  0      0          0       scsi_tgtd/0
  0      0          0       aio/0
  0      0          0       ata_aux
  0      0          0       ata/0
  0      0          0       cqueue
  0      0          0       kacpi_notify
  0      0          0       kacpid
  0    149        149       kblockd/0
  0      0          0       kintegrityd/0
  0   1000       1000       khelper
  0   2270       2270       events/0

Changes in V2:

_ Drop the static array based on NR_CPU and dynamically allocate the stat array
  with num_possible_cpus() and other cpu mask facilities....
_ Trace workqueue insertion at a bit lower level (insert_work instead of queue_work) to handle
  even the workqueue barriers.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 12:11:43 +01:00
Frederic Weisbecker
002bb86d8d tracing/ftrace: separate events tracing and stats tracing engine
Impact: tracing's Api change

Currently, the stat tracing depends on the events tracing.
When you switch to a new tracer, the stats files of the previous tracer
will disappear. But it's more scalable to separate those two engines.
This way, we can keep the stat files of one or several tracers when we
want, without bothering of multiple tracer stat files or tracer switching.

To build/destroys its stats files, a tracer just have to call
register_stat_tracer/unregister_stat_tracer everytimes it wants to.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 12:11:37 +01:00
Shaohua Li
f00012074b ftrace, ia64: Add macro for ftrace_caller
Define FTRACE_ADDR. In IA64, a function pointer isn't a 'unsigned long' but a
'struct {unsigned long ip, unsigned long gp}'.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 12:11:18 +01:00
Shaohua Li
25aac9dc7c ftrace, ia64: explictly ignore a file in recordmcount.pl
In IA64, a function pointer isn't a 'unsigned long' but a
'struct {unsigned long ip, unsigned long gp}'. MCOUNT_ADDR is determined
at link time not compile time, so explictly ignore kernel/trace/ftrace.o
in recordmcount.pl.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 12:11:13 +01:00
Mandeep Singh Baines
baf48f6577 softlock: fix false panic which can occur if softlockup_thresh is reduced
At run-time, if softlockup_thresh is changed to a much lower value,
touch_timestamp is likely to be much older than the new softlock_thresh.

This will cause a false softlockup to be detected. If softlockup_panic
is enabled, the system will panic.

The fix is to touch all watchdogs before changing softlockup_thresh.

Signed-off-by: Mandeep Singh Baines <msb@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 11:48:07 +01:00
Lai Jiangshan
e4fa4c9701 rcu: add __cpuinit to rcu_init_percpu_data()
Impact: reduce memory footprint

add __cpuinit to rcu_init_percpu_data(), and this function's text
will be discarded after boot when !CONFIG_HOTPLUG_CPU.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-14 11:26:40 +01:00
Linus Torvalds
28839855bf Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  smp_call_function_single(): be slightly less stupid, fix #2
  lockdep, mm: fix might_fault() annotation
2009-01-13 09:02:21 -08:00
Yinghai Lu
4a046d1754 x86: arch_probe_nr_irqs
Impact: save RAM with large NR_CPUS, get smaller nr_irqs

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-12 17:39:24 -08:00
Arjan van de Ven
37a76bd4f1 async: fix __lowest_in_progress()
At 37000 feet somewhere near Greenland I woke up from a half-sleep with the
realisation that __lowest_in_progress() is buggy. After landing I checked
and there were indeed 2 problems with it; this patch fixes both:
* The order of the list checks was wrong
* The locking was not correct.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-12 16:39:58 -08:00
Linus Torvalds
12847095e9 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  kernel/sched.c: add missing forward declaration for 'double_rq_lock'
  sched: partly revert "sched debug: remove NULL checking in print_cfs_rt_rq()"
  cpumask: fix CONFIG_NUMA=y sched.c
2009-01-12 16:29:00 -08:00
Ingo Molnar
6e96281412 smp_call_function_single(): be slightly less stupid, fix #2
fix m68k build failure:

 tip/kernel/up.c: In function 'smp_call_function_single':
 tip/kernel/up.c:16: error: dereferencing pointer to incomplete type
 make[2]: *** [kernel/up.o] Error 1

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-12 16:04:37 +01:00
Ingo Molnar
01e3eb8227 Revert "sched: improve preempt debugging"
This reverts commit 7317d7b87e.

This has been reported (and bisected) by Alexey Zaytsev and
Kamalesh Babulal to produce annoying warnings during bootup
on both x86 and powerpc.

kernel_locked() is not a valid test in IRQ context (we update the
BKL's ->lock_depth and the preempt count separately and non-atomicalyy),
so we cannot put it into the generic preempt debugging checks which
can run in IRQ contexts too.

Reported-and-bisected-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Reported-and-bisected-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-12 13:00:50 +01:00
Ingo Molnar
e3ee1e1231 Merge commit 'v2.6.29-rc1' into timers/hrtimers
Conflicts:
	kernel/time/tick-common.c
2009-01-12 11:32:03 +01:00
Mike Travis
92296c6d6e cpumask, irq: non-x86 build failures
Ingo Molnar wrote:

> All non-x86 architectures fail to build:
>
> In file included from /home/mingo/tip/include/linux/random.h:11,
>                  from /home/mingo/tip/include/linux/stackprotector.h:6,
>                  from /home/mingo/tip/init/main.c:17:
> /home/mingo/tip/include/linux/irqnr.h:26:63: error: asm/irq_vectors.h: No such file or directory

Do not include asm/irq_vectors.h in generic code - it's not available
on all architectures.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-01-11 19:13:45 +01:00
Mike Travis
542d865bbe kstat: modify kstat_irqs_legacy to be variable sized
Impact: reduce memory usage.

Allocate kstat_irqs_legacy based on nr_cpu_ids to deal with this
memory usage bump when NR_CPUS bumped from 128 to 4096:

     8192   +253952    262144 +3100%  kstat_irqs_legacy(.bss)

This is only when CONFIG_SPARSE_IRQS=y.

Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-11 19:13:42 +01:00
Mike Travis
9332fccded irq: initialize nr_irqs based on nr_cpu_ids
Impact: Reduce memory usage.

This is the second half of the changes to make the irq_desc_ptrs be
variable sized based on nr_cpu_ids.  This is done by adding a new
"max_nr_irqs" macro to irq_vectors.h (and a dummy in irqnr.h) to
return a max NR_IRQS value based on NR_CPUS or nr_cpu_ids.

This necessitated moving the define of MAX_IO_APICS to a separate
file (asm/apicnum.h) so it could be included without the baggage
of the other asm/apicdef.h declarations.

Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-11 19:13:38 +01:00
Mike Travis
0fa0ebbf15 irq: allocate irq_desc_ptrs array based on nr_irqs
Impact: allocate irq_desc_ptrs in preparation for making it variable-sized.

This addresses this memory usage bump when NR_CPUS bumped from 128 to 4096:

    34816   +229376    264192  +658%  irq_desc_ptrs(.data.read_mostly)

The patch is split into two parts, the first simply allocates the
irq_desc_ptrs array.  Then next will deal with making it variable.
This is only when CONFIG_SPARSE_IRQS=y.

Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-11 19:13:35 +01:00
Mike Travis
e2f4d06545 irq: use WARN() instead of WARN_ON().
Impact: cleanup WARN msg.

Ingo requested:
> While at it, could you please also convert this to a WARN() construct
> instead? (in a separate commit)

... and it shall be done.  ;-)

Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-11 19:13:32 +01:00
Mike Travis
9594949b06 irq: change references from NR_IRQS to nr_irqs
Impact: preparation, cleanup, add KERN_INFO printk

Modify references from NR_IRQS to nr_irqs as the later will become
variable-sized based on nr_cpu_ids when CONFIG_SPARSE_IRQS=y.

Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-11 19:13:29 +01:00
Mike Travis
d38b223c86 cpumask: reduce stack usage in find_lowest_rq
Impact: reduce stack usage, cleanup

Use a cpumask_var_t in find_lowest_rq() and clean up other old
cpumask_t calls.

Signed-off-by: Mike Travis <travis@sgi.com>
2009-01-11 19:13:22 +01:00
Mike Travis
802bf931f2 cpumask: fix bug in use cpumask_var_t in irq_desc
Impact: fix bug where new irq_desc uses old cpumask pointers which are freed.

As Yinghai pointed out, init_copy_one_irq_desc() copies the old desc to
the new desc overwriting the cpumask pointers.  Since the old_desc and
the cpumask pointers are freed, then memory corruption will occur if
these old pointers are used.

Move the allocation of these pointers to after the copy.

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
2009-01-11 19:13:02 +01:00