linux/kernel/trace
Steven Rostedt b6345879cc tracing: Do not record user stack trace from NMI context
A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.

Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:

  # echo 1 > /debug/tracing/options/userstacktrace
  # echo 1 > /debug/tracing/events/lock/enable
  # while :; do ls > /dev/null ; done

Enabling the function graph tracer and looking at what was happening
I finally noticed that all cashes happened just after an NMI.

 1)               |    copy_user_handle_tail() {
 1)               |      bad_area_nosemaphore() {
 1)               |        __bad_area_nosemaphore() {
 1)               |          no_context() {
 1)               |            fixup_exception() {
 1)   0.319 us    |              search_exception_tables();
 1)   0.873 us    |            }
[...]
 1)   0.314 us    |  __rcu_read_unlock();
 1)   0.325 us    |    native_apic_mem_write();
 1)   0.943 us    |  }
 1)   0.304 us    |  rcu_nmi_exit();
[...]
 1)   0.479 us    |  find_vma();
 1)               |  bad_area() {
 1)               |    __bad_area() {

After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read the regs->ip to see where the NMI happened. In which I
found out it was here:

ffffffff8135b660 <page_fault>:
ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>

What was happening is that the NMI would happen at the place that a page
fault occurred. It would call rcu_read_lock() which was traced by
the lock events, and the user_stack_trace would run. This would trigger
a page fault inside the NMI. I do not see where the CR2 register is
saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.

The reason the while loop of ls helped trigger the bug, was that
each execution of ls would cause lots of pages to be faulted in, and
increase the chances of the race happening.

The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.

Cc: stable@kernel.org
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-03-12 20:31:49 -05:00
..
blktrace.c blktrace: perform cleanup after setup error 2010-02-28 19:47:19 +01:00
ftrace.c function-graph: Init curr_ret_stack with ret_stack 2010-03-12 20:28:02 -05:00
Kconfig Merge branch 'tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into tracing/core 2010-02-27 10:06:10 +01:00
kmemtrace.c kmemtrace: Fix up tracer registration 2009-10-01 11:53:44 +02:00
Makefile perf events: Remove CONFIG_EVENT_PROFILE 2009-12-28 10:33:06 +01:00
power-traces.c tracing/power: Remove two exports 2009-12-13 18:37:28 +01:00
ring_buffer_benchmark.c ring-buffer-benchmark: Add parameters to set produce/consumer priorities 2009-11-25 14:14:15 -05:00
ring_buffer.c ring-buffer: Move disabled check into preempt disable section 2010-03-12 20:26:56 -05:00
trace_boot.c tracing: add filter event logic to special, mmiotrace and boot tracers 2009-09-12 23:34:04 -04:00
trace_branch.c tracing: Add correct/incorrect to sort keys for branch annotation output 2010-02-09 21:35:05 -05:00
trace_clock.c tracing: Include irqflags headers from trace clock 2010-02-28 19:45:01 +01:00
trace_entries.h hw-breakpoints: Rewrite the hw-breakpoints layer on top of perf events 2009-11-08 15:34:42 +01:00
trace_event_profile.c perf: Factorize trace events raw sample buffer operations 2010-01-29 02:02:57 +01:00
trace_events_filter.c Merge branch 'perf/urgent' into perf/core 2010-01-29 10:36:22 +01:00
trace_events.c tracing: Simplify memory recycle of trace_define_field 2010-02-25 10:42:55 -05:00
trace_export.c tracing: Remove show_format and related macros from TRACE_EVENT 2010-01-06 12:08:46 -05:00
trace_functions_graph.c function-graph: Add tracing_thresh support to function_graph tracer 2010-03-05 21:20:57 -05:00
trace_functions.c tracing: switch function prints from %pf to %ps 2009-09-17 15:53:40 -04:00
trace_hw_branches.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
trace_irqsoff.c tracing: Add stack trace to irqsoff tracer 2009-12-11 13:19:51 -05:00
trace_kprobe.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 10:20:25 -08:00
trace_ksym.c ksym_tracer: Remove trace_stat 2009-12-30 07:50:50 +01:00
trace_mmiotrace.c tracing: add filter event logic to special, mmiotrace and boot tracers 2009-09-12 23:34:04 -04:00
trace_nop.c tracing/ftrace: make nop-tracer use polling wait for events on pipe 2009-03-23 09:22:15 +01:00
trace_output.c tracing: Add full state to trace_seq 2009-12-09 14:05:49 -05:00
trace_output.h tracing: consolidate code between trace_output.c and trace_function_graph.c 2009-09-11 14:24:13 -04:00
trace_printk.c tracing: Remove markers 2009-09-18 21:22:08 +02:00
trace_sched_switch.c tracing: pass around ring buffer instead of tracer 2009-09-04 18:59:39 -04:00
trace_sched_wakeup.c locking: Convert __raw_spin* functions to arch_spin* 2009-12-14 23:55:32 +01:00
trace_selftest_dynamic.c ftrace: fix dynamic ftrace selftest 2008-05-23 21:13:23 +02:00
trace_selftest.c locking: Convert __raw_spin* functions to arch_spin* 2009-12-14 23:55:32 +01:00
trace_stack.c tracing: Fix circular dead lock in stack trace 2010-02-02 10:20:18 -05:00
trace_stat.c trace_stat: Fix missing entry in stat file 2009-08-17 11:25:09 +02:00
trace_stat.h tracing/stat: Add stat_release() callback 2009-07-10 12:14:05 +02:00
trace_syscalls.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-02-28 10:20:25 -08:00
trace_sysprof.c perf events, x86/stacktrace: Make stack walking optional 2009-12-17 09:56:19 +01:00
trace_workqueue.c tracing/workqueues: Add refcnt to struct cpu_workqueue_stats 2009-07-10 12:14:07 +02:00
trace.c tracing: Do not record user stack trace from NMI context 2010-03-12 20:31:49 -05:00
trace.h function-graph: Add tracing_thresh support to function_graph tracer 2010-03-05 21:20:57 -05:00