Register/unregister tracepoint probes with struct tracepoint pointer
rather than tracepoint name.
This change, which vastly simplifies tracepoint.c, has been proposed by
Steven Rostedt. It also removes 8.8kB (mostly of text) to the vmlinux
size.
From this point on, the tracers need to pass a struct tracepoint pointer
to probe register/unregister. A probe can now only be connected to a
tracepoint that exists. Moreover, tracers are responsible for
unregistering the probe before the module containing its associated
tracepoint is unloaded.
text data bss dec hex filename
10443444 4282528 10391552 25117524 17f4354 vmlinux.orig
10434930 4282848 10391552 25109330 17f2352 vmlinux
Link: http://lkml.kernel.org/r/1396992381-23785-2-git-send-email-mathieu.desnoyers@efficios.com
CC: Ingo Molnar <mingo@kernel.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Frank Ch. Eigler <fche@redhat.com>
CC: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ SDR - fixed return val in void func in tracepoint_module_going() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge second patch-bomb from Andrew Morton:
- the rest of MM
- zram updates
- zswap updates
- exit
- procfs
- exec
- wait
- crash dump
- lib/idr
- rapidio
- adfs, affs, bfs, ufs
- cris
- Kconfig things
- initramfs
- small amount of IPC material
- percpu enhancements
- early ioremap support
- various other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits)
MAINTAINERS: update Intel C600 SAS driver maintainers
fs/ufs: remove unused ufs_super_block_third pointer
fs/ufs: remove unused ufs_super_block_second pointer
fs/ufs: remove unused ufs_super_block_first pointer
fs/ufs/super.c: add __init to init_inodecache()
doc/kernel-parameters.txt: add early_ioremap_debug
arm64: add early_ioremap support
arm64: initialize pgprot info earlier in boot
x86: use generic early_ioremap
mm: create generic early_ioremap() support
x86/mm: sparse warning fix for early_memremap
lglock: map to spinlock when !CONFIG_SMP
percpu: add preemption checks to __this_cpu ops
vmstat: use raw_cpu_ops to avoid false positives on preemption checks
slub: use raw_cpu_inc for incrementing statistics
net: replace __this_cpu_inc in route.c with raw_cpu_inc
modules: use raw_cpu_write for initialization of per cpu refcount.
mm: use raw_cpu ops for determining current NUMA node
percpu: add raw_cpu_ops
slub: fix leak of 'name' in sysfs_slab_add
...
To increase compiler portability there is <linux/compiler.h> which
provides convenience macros for various gcc constructs. Eg: __weak for
__attribute__((weak)). I've replaced all instances of gcc attributes
with the right macro in the kernel subsystem.
Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The purpose of this single series of commits from Srivatsa S Bhat (with
a small piece from Gautham R Shenoy) touching multiple subsystems that use
CPU hotplug notifiers is to provide a way to register them that will not
lead to deadlocks with CPU online/offline operations as described in the
changelog of commit 93ae4f978c (CPU hotplug: Provide lockless versions
of callback registration functions).
The first three commits in the series introduce the API and document it
and the rest simply goes through the users of CPU hotplug notifiers and
converts them to using the new method.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTQow2AAoJEILEb/54YlRxW4QQAJlYRDUzwFJzJzYhltQYuVR+
4D74XMtvXgoJfg3cwdSWvMKKpJZnA9BVN0f7Hcx9wYmgdexYUuHeZJmMNyc3S2+g
KjKBIsugvgmZhHbbLd6TJ6GBbhGT5JLt9VmSfL9zIkveInU1YHFUUqL/mxdHm4J0
BSGKjk2rN3waRJgmY+xfliFLtQjDKFwJpMuvrgtoUyfas3f4sIV43UNbqdvA/weJ
rzedxXOlKH/id4b56lj/4iIzcoL3mwvJJ7r6n0CEMsKv87z09kqR0O+69Tsq/cgs
j17CsvoJOmZGk3QTeKVMQWBsvk6aPoDu3zK83gLbQMt+qjOpSTbJLz/3HZw4/TrW
ss4nuZne1DLMGS+6hoxYbTP+6Ni//Kn+l/LrHc5jb7m1X3lMO4W2aV3IROtIE1rv
lEP1IG01NU4u9YwkVj1dyhrkSp8tLPul4SrUK8W+oNweOC5crjJV7vJbIPJgmYiM
IZN55wln0yVRtR4TX+rmvN0PixsInE8MeaVCmReApyF9pdzul/StxlBze5BKLSJD
cqo1kNPpsmdxoDucqUpQ/gSvy+IOl2qnlisB5PpV93sk7De6TFDYrGHxjYIW7jMf
StXwdCDDQhzd2Q8Kfpp895A1dbIl8rKtwA6bTU2eX+BfMVFzuMdT44cvosx1+UdQ
sWl//rg76nb13dFjvF+q
=SW7Q
-----END PGP SIGNATURE-----
Merge tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull CPU hotplug notifiers registration fixes from Rafael Wysocki:
"The purpose of this single series of commits from Srivatsa S Bhat
(with a small piece from Gautham R Shenoy) touching multiple
subsystems that use CPU hotplug notifiers is to provide a way to
register them that will not lead to deadlocks with CPU online/offline
operations as described in the changelog of commit 93ae4f978c ("CPU
hotplug: Provide lockless versions of callback registration
functions").
The first three commits in the series introduce the API and document
it and the rest simply goes through the users of CPU hotplug notifiers
and converts them to using the new method"
* tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
net/iucv/iucv.c: Fix CPU hotplug callback registration
net/core/flow.c: Fix CPU hotplug callback registration
mm, zswap: Fix CPU hotplug callback registration
mm, vmstat: Fix CPU hotplug callback registration
profile: Fix CPU hotplug callback registration
trace, ring-buffer: Fix CPU hotplug callback registration
xen, balloon: Fix CPU hotplug callback registration
hwmon, via-cputemp: Fix CPU hotplug callback registration
hwmon, coretemp: Fix CPU hotplug callback registration
thermal, x86-pkg-temp: Fix CPU hotplug callback registration
octeon, watchdog: Fix CPU hotplug callback registration
oprofile, nmi-timer: Fix CPU hotplug callback registration
intel-idle: Fix CPU hotplug callback registration
clocksource, dummy-timer: Fix CPU hotplug callback registration
drivers/base/topology.c: Fix CPU hotplug callback registration
acpi-cpufreq: Fix CPU hotplug callback registration
zsmalloc: Fix CPU hotplug callback registration
scsi, fcoe: Fix CPU hotplug callback registration
scsi, bnx2fc: Fix CPU hotplug callback registration
scsi, bnx2i: Fix CPU hotplug callback registration
...
Pull ARM changes from Russell King:
- Perf updates from Will Deacon:
- Support for Qualcomm Krait processors (run perf on your phone!)
- Support for Cortex-A12 (run perf stat on your FPGA!)
- Support for perf_sample_event_took, allowing us to automatically decrease
the sample rate if we can't handle the PMU interrupts quickly enough
(run perf record on your FPGA!).
- Basic uprobes support from David Long:
This patch series adds basic uprobes support to ARM. It is based on
patches developed earlier by Rabin Vincent. That approach of adding
hooks into the kprobes instruction parsing code was not well received.
This approach separates the ARM instruction parsing code in kprobes out
into a separate set of functions which can be used by both kprobes and
uprobes. Both kprobes and uprobes then provide their own semantic action
tables to process the results of the parsing.
- ARMv7M (microcontroller) updates from Uwe Kleine-König
- OMAP DMA updates (recently added Vinod's Ack even though they've been
sitting in linux-next for a few months) to reduce the reliance of
omap-dma on the code in arch/arm.
- SA11x0 changes from Dmitry Eremin-Solenikov and Alexander Shiyan
- Support for Cortex-A12 CPU
- Align support for ARMv6 with ARMv7 so they can cooperate better in a
single zImage.
- Addition of first AT_HWCAP2 feature bits for ARMv8 crypto support.
- Removal of IRQ_DISABLED from various ARM files
- Improved efficiency of virt_to_page() for single zImage
- Patch from Ulf Hansson to permit runtime PM callbacks to be available for
AMBA devices for suspend/resume as well.
- Finally kill asm/system.h on ARM.
* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (89 commits)
dmaengine: omap-dma: more consolidation of CCR register setup
dmaengine: omap-dma: move IRQ handling to omap-dma
dmaengine: omap-dma: move register read/writes into omap-dma.c
ARM: omap: dma: get rid of 'p' allocation and clean up
ARM: omap: move dma channel allocation into plat-omap code
ARM: omap: dma: get rid of errata global
ARM: omap: clean up DMA register accesses
ARM: omap: remove almost-const variables
ARM: omap: remove references to disable_irq_lch
dmaengine: omap-dma: cleanup errata 3.3 handling
dmaengine: omap-dma: provide register read/write functions
dmaengine: omap-dma: use cached CCR value when enabling DMA
dmaengine: omap-dma: move barrier to omap_dma_start_desc()
dmaengine: omap-dma: move clnk_ctrl setting to preparation functions
dmaengine: omap-dma: improve efficiency loading C.SA/C.EI/C.FI registers
dmaengine: omap-dma: consolidate clearing channel status register
dmaengine: omap-dma: move CCR buffering disable errata out of the fast path
dmaengine: omap-dma: provide register definitions
dmaengine: omap-dma: consolidate setup of CCR
dmaengine: omap-dma: consolidate setup of CSDP
...
But there were a few features that were added.
Uprobes now work with event triggers and multi buffers.
Uprobes have support under ftrace and perf.
The big feature is that the function tracer can now be used within the
multi buffer instances. That is, you can now trace some functions
in one buffer, others in another buffer, all functions in a third buffer
and so on. They are basically agnostic from each other. This only
works for the function tracer and not for the function graph trace,
although you can have the function graph tracer running in the top level
buffer (or any tracer for that matter) and have different function tracing
going on in the sub buffers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTOthtAAoJEKQekfcNnQGu5c8H/Ana/U+0tmksp1dbHkRHsKSH
+Fsv4Jeu8gf1NaFKHEhkUTcFtnzE6qAPV2VCrcJwXbhAhhwZm+LjrnWdoy3215S3
cQW4LftLEonh2cM36Cos74TulMEYN6XmL6dQZV+CILKQkDrWU4qJjQ64okXEkqrd
9iG3p/mSXyvJcmnyg61ALnMOhZDLsXY3djBhWBPhiTPGS6BRb9zh4Pmw6Zv0n2rJ
U93Gt/3AQrv1ybu73dUxqP0abp60oXOiWoF/R2jcbKqIM+K9RPJX79unCV3jq3u9
f+6jMlB9PgAMqQj6ihJdwxKDDuzwyrVdEPnsgvl4jarCBCtVVwhKedBaKN/KS8k=
=HdXY
-----END PGP SIGNATURE-----
Merge tag 'trace-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Most of the changes were largely clean ups, and some documentation.
But there were a few features that were added:
Uprobes now work with event triggers and multi buffers and have
support under ftrace and perf.
The big feature is that the function tracer can now be used within the
multi buffer instances. That is, you can now trace some functions in
one buffer, others in another buffer, all functions in a third buffer
and so on. They are basically agnostic from each other. This only
works for the function tracer and not for the function graph trace,
although you can have the function graph tracer running in the top
level buffer (or any tracer for that matter) and have different
function tracing going on in the sub buffers"
* tag 'trace-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (45 commits)
tracing: Add BUG_ON when stack end location is over written
tracepoint: Remove unused API functions
Revert "tracing: Move event storage for array from macro to standalone function"
ftrace: Constify ftrace_text_reserved
tracepoints: API doc update to tracepoint_probe_register() return value
tracepoints: API doc update to data argument
ftrace: Fix compilation warning about control_ops_free
ftrace/x86: BUG when ftrace recovery fails
ftrace: Warn on error when modifying ftrace function
ftrace: Remove freelist from struct dyn_ftrace
ftrace: Do not pass data to ftrace_dyn_arch_init
ftrace: Pass retval through return in ftrace_dyn_arch_init()
ftrace: Inline the code from ftrace_dyn_table_alloc()
ftrace: Cleanup of global variables ftrace_new_pgs and ftrace_update_cnt
tracing: Evaluate len expression only once in __dynamic_array macro
tracing: Correctly expand len expressions from __dynamic_array macro
tracing/module: Replace include of tracepoint.h with jump_label.h in module.h
tracing: Fix event header migrate.h to include tracepoint.h
tracing: Fix event header writeback.h to include tracepoint.h
tracing: Warn if a tracepoint is not set via debugfs
...
Pull core block layer updates from Jens Axboe:
"This is the pull request for the core block IO bits for the 3.15
kernel. It's a smaller round this time, it contains:
- Various little blk-mq fixes and additions from Christoph and
myself.
- Cleanup of the IPI usage from the block layer, and associated
helper code. From Frederic Weisbecker and Jan Kara.
- Duplicate code cleanup in bio-integrity from Gu Zheng. This will
give you a merge conflict, but that should be easy to resolve.
- blk-mq notify spinlock fix for RT from Mike Galbraith.
- A blktrace partial accounting bug fix from Roman Pen.
- Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"
* 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
blk-mq: add REQ_SYNC early
rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
blk-mq: support partial I/O completions
blk-mq: merge blk_mq_insert_request and blk_mq_run_request
blk-mq: remove blk_mq_alloc_rq
blk-mq: don't dump CPU -> hw queue map on driver load
blk-mq: fix wrong usage of hctx->state vs hctx->flags
blk-mq: allow blk_mq_init_commands() to return failure
block: remove old blk_iopoll_enabled variable
blktrace: fix accounting of partially completed requests
smp: Rename __smp_call_function_single() to smp_call_function_single_async()
smp: Remove wait argument from __smp_call_function_single()
watchdog: Simplify a little the IPI call
smp: Move __smp_call_function_single() below its safe version
smp: Consolidate the various smp_call_function_single() declensions
smp: Teach __smp_call_function_single() to check for offline cpus
smp: Remove unused list_head from csd
smp: Iterate functions through llist_for_each_entry_safe()
block: Stop abusing rq->csd.list in blk-softirq
block: Remove useless IPI struct initialization
...
Pull x86 LTO changes from Peter Anvin:
"More infrastructure work in preparation for link-time optimization
(LTO). Most of these changes is to make sure symbols accessed from
assembly code are properly marked as visible so the linker doesn't
remove them.
My understanding is that the changes to support LTO are still not
upstream in binutils, but are on the way there. This patchset should
conclude the x86-specific changes, and remaining patches to actually
enable LTO will be fed through the Kbuild tree (other than keeping up
with changes to the x86 code base, of course), although not
necessarily in this merge window"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
Kbuild, lto: Handle basic LTO in modpost
Kbuild, lto: Disable LTO for asm-offsets.c
Kbuild, lto: Add a gcc-ld script to let run gcc as ld
Kbuild, lto: add ld-version and ld-ifversion macros
Kbuild, lto: Drop .number postfixes in modpost
Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
lto: Disable LTO for sys_ni
lto: Handle LTO common symbols in module loader
lto, workaround: Add workaround for initcall reordering
lto: Make asmlinkage __visible
x86, lto: Disable LTO for the x86 VDSO
initconst, x86: Fix initconst mistake in ts5500 code
initconst: Fix initconst mistake in dcdbas
asmlinkage: Make trace_hardirqs_on/off_caller visible
asmlinkage, x86: Fix 32bit memcpy for LTO
asmlinkage Make __stack_chk_failed and memcmp visible
asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
asmlinkage: Make main_extable_sort_needed visible
asmlinkage, mutex: Mark __visible
asmlinkage: Make trace_hardirq visible
...
Pull scheduler changes from Ingo Molnar:
"Bigger changes:
- sched/idle restructuring: they are WIP preparation for deeper
integration between the scheduler and idle state selection, by
Nicolas Pitre.
- add NUMA scheduling pseudo-interleaving, by Rik van Riel.
- optimize cgroup context switches, by Peter Zijlstra.
- RT scheduling enhancements, by Thomas Gleixner.
The rest is smaller changes, non-urgnt fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
sched: Clean up the task_hot() function
sched: Remove double calculation in fix_small_imbalance()
sched: Fix broken setscheduler()
sparc64, sched: Remove unused sparc64_multi_core
sched: Remove unused mc_capable() and smt_capable()
sched/numa: Move task_numa_free() to __put_task_struct()
sched/fair: Fix endless loop in idle_balance()
sched/core: Fix endless loop in pick_next_task()
sched/fair: Push down check for high priority class task into idle_balance()
sched/rt: Fix picking RT and DL tasks from empty queue
trace: Replace hardcoding of 19 with MAX_NICE
sched: Guarantee task priority in pick_next_task()
sched/idle: Remove stale old file
sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
cpuidle/arm64: Remove redundant cpuidle_idle_call()
cpuidle/powernv: Remove redundant cpuidle_idle_call()
sched, nohz: Exclude isolated cores from load balancing
sched: Fix select_task_rq_fair() description comments
workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
...
Pull perf changes from Ingo Molnar:
"Main changes:
Kernel side changes:
- Add SNB/IVB/HSW client uncore memory controller support (Stephane
Eranian)
- Fix various x86/P4 PMU driver bugs (Don Zickus)
Tooling, user visible changes:
- Add several futex 'perf bench' microbenchmarks (Davidlohr Bueso)
- Speed up thread map generation (Don Zickus)
- Introduce 'perf kvm --list-cmds' command line option for use by
scripts (Ramkumar Ramachandra)
- Print the evsel name in the annotate stdio output, prep to fix
support outputting annotation for multiple events, not just for the
first one (Arnaldo Carvalho de Melo)
- Allow setting preferred callchain method in .perfconfig (Jiri Olsa)
- Show in what binaries/modules 'perf probe's are set (Masami
Hiramatsu)
- Support distro-style debuginfo for uprobe in 'perf probe' (Masami
Hiramatsu)
Tooling, internal changes and fixes:
- Use tid in mmap/mmap2 events to find maps (Don Zickus)
- Record the reason for filtering an address_location (Namhyung Kim)
- Apply all filters to an addr_location (Namhyung Kim)
- Merge al->filtered with hist_entry->filtered in report/hists
(Namhyung Kim)
- Fix memory leak when synthesizing thread records (Namhyung Kim)
- Use ui__has_annotation() in 'report' (Namhyung Kim)
- hists browser refactorings to reuse code accross UIs (Namhyung Kim)
- Add support for the new DWARF unwinder library in elfutils (Jiri
Olsa)
- Fix build race in the generation of bison files (Jiri Olsa)
- Further streamline the feature detection display, trimming it a bit
to show just the libraries detected, using VF=1 gets a more verbose
output, showing the less interesting feature checks as well (Jiri
Olsa).
- Check compatible symtab type before loading dso (Namhyung Kim)
- Check return value of filename__read_debuglink() (Stephane Eranian)
- Move some hashing and fs related code from tools/perf/util/ to
tools/lib/ so that it can be used by more tools/ living utilities
(Borislav Petkov)
- Prepare DWARF unwinding code for using an elfutils alternative
unwinding library (Jiri Olsa)
- Fix DWARF unwind max_stack processing (Jiri Olsa)
- Add dwarf unwind 'perf test' entry (Jiri Olsa)
- 'perf probe' improvements including memory leak fixes, sharing the
intlist class with other tools, uprobes/kprobes code sharing and
use of ref_reloc_sym (Masami Hiramatsu)
- Shorten sample symbol resolving by adding cpumode to struct
addr_location (Arnaldo Carvalho de Melo)
- Fix synthesizing mmaps for threads (Don Zickus)
- Fix invalid output on event group stdio report (Namhyung Kim)
- Fixup header alignment in 'perf sched latency' output (Ramkumar
Ramachandra)
- Fix off-by-one error in 'perf timechart record' argv handling
(Ramkumar Ramachandra)
Tooling, cleanups:
- Remove unused thread__find_map function (Jiri Olsa)
- Remove unused simple_strtoul() function (Ramkumar Ramachandra)
Tooling, documentation updates:
- Update function names in debug messages (Ramkumar Ramachandra)
- Update some code references in design.txt (Ramkumar Ramachandra)
- Clarify load-latency information in the 'perf mem' docs (Andi
Kleen)
- Clarify x86 register naming in 'perf probe' docs (Andi Kleen)"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (96 commits)
perf tools: Remove unused simple_strtoul() function
perf tools: Update some code references in design.txt
perf evsel: Update function names in debug messages
perf tools: Remove thread__find_map function
perf annotate: Print the evsel name in the stdio output
perf report: Use ui__has_annotation()
perf tools: Fix memory leak when synthesizing thread records
perf tools: Use tid in mmap/mmap2 events to find maps
perf report: Merge al->filtered with hist_entry->filtered
perf symbols: Apply all filters to an addr_location
perf symbols: Record the reason for filtering an address_location
perf sched: Fixup header alignment in 'latency' output
perf timechart: Fix off-by-one error in 'record' argv handling
perf machine: Factor machine__find_thread to take tid argument
perf tools: Speed up thread map generation
perf kvm: introduce --list-cmds for use by scripts
perf ui hists: Pass evsel to hpp->header/width functions explicitly
perf symbols: Introduce thread__find_cpumode_addr_location
perf session: Change header.misc dump from decimal to hex
perf ui/tui: Reuse generic __hpp__fmt() code
...
While working on my tutorial for 2014 Linux Collaboration Summit
I found that the traceon trigger did not work when conditions were
used. The other triggers worked fine though. Looking into it, it
is because of the way the triggers use the ring buffer to store
the fields it will use for the condition. But if tracing is off, nothing
is stored in the buffer, and the tracepoint exits before calling the
trigger to test the condition. This is fine for all the triggers that
only work when tracing is on, but for traceon trigger that is to
work when tracing is off, nothing happens.
The fix is simple, just use a temp ring buffer to record the event
if tracing is off and the event has a trace event conditional trigger
enabled. The rest of the tracepoint code will work just fine, but
the tracepoint wont be recorded in the other buffers.
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It is difficult to detect a stack overrun when it
actually occurs.
We have observed that this type of corruption is often
silent and can go unnoticed. Once the corrupted region
is examined, the outcome is undefined and often
results in sporadic system crashes.
When the stack tracing feature is enabled, let's check
for this condition and take appropriate action.
Note: init_task doesn't get its stack end location
set to STACK_END_MAGIC.
Link: http://lkml.kernel.org/r/1395669837-30209-1-git-send-email-atomlin@redhat.com
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I originally wrote commit 35bb4399bd to shrink the size of the overhead of
tracepoints by several kilobytes. Later, I received a patch from Vaibhav
Nagarnaik that fixed a bug in the same code that this commit touches. Not
only did it fix a bug, it also removed code and shrunk the size of the
overhead of trace events even more than this commit did.
Since this commit is scheduled for 3.15 and Vaibhav's patch is already in
mainline, I need to revert this patch in order to keep it from conflicting
with Vaibhav's patch. Not to mention, Vaibhav's patch makes this patch
obsolete.
Link: http://lkml.kernel.org/r/20140320225637.0226041b@gandalf.local.home
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In event format strings, the array size is reported in two locations.
One in array subscript and then via the "size:" attribute. The values
reported there have a mismatch.
For e.g., in sched:sched_switch the prev_comm and next_comm character
arrays have subscript values as [32] where as the actual field size is
16.
name: sched_switch
ID: 301
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[32]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[32]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
After bisection, the following commit was blamed:
92edca0 tracing: Use direct field, type and system names
This commit removes the duplication of strings for field->name and
field->type assuming that all the strings passed in
__trace_define_field() are immutable. This is not true for arrays, where
the type string is created in event_storage variable and field->type for
all array fields points to event_storage.
Use __stringify() to create a string constant for the type string.
Also, get rid of event_storage and event_storage_mutex that are not
needed anymore.
also, an added benefit is that this reduces the overhead of events a bit more:
text data bss dec hex filename
8424787 2036472 1302528 11763787 b3804b vmlinux
8420814 2036408 1302528 11759750 b37086 vmlinux.patched
Link: http://lkml.kernel.org/r/1392349908-29685-1-git-send-email-vnagarnaik@google.com
Cc: Laurent Chavey <chavey@google.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Instead, the correct and race-free way of performing the callback
registration is:
cpu_notifier_register_begin();
for_each_online_cpu(cpu)
init_cpu(cpu);
/* Note the use of the double underscored version of the API */
__register_cpu_notifier(&foobar_cpu_notifier);
cpu_notifier_register_done();
Fix the tracing ring-buffer code by using this latter form of callback
registration.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Suggested change from Oleg Nesterov. Fixes incomplete dependencies
for uprobes feature.
Signed-off-by: David A. Long <dave.long@linaro.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
With CONFIG_DYNAMIC_FTRACE=n, I see a warning:
kernel/trace/ftrace.c:240:13: warning: 'control_ops_free' defined but not used
static void control_ops_free(struct ftrace_ops *ops)
^
Move that function around to an already existing #ifdef
CONFIG_DYNAMIC_FTRACE block as the function is used solely from the
dynamic function tracing functions.
Link: http://lkml.kernel.org/r/1394484131-5107-1-git-send-email-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Recent issues with user space callchains processing within
page fault handler tracing showed as Peter said 'there's
just too much fail surface'.
The user space stack dump is just another source of the this issue.
Related list discussions:
http://marc.info/?t=139302086500001&r=1&w=2http://marc.info/?t=139301437300003&r=1&w=2
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393775800-13524-3-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent issues with user space callchains processing within
page fault handler tracing showed as Peter said 'there's
just too much fail surface'.
Related list discussions:
http://marc.info/?t=139302086500001&r=1&w=2http://marc.info/?t=139301437300003&r=1&w=2
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393775800-13524-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should print some warning and kill ftrace functionality when the ftrace
function is not set correctly. Otherwise, ftrace might do crazy things without
an explanation. The error value has been ignored so far.
Note that an error that happens during updating all the traced calls is handled
in ftrace_replace_code(). We print more details about the particular
failing address via ftrace_bug() there.
Link: http://lkml.kernel.org/r/1393258342-29978-3-git-send-email-pmladek@suse.cz
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the data parameter is not really used by any ftrace_dyn_arch_init,
remove that from ftrace_dyn_arch_init. This also removes the addr
local variable from ftrace_init which is now unused.
Note the documentation was imprecise as it did not suggest to set
(*data) to 0.
Link: http://lkml.kernel.org/r/1393268401-24379-4-git-send-email-jslaby@suse.cz
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
No architecture uses the "data" parameter in ftrace_dyn_arch_init() in any
way, it just sets the value to 0. And this is used as a return value
in the caller -- ftrace_init, which just checks the retval against
zero.
Note there is also "return 0" in every ftrace_dyn_arch_init. So it is
enough to check the retval and remove all the indirect sets of data on
all archs.
Link: http://lkml.kernel.org/r/1393268401-24379-3-git-send-email-jslaby@suse.cz
Cc: linux-arch@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function used to do allocations some time ago. This no longer
happens and it only checks the count and prints some info. This patch
inlines the body to the only caller. There are two reasons:
* the name of the function was misleading
* it's clear what is going on in ftrace_init now
Link: http://lkml.kernel.org/r/1393268401-24379-2-git-send-email-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some of them can be local to functions, so make them local and pass
them as parameters where needed:
* __start_mcount_loc+__stop_mcount_loc are local to ftrace_init
* ftrace_new_pgs -> new_pgs/start_pg
* ftrace_update_cnt -> local update_cnt in ftrace_update_code
Link: http://lkml.kernel.org/r/1393268401-24379-1-git-send-email-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions that assign the contents for the ftrace events are
defined by the TRACE_EVENT() macros. Each event has its own unique
way to assign data to its buffer. When you have over 500 events,
that means there's 500 functions assigning data uniquely for each
event (not really that many, as DECLARE_EVENT_CLASS() and multiple
DEFINE_EVENT()s will only need a single function).
By making helper functions in the core kernel to do some of the work
instead, we can shrink the size of the kernel down a bit.
With a kernel configured with 502 events, the change in size was:
text data bss dec hex filename
12987390 1913504 9785344 24686238 178ae9e /tmp/vmlinux
12959102 1913504 9785344 24657950 178401e /tmp/vmlinux.patched
That's a total of 28288 bytes, which comes down to 56 bytes per event.
Link: http://lkml.kernel.org/r/20120810034708.370808175@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code that shows array fields for events is defined for all events.
This can add up quite a bit when you have over 500 events.
By making helper functions in the core kernel to do the work
instead, we can shrink the size of the kernel down a bit.
With a kernel configured with 502 events, the change in size was:
text data bss dec hex filename
12990946 1913568 9785344 24689858 178bcc2 /tmp/vmlinux
12987390 1913504 9785344 24686238 178ae9e /tmp/vmlinux.patched
That's a total of 3556 bytes, which comes down to 7 bytes per event.
Although it's not much, this code is just called at initialization of
the events.
Link: http://lkml.kernel.org/r/20120810034708.084036335@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code for trace events to format the raw recorded event data
into human readable format in the 'trace' file is repeated for every
event in the system. When you have over 500 events, this can add up
quite a bit.
By making helper functions in the core kernel to do the work
instead, we can shrink the size of the kernel down a bit.
With a kernel configured with 502 events, the change in size was:
text data bss dec hex filename
12991007 1913568 9785344 24689919 178bcff /tmp/vmlinux.orig
12990946 1913568 9785344 24689858 178bcc2 /tmp/vmlinux.patched
Note, this version does not save as much as the version of this patch
I had a few years ago. That is because in the mean time, commit
f71130de5c ("tracing: Add a helper function for event print functions")
did a lot of the work my original patch did. But this change helps
slightly, and is part of a larger clean up to reduce the size much further.
Link: http://lkml.kernel.org/r/20120810034707.378538034@goodmis.org
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:
C R 232 + 240 [0]
C R 240 + 232 [0]
C R 248 + 224 [0]
C R 256 + 216 [0]
but should be:
C R 232 + 8 [0]
C R 240 + 8 [0]
C R 248 + 8 [0]
C R 256 + 8 [0]
Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.
This patch takes into account real completion size of the request and
fixes wrong completion accounting.
Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
If a module fails to add its tracepoints due to module tainting, do not
create the module event infrastructure in the debugfs directory. As the events
will not work and worse yet, they will silently fail, making the user wonder
why the events they enable do not display anything.
Having a warning on module load and the events not visible to the users
will make the cause of the problem much clearer.
Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org
Fixes: 6d723736e4 "tracing/events: add support for modules to TRACE_EVENT"
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org # 2.6.31+
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use MAX_NICE instead of the value 19 for ring_buffer_benchmark.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1393251121-25534-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ENABLED flag needs to be cleared when a ftrace_ops is unregistered
otherwise it wont be able to be registered again.
This is only for static tracing and does not affect DYNAMIC_FTRACE at
all.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Being able to change the trace clock at boot can be advantageous if
you need a better source of when things happen across CPUs. The default
trace clock is the fastest, but it uses local clocks which may not be
synced across CPUs and it does not let you know when events took place
with respect to events on other CPUs.
The global trace clock can help in this case, and if you do not care
about timings, the counter "clock" is the best, as that is just a simple
atomic counter that is incremented for every event.
Usage is to add "trace_clock=counter" on the kernel command line. You
can replace counter with "global" or any of the clocks listed in
/sys/kernel/debug/tracing/trace_clock
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Appreciated-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems there's no reason to prevent mixed used of ftrace and perf
for a single uprobe event. At least the kprobes already support it.
Link: http://lkml.kernel.org/r/1389946120-19610-6-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add support for event triggering to uprobes. This is same as kprobes
support added by Tom (plus cleanup by Steven).
Link: http://lkml.kernel.org/r/1389946120-19610-5-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Support multi-buffer on uprobe-based dynamic events by
using ftrace_event_file.
This patch is based kprobe-based dynamic events multibuffer
support work initially, commited by Masami(commit 41a7dd420c),
but revised as below:
Oleg changed the kprobe-based multibuffer design from
array-pointers of ftrace_event_file into simple list,
so this patch also change to the list design.
rcu_read_lock/unlock added into uprobe_trace_func/uretprobe_trace_func,
to synchronize with ftrace_event_file list add and delete.
Even though we allow multi-uprobes instances now,
but TP_FLAG_PROFILE/TP_FLAG_TRACE are still mutually exclusive
in probe_event_enable currently, this means we cannot allow
one user is using uprobe-tracer, and another user is using
perf-probe on same uprobe concurrently.
(Perhaps this will be fix in future, kprobe don't have this
limitation now)
Link: http://lkml.kernel.org/r/1389946120-19610-4-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A single uprobe event might serve different users like ftrace and
perf. And this is especially important for upcoming multi buffer
support. But in this case it'll fetch (same) data from userspace
multiple times. So move it to the beginning of the dispatcher
function and reuse it for each users.
Link: http://lkml.kernel.org/r/1389946120-19610-3-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The uprobe_{trace,perf}_print functions are misnomers since what they
do is not printing. There's also a real print function named
print_uprobe_event() so they'll only increase confusion IMHO.
Rename them with double underscores to follow convention of kprobe.
Link: http://lkml.kernel.org/r/1389946120-19610-2-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Create a "set_ftrace_filter" and "set_ftrace_notrace" files in the instance
directories to let users filter of functions to trace for the given instance.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for having the function tracing instances be able to
filter on functions, the generic filter functions must first be
converted to take in the global_ops as a parameter.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow instances (sub-buffers) to enable function tracing.
Each instance will have its own function tracing capability.
For now, instances will not have function stack tracing, or will
they be able to pick and choose what functions they can trace.
Picking and choosing their own functions will come later.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As tracers will soon be used by instances, the tracer enabled field
needs to be converted to a counter instead of a boolean.
This counter is protected by the trace_types_lock mutex.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When an instance is about to be deleted, make sure the tracer
is set to nop. If it isn't reset the tracer and set it to the nop
tracer, otherwise memory leaks and bad pointers may result.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If global_ops function is being called directly, instead of the global_ops
list function, set the global_ops private to be the same as the ops private
that's being called directly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the tracers (function, function_graph, irqsoff, etc) can only
be used by the top level tracing directory (not for instances).
This sets up the infrastructure to allow instances to be able to
run a separate tracer apart from the what the top level tracing is
doing.
As tracers need to adapt for being used by instances, the tracers
must flag if they can be used by instances or not. Currently only the
'nop' tracer can be used by all instances.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As options (flags) may affect instances instead of being global
the flag_changed() callbacks need to receive the trace_array descriptor
of the instance they will be modifying.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As options (flags) may affect instances instead of being global
the set_flag() callbacks need to receive the trace_array descriptor
of the instance they will be modifying.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Each sub-buffer (buffer page) has a full 64 bit timestamp. The events on
that page use a 27 bit delta against that timestamp in order to save on
bits written to the ring buffer. If the time between events is larger than
what the 27 bits can hold, a "time extend" event is added to hold the
entire 64 bit timestamp again and the events after that hold a delta from
that timestamp.
As a "time extend" is always paired with an event, it is logical to just
allocate the event with the time extend, to make things a bit more efficient.
Unfortunately, when the pairing code was written, it removed the "delta = 0"
from the first commit on a page, causing the events on the page to be
slightly skewed.
Fixes: 69d1b839f7 "ring-buffer: Bind time extend and data events together"
Cc: stable@vger.kernel.org # 2.6.37+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
the new features added to 3.14.
The third patch is a minor bugfix to the trace_puts() functions that
will crash the system if a developer adds one before the tracing system
is setup. It also affects trace_printk() if it has no arguments, as
the code will convert it to a trace_puts() as well. Note, this bug
will not affect unmodified kernels, as trace_printk() and trace_puts()
should only be used by developers for testing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJS5oAkAAoJEKQekfcNnQGuOfkH/1ScwoRslERCCjhPEZu958bG
VugNvliB/uDwq3qkmBcncMHwqkFBgJay9ieah+JFeK6x3G6mO/uf+UhN5cmLzVBh
vA5dKR2zW6WTxKYSvXYapD7OzC7R+V6CLvpMl5WMZ1t3ESQhR6gUrpPdigecBWs9
015rMVA2xXjNnHNDM1nem1PhnMba78A/N98lFErYGpvVBjkCpB8mSt8adds9bYp8
a83P6tGUfXvsYO3sOqqBnOOqcfzCjjJbmr94v/F+SmLyuxuV0tVzBkIwXkKTNhEK
bQZj9Pfe+1oIkldXxstn5jWaOvI6RlAGW4b4qXt30AgrGRSyEo5xpvfLfyNUif4=
=N3kq
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"The first two patches fix the debugfs README file to reflect better
the new features added to 3.14.
The third patch is a minor bugfix to the trace_puts() functions that
will crash the system if a developer adds one before the tracing
system is setup. It also affects trace_printk() if it has no
arguments, as the code will convert it to a trace_puts() as well.
Note, this bug will not affect unmodified kernels, as trace_printk()
and trace_puts() should only be used by developers for testing"
* tag 'trace-fixes-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Check if tracing is enabled in trace_puts()
tracing: Fix formatting of trace README file
tracing/README: Add event file usage to tracing mini-HOWTO
If trace_puts() is used very early in boot up, it can crash the machine
if it is called before the ring buffer is allocated. If a trace_printk()
is used with no arguments, then it will be converted into a trace_puts()
and suffer the same fate.
Cc: stable@vger.kernel.org # 3.10+
Fixes: 09ae72348e "tracing: Add trace_puts() for even faster trace_printk() tracing"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix the formatting of the README file in the trace debugfs to fit in
an 80 character window.
Also add a comment about the event trigger counter with regards to
traceon and traceoff.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It would be useful to have a cheat-sheet for everything under
tracing/events/ alongside the existing text describing the other files
in the tracing/ dir.
Add short descriptions of the directories and files under events/
along with examples, similar to the existing text for the other files
in tracing/.
Also clean up a few minor alignment problems noticed when adding the
new text.
Link: http://lkml.kernel.org/r/1389993104.3040.445.camel@empanada
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
triggers by Tom Zanussi. A trigger is a way to enable an action when an
event is hit. The actions are:
o trace on/off - enable or disable tracing
o snapshot - save the current trace buffer in the snapshot
o stacktrace - dump the current stack trace to the ringbuffer
o enable/disable events - enable or disable another event
Namhyung Kim added updates to the tracing uprobes code. Having the
uprobes add support for fetch methods.
The rest are various bug fixes with the new code, and minor ones for
the old code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJS3Z9fAAoJEKQekfcNnQGuFf0H/0CteaN+BJjpif6Tnxia15Sp
pcftzU0lgqfNzsfitmbjiVTgXWqCghoZo8UI9tQZvBZ9wmDIxeXQR73uoBgVlSCQ
ovyBO/R8r+lq+7EsDCwntZvrLbcdn6s/jzoruRvt7r35ghK5pH81DNR1BOzTQBhW
x+361Xtc13aok7N7JN8KR96VDUP9f8KU6PWqJ5lgS2Zl+wbVw6b0p8OV8IMCHczP
MdYrx8y4Jv4QWW7rMShAAVBe9qJQ56JWiWA17ysa4kY8BkKQ7QtlEFr+r1YY0nX5
67brXiL8u0NFzRx5y2VRpGc25BbImnVBFpoLQ5Itluq9OdZE3aOQubzXlY70R6g=
=Hkho
-----END PGP SIGNATURE-----
Merge tag 'trace-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This pull request has a new feature to ftrace, namely the trace event
triggers by Tom Zanussi. A trigger is a way to enable an action when
an event is hit. The actions are:
o trace on/off - enable or disable tracing
o snapshot - save the current trace buffer in the snapshot
o stacktrace - dump the current stack trace to the ringbuffer
o enable/disable events - enable or disable another event
Namhyung Kim added updates to the tracing uprobes code. Having the
uprobes add support for fetch methods.
The rest are various bug fixes with the new code, and minor ones for
the old code"
* tag 'trace-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (38 commits)
tracing: Fix buggered tee(2) on tracing_pipe
tracing: Have trace buffer point back to trace_array
ftrace: Fix synchronization location disabling and freeing ftrace_ops
ftrace: Have function graph only trace based on global_ops filters
ftrace: Synchronize setting function_trace_op with ftrace_trace_function
tracing: Show available event triggers when no trigger is set
tracing: Consolidate event trigger code
tracing: Fix counter for traceon/off event triggers
tracing: Remove double-underscore naming in syscall trigger invocations
tracing/kprobes: Add trace event trigger invocations
tracing/probes: Fix build break on !CONFIG_KPROBE_EVENT
tracing/uprobes: Add @+file_offset fetch method
uprobes: Allocate ->utask before handler_chain() for tracing handlers
tracing/uprobes: Add support for full argument access methods
tracing/uprobes: Fetch args before reserving a ring buffer
tracing/uprobes: Pass 'is_return' to traceprobe_parse_probe_arg()
tracing/probes: Implement 'memory' fetch method for uprobes
tracing/probes: Add fetch{,_size} member into deref fetch method
tracing/probes: Move 'symbol' fetch method to kprobes
tracing/probes: Implement 'stack' fetch method for uprobes
...
In kernel/trace/trace.c we have this:
static void tracing_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
__free_page(buf->page);
}
static const struct pipe_buf_operations tracing_pipe_buf_ops = {
.can_merge = 0,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
.confirm = generic_pipe_buf_confirm,
.release = tracing_pipe_buf_release,
.steal = generic_pipe_buf_steal,
.get = generic_pipe_buf_get,
};
with
void generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
{
page_cache_get(buf->page);
}
and I don't see anything that would've prevented tee(2) called on the pipe
that got stuff spliced into it from that sucker. ->ops->get() will be
called, then buf gets copied into target pipe's ->bufs[] and eventually
readers get to both copies of the buffer. With
get_page(page)
look at that page
__free_page(page)
look at that page
__free_page(page)
which is not a good thing, to put it mildly. AFAICS, that ought to use
the normal generic_pipe_buf_release() (aka page_cache_release(buf->page)),
shouldn't it?
[
SDR - As trace_pipe just allocates the page with alloc_page(GFP_KERNEL),
and doesn't do anything special with it (no LRU logic). The __free_page()
should be fine, as it wont actually free a page with reference count.
Maybe there's a chance to leak memory? Anyway, This change is at a minimum
good for being symmetric with generic_pipe_buf_get, it is fine to add.
]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ SDR - Removed no longer used tracing_pipe_buf_release ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace buffer has a descriptor pointer that goes back to the trace
array. But it was never assigned. Luckily, nothing uses it (yet), but
it will in the future.
Although nothing currently uses this, if any of the new features get
backported to older kernels, and because this is such a simple change,
I'm marking it for stable too.
Cc: stable@vger.kernel.org # v3.10+
Fixes: 12883efb67 "tracing: Consolidate max_tr into main trace_array structure"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The synchronization needed after ftrace_ops are unregistered must happen
after the callback is disabled from becing called by functions.
The current location happens after the function is being removed from the
internal lists, but not after the function callbacks were disabled, leaving
the functions susceptible of being called after their callbacks are freed.
This affects perf and any externel users of function tracing (LTTng and
SystemTap).
Cc: stable@vger.kernel.org # 3.0+
Fixes: cdbe61bfe7 "ftrace: Allow dynamically allocated function tracers"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Doing some different tests, I discovered that function graph tracing, when
filtered via the set_ftrace_filter and set_ftrace_notrace files, does
not always keep with them if another function ftrace_ops is registered
to trace functions.
The reason is that function graph just happens to trace all functions
that the function tracer enables. When there was only one user of
function tracing, the function graph tracer did not need to worry about
being called by functions that it did not want to trace. But now that there
are other users, this becomes a problem.
For example, one just needs to do the following:
# cd /sys/kernel/debug/tracing
# echo schedule > set_ftrace_filter
# echo function_graph > current_tracer
# cat trace
[..]
0) | schedule() {
------------------------------------------
0) <idle>-0 => rcu_pre-7
------------------------------------------
0) ! 2980.314 us | }
0) | schedule() {
------------------------------------------
0) rcu_pre-7 => <idle>-0
------------------------------------------
0) + 20.701 us | }
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
# cat trace
[..]
1) + 20.825 us | }
1) + 21.651 us | }
1) + 30.924 us | } /* SyS_ioctl */
1) | do_page_fault() {
1) | __do_page_fault() {
1) 0.274 us | down_read_trylock();
1) 0.098 us | find_vma();
1) | handle_mm_fault() {
1) | _raw_spin_lock() {
1) 0.102 us | preempt_count_add();
1) 0.097 us | do_raw_spin_lock();
1) 2.173 us | }
1) | do_wp_page() {
1) 0.079 us | vm_normal_page();
1) 0.086 us | reuse_swap_page();
1) 0.076 us | page_move_anon_rmap();
1) | unlock_page() {
1) 0.082 us | page_waitqueue();
1) 0.086 us | __wake_up_bit();
1) 1.801 us | }
1) 0.075 us | ptep_set_access_flags();
1) | _raw_spin_unlock() {
1) 0.098 us | do_raw_spin_unlock();
1) 0.105 us | preempt_count_sub();
1) 1.884 us | }
1) 9.149 us | }
1) + 13.083 us | }
1) 0.146 us | up_read();
When the stack tracer was enabled, it enabled all functions to be traced, which
now the function graph tracer also traces. This is a side effect that should
not occur.
To fix this a test is added when the function tracing is changed, as well as when
the graph tracer is enabled, to see if anything other than the ftrace global_ops
function tracer is enabled. If so, then the graph tracer calls a test trampoline
that will look at the function that is being traced and compare it with the
filters defined by the global_ops.
As an optimization, if there's no other function tracers registered, or if
the only registered function tracers also use the global ops, the function
graph infrastructure will call the registered function graph callback directly
and not go through the test trampoline.
Cc: stable@vger.kernel.org # 3.3+
Fixes: d2d45c7a03 "tracing: Have stack_tracer use a separate list of functions"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-coded is needed, raising all but trivial issues, that
needs (according to us) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).
This is under development, in the meanwhile, as a temporary solution,
what this commits does is:
- ensure a pi-lock owner with waiters is never throttled down. Instead,
when it runs out of runtime, it immediately gets replenished and it's
deadline is postponed;
- the scheduling parameters (relative deadline and default runtime)
used for that replenishments --during the whole period it holds the
pi-lock-- are the ones of the waiting task with earliest deadline.
Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.
We would stress the fact that this is only a surely needed, all but
clean solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc.. are welcome! :-)
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Added !RT_MUTEXES build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is very likely that systems that wants/needs to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.
For this reason a new version of the scheduling wakeup latency,
called "wakeup_dl", is introduced.
As a consequence of applying this patch there will be three wakeup
latency tracer:
* "wakeup", that deals with all tasks in the system;
* "wakeup_rt", that deals with -rt and -deadline tasks only;
* "wakeup_dl", that deals with -deadline tasks only.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-9-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ftrace_trace_function is a variable that holds what function will be called
directly by the assembly code (mcount). If just a single function is
registered and it handles recursion itself, then the assembly will call that
function directly without any helper function. It also passes in the
ftrace_op that was registered with the callback. The ftrace_op to send is
stored in the function_trace_op variable.
The ftrace_trace_function and function_trace_op needs to be coordinated such
that the called callback wont be called with the wrong ftrace_op, otherwise
bad things can happen if it expected a different op. Luckily, there's no
callback that doesn't use the helper functions that requires this. But
there soon will be and this needs to be fixed.
Use a set_function_trace_op to store the ftrace_op to set the
function_trace_op to when it is safe to do so (during the update function
within the breakpoint or stop machine calls). Or if dynamic ftrace is not
being used (static tracing) then we have to do a bit more synchronization
when the ftrace_trace_function is set as that takes affect immediately
(as oppose to dynamic ftrace doing it with the modification of the trampoline).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently there's no way to know what triggers exist on a kernel without
looking at the source of the kernel or randomly trying out triggers.
Instead of creating another file in the debugfs system, simply show
what available triggers are there when cat'ing the trigger file when
it has no events:
[root /sys/kernel/debug/tracing]# cat events/sched/sched_switch/trigger
# Available triggers:
# traceon traceoff snapshot stacktrace enable_event disable_event
This stays consistent with other debugfs files where meta data like
this is always proceeded with a '#' at the start of the line so that
tools can strip these out.
Link: http://lkml.kernel.org/r/20140107103548.0a84536d@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The event trigger code that checks for callback triggers before and
after recording of an event has lots of flags checks. This code is
duplicated throughout the ftrace events, kprobes and system calls.
They all do the exact same checks against the event flags.
Added helper functions ftrace_trigger_soft_disabled(),
event_trigger_unlock_commit() and event_trigger_unlock_commit_regs()
that consolidated the code and these are used instead.
Link: http://lkml.kernel.org/r/20140106222703.5e7dbba2@gandalf.local.home
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The counters for the traceon and traceoff are only suppose to decrement
when the trigger enables or disables tracing. It is not suppose to decrement
every time the event is hit.
Only decrement the counter if the trigger actually did something.
Link: http://lkml.kernel.org/r/20140106223124.0e5fd0b4@gandalf.local.home
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's no reason to use double-underscores for any variable name in
ftrace_syscall_enter()/exit(), since those functions aren't generated
and there's no need to avoid namespace collisions as with the event
macros, which is where the original invocation code came from.
Link: http://lkml.kernel.org/r/0b489c9d1f7ee315cff60fa0e4c2b433ade8ae0d.1389036657.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add code to the kprobe/kretprobe event functions that will invoke any
event triggers associated with a probe's ftrace_event_file.
The code to do this is very similar to the invocation code already
used to invoke the triggers associated with static events and
essentially replaces the existing soft-disable checks with a superset
that preserves the original behavior but adds the bits needed to
support event triggers.
Link: http://lkml.kernel.org/r/f2d49f157b608070045fdb26c9564d5a05a5a7d0.1389036657.git.tom.zanussi@linux.intel.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When kprobe-based dynamic event tracer is not enabled, it caused
following build error:
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c8dd): undefined reference to `fetch_symbol_u8'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c8e9): undefined reference to `fetch_symbol_u16'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c8f5): undefined reference to `fetch_symbol_u32'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c901): undefined reference to `fetch_symbol_u64'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c909): undefined reference to `fetch_symbol_string'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c913): undefined reference to `fetch_symbol_string_size'
...
It was due to the fetch methods are referred from CHECK_FETCH_FUNCS
macro and since it was only defined in trace_kprobe.c. Move NULL
definition of such fetch functions to the header file.
Note, it also requires CONFIG_BRANCH_PROFILING enabled to trigger
this failure as well. This is because the "fetch_symbol_*" variables
are referenced in a "else if" statement that will only call
update_symbol_cache(), which is a static inline stub function
when CONFIG_KPROBE_EVENT is not enabled. gcc is smart enough
to optimize this "else if" out and that also removes the code that
references the undefined variables.
But when BRANCH_PROFILING is enabled, it fools gcc into keeping
the if statement around and thus references the undefined symbols
and fails to build.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Enable to fetch data from a file offset. Currently it only supports
fetching from same binary uprobe set. It'll translate the file offset
to a proper virtual address in the process.
The syntax is "@+OFFSET" as it does similar to normal memory fetching
(@ADDR) which does no address translation.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Enable to fetch other types of argument for the uprobes. IOW, we can
access stack, memory, deref, bitfield and retval from uprobes now.
The format for the argument types are same as kprobes (but @SYMBOL
type is not supported for uprobes), i.e:
@ADDR : Fetch memory at ADDR
$stackN : Fetch Nth entry of stack (N >= 0)
$stack : Fetch stack address
$retval : Fetch return value
+|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address
Note that the retval only can be used with uretprobes.
Original-patch-by: Hyeoncheol Lee <cheol.lee@lge.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Hyeoncheol Lee <cheol.lee@lge.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Fetching from user space should be done in a non-atomic context. So
use a per-cpu buffer and copy its content to the ring buffer
atomically. Note that we can migrate during accessing user memory
thus use a per-cpu mutex to protect concurrent accesses.
This is needed since we'll be able to fetch args from an user memory
which can be swapped out. Before that uprobes could fetch args from
registers only which saved in a kernel space.
While at it, use __get_data_size() and store_trace_args() to reduce
code duplication. And add struct uprobe_cpu_buffer and its helpers as
suggested by Oleg.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Currently uprobes don't pass is_return to the argument parser so that
it cannot make use of "$retval" fetch method since it only works for
return probes.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Use separate method to fetch from memory. Move existing functions to
trace_kprobe.c and make them static. Also add new memory fetch
implementation for uprobes.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The deref fetch methods access a memory region but it assumes that
it's a kernel memory since uprobes does not support them.
Add ->fetch and ->fetch_size member in order to provide a proper
access methods for supporting uprobes.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Hyeoncheol Lee <cheol.lee@lge.com>
[namhyung@kernel.org: Split original patch into pieces as requested]
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Move existing functions to trace_kprobe.c and add NULL entries to the
uprobes fetch type table. I don't make them static since some generic
routines like update/free_XXX_fetch_param() require pointers to the
functions.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Use separate method to fetch from stack. Move existing functions to
trace_kprobe.c and make them static. Also add new stack fetch
implementation for uprobes.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Use separate fetch_type_table for kprobes and uprobes. It currently
shares all fetch methods but some of them will be implemented
differently later.
This is not to break build if [ku]probes is configured alone (like
!CONFIG_KPROBE_EVENT and CONFIG_UPROBE_EVENT). So I added '__weak'
to the table declaration so that it can be safely omitted when it
configured out.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Move fetch function helper macros/functions to the header file and
make them external. This is preparation of supporting uprobe fetch
table in next patch.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The set_print_fmt() functions are implemented almost same for
[ku]probes. Move it to a common place and get rid of the duplication.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The __get_data_size() and store_trace_args() will be used by uprobes
too. Move them to a common location.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Convert struct trace_uprobe to make use of the common trace_probe
structure.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
There are functions that can be shared to both of kprobes and uprobes.
Separate common data structure to struct trace_probe and use it from
the shared functions.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The print format of s32 type was "ld" and it's casted to "long". So
it turned out to print 4294967295 for "-1" on 64-bit systems. Not
sure whether it worked well on 32-bit systems.
Anyway, it doesn't need to have cast argument at all since it already
casted using type pointer - just get rid of it. Thanks to Oleg for
pointing that out.
And print 0x prefix for unsigned type as it shows hex numbers.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The uprobe syntax requires an offset after a file path not a symbol.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The filter field of the event_trigger_data structure is protected under
RCU sched locks. It was not annotated as such, and after doing so,
sparse pointed out several locations that required fix ups.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Trace event triggers added a lseek that uses the ftrace_filter_lseek()
function. Unfortunately, when function tracing is not configured in
that function is not defined and the kernel fails to build.
This is the second time that function was added to a file ops and
it broke the build due to requiring special config dependencies.
Make a generic tracing_lseek() that all the tracing utilities may
use.
Also, modify the old ftrace_filter_lseek() to return 0 instead of
1 on WRONLY. Not sure why it was a 1 as that does not make sense.
This also changes the old tracing_seek() to modify the file pos
pointer on WRONLY as well.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Add a generic event_command.set_trigger_filter() op implementation and
have the current set of trigger commands use it - this essentially
gives them all support for filters.
Syntactically, filters are supported by adding 'if <filter>' just
after the command, in which case only events matching the filter will
invoke the trigger. For example, to add a filter to an
enable/disable_event command:
echo 'enable_event:system:event if common_pid == 999' > \
.../othersys/otherevent/trigger
The above command will only enable the system:event event if the
common_pid field in the othersys:otherevent event is 999.
As another example, to add a filter to a stacktrace command:
echo 'stacktrace if common_pid == 999' > \
.../somesys/someevent/trigger
The above command will only trigger a stacktrace if the common_pid
field in the event is 999.
The filter syntax is the same as that described in the 'Event
filtering' section of Documentation/trace/events.txt.
Because triggers can now use filters, the trigger-invoking logic needs
to be moved in those cases - e.g. for ftrace_raw_event_calls, if a
trigger has a filter associated with it, the trigger invocation now
needs to happen after the { assign; } part of the call, in order for
the trigger condition to be tested.
There's still a SOFT_DISABLED-only check at the top of e.g. the
ftrace_raw_events function, so when an event is soft disabled but not
because of the presence of a trigger, the original SOFT_DISABLED
behavior remains unchanged.
There's also a bit of trickiness in that some triggers need to avoid
being invoked while an event is currently in the process of being
logged, since the trigger may itself log data into the trace buffer.
Thus we make sure the current event is committed before invoking those
triggers. To do that, we split the trigger invocation in two - the
first part (event_triggers_call()) checks the filter using the current
trace record; if a command has the post_trigger flag set, it sets a
bit for itself in the return value, otherwise it directly invoks the
trigger. Once all commands have been either invoked or set their
return flag, event_triggers_call() returns. The current record is
then either committed or discarded; if any commands have deferred
their triggers, those commands are finally invoked following the close
of the current event by event_triggers_post_call().
To simplify the above and make it more efficient, the TRIGGER_COND bit
is introduced, which is set only if a soft-disabled trigger needs to
use the log record for filter testing or needs to wait until the
current log record is closed.
The syscall event invocation code is also changed in analogous ways.
Because event triggers need to be able to create and free filters,
this also adds a couple external wrappers for the existing
create_filter and free_filter functions, which are too generic to be
made extern functions themselves.
Link: http://lkml.kernel.org/r/7164930759d8719ef460357f143d995406e4eead.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that event triggers use ftrace_event_file(), it needs to be outside
the #ifdef CONFIG_DYNAMIC_FTRACE, as it can now be used when that is
not defined.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add 'enable_event' and 'disable_event' event_command commands.
enable_event and disable_event event triggers are added by the user
via these commands in a similar way and using practically the same
syntax as the analagous 'enable_event' and 'disable_event' ftrace
function commands, but instead of writing to the set_ftrace_filter
file, the enable_event and disable_event triggers are written to the
per-event 'trigger' files:
echo 'enable_event:system:event' > .../othersys/otherevent/trigger
echo 'disable_event:system:event' > .../othersys/otherevent/trigger
The above commands will enable or disable the 'system:event' trace
events whenever the othersys:otherevent events are hit.
This also adds a 'count' version that limits the number of times the
command will be invoked:
echo 'enable_event:system:event:N' > .../othersys/otherevent/trigger
echo 'disable_event:system:event:N' > .../othersys/otherevent/trigger
Where N is the number of times the command will be invoked.
The above commands will will enable or disable the 'system:event'
trace events whenever the othersys:otherevent events are hit, but only
N times.
This also makes the find_event_file() helper function extern, since
it's useful to use from other places, such as the event triggers code,
so make it accessible.
Link: http://lkml.kernel.org/r/f825f3048c3f6b026ee37ae5825f9fc373451828.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add 'stacktrace' event_command. stacktrace event triggers are added
by the user via this command in a similar way and using practically
the same syntax as the analogous 'stacktrace' ftrace function command,
but instead of writing to the set_ftrace_filter file, the stacktrace
event trigger is written to the per-event 'trigger' files:
echo 'stacktrace' > .../tracing/events/somesys/someevent/trigger
The above command will turn on stacktraces for someevent i.e. whenever
someevent is hit, a stacktrace will be logged.
This also adds a 'count' version that limits the number of times the
command will be invoked:
echo 'stacktrace:N' > .../tracing/events/somesys/someevent/trigger
Where N is the number of times the command will be invoked.
The above command will log N stacktraces for someevent i.e. whenever
someevent is hit N times, a stacktrace will be logged.
Link: http://lkml.kernel.org/r/0c30c008a0828c660aa0e1bbd3255cf179ed5c30.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add 'snapshot' event_command. snapshot event triggers are added by
the user via this command in a similar way and using practically the
same syntax as the analogous 'snapshot' ftrace function command, but
instead of writing to the set_ftrace_filter file, the snapshot event
trigger is written to the per-event 'trigger' files:
echo 'snapshot' > .../somesys/someevent/trigger
The above command will turn on snapshots for someevent i.e. whenever
someevent is hit, a snapshot will be done.
This also adds a 'count' version that limits the number of times the
command will be invoked:
echo 'snapshot:N' > .../somesys/someevent/trigger
Where N is the number of times the command will be invoked.
The above command will snapshot N times for someevent i.e. whenever
someevent is hit N times, a snapshot will be done.
Also adds a new tracing_alloc_snapshot() function - the existing
tracing_snapshot_alloc() function is a special version of
tracing_snapshot() that also does the snapshot allocation - the
snapshot triggers would like to be able to do just the allocation but
not take a snapshot; the existing tracing_snapshot_alloc() in turn now
also calls tracing_alloc_snapshot() underneath to do that allocation.
Link: http://lkml.kernel.org/r/c9524dd07ce01f9dcbd59011290e0a8d5b47d7ad.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
[ fix up from kbuild test robot <fengguang.wu@intel.com report ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add 'traceon' and 'traceoff' event_command commands. traceon and
traceoff event triggers are added by the user via these commands in a
similar way and using practically the same syntax as the analagous
'traceon' and 'traceoff' ftrace function commands, but instead of
writing to the set_ftrace_filter file, the traceon and traceoff
triggers are written to the per-event 'trigger' files:
echo 'traceon' > .../tracing/events/somesys/someevent/trigger
echo 'traceoff' > .../tracing/events/somesys/someevent/trigger
The above command will turn tracing on or off whenever someevent is
hit.
This also adds a 'count' version that limits the number of times the
command will be invoked:
echo 'traceon:N' > .../tracing/events/somesys/someevent/trigger
echo 'traceoff:N' > .../tracing/events/somesys/someevent/trigger
Where N is the number of times the command will be invoked.
The above commands will will turn tracing on or off whenever someevent
is hit, but only N times.
Some common register/unregister_trigger() implementations of the
event_command reg()/unreg() callbacks are also provided, which add and
remove trigger instances to the per-event list of triggers, and
arm/disarm them as appropriate. event_trigger_callback() is a
general-purpose event_command func() implementation that orchestrates
command parsing and registration for most normal commands.
Most event commands will use these, but some will override and
possibly reuse them.
The event_trigger_init(), event_trigger_free(), and
event_trigger_print() functions are meant to be common implementations
of the event_trigger_ops init(), free(), and print() ops,
respectively.
Most trigger_ops implementations will use these, but some will
override and possibly reuse them.
Link: http://lkml.kernel.org/r/00a52816703b98d2072947478dd6e2d70cde5197.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a 'trigger' file for each trace event, enabling 'trace event
triggers' to be set for trace events.
'trace event triggers' are patterned after the existing 'ftrace
function triggers' implementation except that triggers are written to
per-event 'trigger' files instead of to a single file such as the
'set_ftrace_filter' used for ftrace function triggers.
The implementation is meant to be entirely separate from ftrace
function triggers, in order to keep the respective implementations
relatively simple and to allow them to diverge.
The event trigger functionality is built on top of SOFT_DISABLE
functionality. It adds a TRIGGER_MODE bit to the ftrace_event_file
flags which is checked when any trace event fires. Triggers set for a
particular event need to be checked regardless of whether that event
is actually enabled or not - getting an event to fire even if it's not
enabled is what's already implemented by SOFT_DISABLE mode, so trigger
mode directly reuses that. Event trigger essentially inherit the soft
disable logic in __ftrace_event_enable_disable() while adding a bit of
logic and trigger reference counting via tm_ref on top of that in a
new trace_event_trigger_enable_disable() function. Because the base
__ftrace_event_enable_disable() code now needs to be invoked from
outside trace_events.c, a wrapper is also added for those usages.
The triggers for an event are actually invoked via a new function,
event_triggers_call(), and code is also added to invoke them for
ftrace_raw_event calls as well as syscall events.
The main part of the patch creates a new trace_events_trigger.c file
to contain the trace event triggers implementation.
The standard open, read, and release file operations are implemented
here.
The open() implementation sets up for the various open modes of the
'trigger' file. It creates and attaches the trigger iterator and sets
up the command parser. If opened for reading set up the trigger
seq_ops.
The read() implementation parses the event trigger written to the
'trigger' file, looks up the trigger command, and passes it along to
that event_command's func() implementation for command-specific
processing.
The release() implementation does whatever cleanup is needed to
release the 'trigger' file, like releasing the parser and trigger
iterator, etc.
A couple of functions for event command registration and
unregistration are added, along with a list to add them to and a mutex
to protect them, as well as an (initially empty) registration function
to add the set of commands that will be added by future commits, and
call to it from the trace event initialization code.
also added are a couple trigger-specific data structures needed for
these implementations such as a trigger iterator and a struct for
trigger-specific data.
A couple structs consisting mostly of function meant to be implemented
in command-specific ways, event_command and event_trigger_ops, are
used by the generic event trigger command implementations. They're
being put into trace.h alongside the other trace_event data structures
and functions, in the expectation that they'll be needed in several
trace_event-related files such as trace_events_trigger.c and
trace_events.c.
The event_command.func() function is meant to be called by the trigger
parsing code in order to add a trigger instance to the corresponding
event. It essentially coordinates adding a live trigger instance to
the event, and arming the triggering the event.
Every event_command func() implementation essentially does the
same thing for any command:
- choose ops - use the value of param to choose either a number or
count version of event_trigger_ops specific to the command
- do the register or unregister of those ops
- associate a filter, if specified, with the triggering event
The reg() and unreg() ops allow command-specific implementations for
event_trigger_op registration and unregistration, and the
get_trigger_ops() op allows command-specific event_trigger_ops
selection to be parameterized. When a trigger instance is added, the
reg() op essentially adds that trigger to the triggering event and
arms it, while unreg() does the opposite. The set_filter() function
is used to associate a filter with the trigger - if the command
doesn't specify a set_filter() implementation, the command will ignore
filters.
Each command has an associated trigger_type, which serves double duty,
both as a unique identifier for the command as well as a value that
can be used for setting a trigger mode bit during trigger invocation.
The signature of func() adds a pointer to the event_command struct,
used to invoke those functions, along with a command_data param that
can be passed to the reg/unreg functions. This allows func()
implementations to use command-specific blobs and supports code
re-use.
The event_trigger_ops.func() command corrsponds to the trigger 'probe'
function that gets called when the triggering event is actually
invoked. The other functions are used to list the trigger when
needed, along with a couple mundane book-keeping functions.
This also moves event_file_data() into trace.h so it can be used
outside of trace_events.c.
Link: http://lkml.kernel.org/r/316d95061accdee070aac8e5750afba0192fa5b9.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Idea-by: Steve Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The problem is that the profiler only initializes the online
CPUs, and not possible CPUs. This causes issues if the user takes
CPUs online or offline while the profiler is running.
If we online a CPU after starting the profiler, we lose all the
trace information on the CPU going online.
If we offline a CPU after running a test and start a new test, it
will not clear the old data from that CPU.
This bug causes incorrect data to be reported to the user if they
online or offline CPUs during the profiling.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJSsNBHAAoJEKQekfcNnQGuKP8H/2mol/d7z2vANh7/FeNjTKIN
VkRzDEwUIwoaJBsL75EDDXBFx7w8jjAsXyoTrqrvMRV4UNcsfm46mohQTPAmK39y
muqodL1VnVXdKrUmtw/1nL7yDi2KltQH1UwOgvwXGuUFIq5cuCXNQxNK9/1fVVVn
tIMNz5kEAG3XCwnqP0PgQxWCuA7s+aQR0ijTf4vPf1G3IJujPyG9VhJWcGS3dJTR
t8TPyatd9D/S+7/r7iZ9hS8nWpaka3qJfhiWqk16SC9LiUXVA8oFOVMoN7n6Co5E
6r2dNo01WOABlojCxi1t3afUtcV1bUjBnVkiDva5cSc84pQSxe1qRrIpjTmHk00=
=MSZs
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"This fixes a long standing bug in the ftrace profiler. The problem is
that the profiler only initializes the online CPUs, and not possible
CPUs. This causes issues if the user takes CPUs online or offline
while the profiler is running.
If we online a CPU after starting the profiler, we lose all the trace
information on the CPU going online.
If we offline a CPU after running a test and start a new test, it will
not clear the old data from that CPU.
This bug causes incorrect data to be reported to the user if they
online or offline CPUs during the profiling"
* tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Initialize the ftrace profiler for each possible cpu
Ftrace currently initializes only the online CPUs. This implementation has
two problems:
- If we online a CPU after we enable the function profile, and then run the
test, we will lose the trace information on that CPU.
Steps to reproduce:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# echo 1 > /sys/devices/system/cpu/cpu1/online
# run test
- If we offline a CPU before we enable the function profile, we will not clear
the trace information when we enable the function profile. It will trouble
the users.
Steps to reproduce:
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# run test
# cat trace_stat/function*
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > function_profile_enabled
# echo 1 > function_profile_enabled
# cat trace_stat/function*
# run test
# cat trace_stat/function*
So it is better that we initialize the ftrace profiler for each possible cpu
every time we enable the function profile instead of just the online ones.
Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com
Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
all events. This was prevalent when FTRACE_SELFTEST was enabled which
enables all events several times, and caused the system bootup to
pause for over a minute.
This was tracked down to an addition of a synchronize_sched() performed
when system call tracepoints are unregistered.
The synchronize_sched() is needed between the unregistering of the
system call tracepoint and a deletion of a tracing instance buffer.
But placing the synchronize_sched() in the unreg of *every* system call
tracepoint is a bit overboard. A single synchronize_sched() before
the deletion of the instance is sufficient.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJSofYNAAoJEKQekfcNnQGuTUcIAJOx8745KlkFjN4VX+nCWNfP
xrrpnLBymbGMA9lQ4fXk+kdiuhH8DjRYdKq9fU4T481MFYKkToUIZH6NeaLI5fr1
0nBPPjVyAlJ+yt9JbOAYa1jEYnAr27ORDHEtdQnqb6OJSky3oh9jCQi+toxmh2qX
Sv1tIYeAf3K2V/h5xt6uSl9oiZ6KBtwE3f+xkHWNizaU9i2rq2gxd77fSbPNTIps
wLdsESYziA2UeAm13eh8xXo1uqRbfvx7bPr59cu0+3AqdOoaXFG+JoE/MqmljIO5
HkyCKJVqP8HD+QieuEB+hw4zpSsl6A6iKSlaiHm8NMnRRocfM6qMKMQXQ28KhxA=
=sYm/
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"A regression showed up that there's a large delay when enabling all
events. This was prevalent when FTRACE_SELFTEST was enabled which
enables all events several times, and caused the system bootup to
pause for over a minute.
This was tracked down to an addition of a synchronize_sched()
performed when system call tracepoints are unregistered.
The synchronize_sched() is needed between the unregistering of the
system call tracepoint and a deletion of a tracing instance buffer.
But placing the synchronize_sched() in the unreg of *every* system
call tracepoint is a bit overboard. A single synchronize_sched()
before the deletion of the instance is sufficient"
* tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Only run synchronize_sched() at instance deletion time
It has been reported that boot up with FTRACE_SELFTEST enabled can take a
very long time. There can be stalls of over a minute.
This was tracked down to the synchronize_sched() called when a system call
event is disabled. As the self tests enable and disable thousands of events,
this makes the synchronize_sched() get called thousands of times.
The synchornize_sched() was added with d562aff93b "tracing: Add support
for SOFT_DISABLE to syscall events" which caused this regression (added
in 3.13-rc1).
The synchronize_sched() is to protect against the events being accessed
when a tracer instance is being deleted. When an instance is being deleted
all the events associated to it are unregistered. The synchronize_sched()
makes sure that no more users are running when it finishes.
Instead of calling synchronize_sched() for all syscall events, we only
need to call it once, after the events are unregistered and before the
instance is deleted. The event_mutex is held during this action to
prevent new users from enabling events.
Link: http://lkml.kernel.org/r/20131203124120.427b9661@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Petr Mladek <pmladek@suse.cz>
Tested-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"Misc kernel and tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools lib traceevent: Fix conversion of pointer to integer of different size
perf/trace: Properly use u64 to hold event_id
perf: Remove fragile swevent hlist optimization
ftrace, perf: Avoid infinite event generation loop
tools lib traceevent: Fix use of multiple options in processing field
perf header: Fix possible memory leaks in process_group_desc()
perf header: Fix bogus group name
perf tools: Tag thread comm as overriden
Commit 8c4f3c3fa9 "ftrace: Check module functions being traced on reload"
fixed module loading and unloading with respect to function tracing, but
it missed the function graph tracer. If you perform the following
# cd /sys/kernel/debug/tracing
# echo function_graph > current_tracer
# modprobe nfsd
# echo nop > current_tracer
You'll get the following oops message:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
Call Trace:
[<ffffffff814fe193>] dump_stack+0x4f/0x7c
[<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b
[<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
[<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9
[<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364
[<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b
[<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78
[<ffffffff810c4b30>] graph_trace_reset+0xe/0x10
[<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a
[<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd
[<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde
[<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e
[<ffffffff81122aed>] vfs_write+0xab/0xf6
[<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
[<ffffffff81122dbd>] SyS_write+0x59/0x82
[<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
[<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b
---[ end trace 940358030751eafb ]---
The above mentioned commit didn't go far enough. Well, it covered the
function tracer by adding checks in __register_ftrace_function(). The
problem is that the function graph tracer circumvents that (for a slight
efficiency gain when function graph trace is running with a function
tracer. The gain was not worth this).
The problem came with ftrace_startup() which should always be called after
__register_ftrace_function(), if you want this bug to be completely fixed.
Anyway, this solution moves __register_ftrace_function() inside of
ftrace_startup() and removes the need to call them both.
Reported-by: Dave Wysochanski <dwysocha@redhat.com>
Fixes: ed926f9b35 ("ftrace: Use counters to enable functions to trace")
Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The 64-bit attr.config value for perf trace events was being copied into
an "int" before doing a comparison, meaning the top 32 bits were
being truncated.
As far as I can tell this didn't cause any errors, but it did mean
it was possible to create valid aliases for all the tracepoint ids
which I don't think was intended. (For example, 0xffffffff00000018
and 0x18 both enable the same tracepoint).
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1311151236100.11932@vincent-weaver-1.um.maine.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince's perf-trinity fuzzer found yet another 'interesting' problem.
When we sample the irq_work_exit tracepoint with period==1 (or
PERF_SAMPLE_PERIOD) and we add an fasync SIGNAL handler we create an
infinite event generation loop:
,-> <IPI>
| irq_work_exit() ->
| trace_irq_work_exit() ->
| ...
| __perf_event_overflow() -> (due to fasync)
| irq_work_queue() -> (irq_work_list must be empty)
'--------- arch_irq_work_raise()
Similar things can happen due to regular poll() wakeups if we exceed
the ring-buffer wakeup watermark, or have an event_limit.
To avoid this, dis-allow sampling this particular tracepoint.
In order to achieve this, create a special perf_perm function pointer
for each event and call this (when set) on trying to create a
tracepoint perf event.
[ roasted: use expr... to allow for ',' in your expression ]
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20131114152304.GC5364@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The only real feature that was added this release is from Namhyung Kim,
who introduced "set_graph_notrace" filter that lets you run the function
graph tracer and not trace particular functions and their call chain.
Tom Zanussi added some updates to the ftrace multibuffer tracing that
made it more consistent with the top level tracing.
One of the fixes for perf function tracing required an API change in
RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
that change in this release too, he gave me a branch that included
all the changes to get that working, and I pulled that into my tree
in order to complete the perf function tracing fix.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJSgX5SAAoJEKQekfcNnQGulUAH/jORqJrKaNAulmZ314VsAqfa
zMtF5UAAPf7kqc3AN/jtFrhJUNEfxWOo7A4r0FsM/rKdWJF+98GA6aqYVD+XoWFt
+36fg1enxbXUjixQ96Uh+o1+BJUgYDqljuWzqSu/oiXWfWwl8+WL4kcbhb+V9WcF
SpdzLCWVZRfhyDiN3+0zvyQ8RSG2Pd7CWn9zroI0e4sxGo0Ki6JUnIcXtZGOBDOQ
IIZdjXvGSfpJ+3u3XvRPXJcltRCtOsVWxYzrmvRlmHDW5QMe1+WmmrlojTePrLaJ
xn8+3WINqetAR+ZQnazbpt1XzJzKa8QtFgpiN0kT6qL7cg3N1Owc4vLGohl7wok=
=Nesf
-----END PGP SIGNATURE-----
Merge tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing update from Steven Rostedt:
"This batch of changes is mostly clean ups and small bug fixes. The
only real feature that was added this release is from Namhyung Kim,
who introduced "set_graph_notrace" filter that lets you run the
function graph tracer and not trace particular functions and their
call chain.
Tom Zanussi added some updates to the ftrace multibuffer tracing that
made it more consistent with the top level tracing.
One of the fixes for perf function tracing required an API change in
RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
that change in this release too, he gave me a branch that included all
the changes to get that working, and I pulled that into my tree in
order to complete the perf function tracing fix"
* tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add rcu annotation for syscall trace descriptors
tracing: Do not use signed enums with unsigned long long in fgragh output
tracing: Remove unused function ftrace_off_permanent()
tracing: Do not assign filp->private_data to freed memory
tracing: Add helper function tracing_is_disabled()
tracing: Open tracer when ftrace_dump_on_oops is used
tracing: Add support for SOFT_DISABLE to syscall events
tracing: Make register/unregister_ftrace_command __init
tracing: Update event filters for multibuffer
recordmcount.pl: Add support for __fentry__
ftrace: Have control op function callback only trace when RCU is watching
rcu: Do not trace rcu_is_watching() functions
ftrace/x86: skip over the breakpoint for ftrace caller
trace/trace_stat: use rbtree postorder iteration helper instead of opencoding
ftrace: Add set_graph_notrace filter
ftrace: Narrow down the protected area of graph_lock
ftrace: Introduce struct ftrace_graph_data
ftrace: Get rid of ftrace_graph_filter_enabled
tracing: Fix potential out-of-bounds in trace_get_user()
tracing: Show more exact help information about snapshot
Pull block IO core updates from Jens Axboe:
"This is the pull request for the core changes in the block layer for
3.13. It contains:
- The new blk-mq request interface.
This is a new and more scalable queueing model that marries the
best part of the request based interface we currently have (which
is fully featured, but scales poorly) and the bio based "interface"
which the new drivers for high IOPS devices end up using because
it's much faster than the request based one.
The bio interface has no block layer support, since it taps into
the stack much earlier. This means that drivers end up having to
implement a lot of functionality on their own, like tagging,
timeout handling, requeue, etc. The blk-mq interface provides all
these. Some drivers even provide a switch to select bio or rq and
has code to handle both, since things like merging only works in
the rq model and hence is faster for some workloads. This is a
huge mess. Conversion of these drivers nets us a substantial code
reduction. Initial results on converting SCSI to this model even
shows an 8x improvement on single queue devices. So while the
model was intended to work on the newer multiqueue devices, it has
substantial improvements for "classic" hardware as well. This code
has gone through extensive testing and development, it's now ready
to go. A pull request is coming to convert virtio-blk to this
model will be will be coming as well, with more drivers scheduled
for 3.14 conversion.
- Two blktrace fixes from Jan and Chen Gang.
- A plug merge fix from Alireza Haghdoost.
- Conversion of __get_cpu_var() from Christoph Lameter.
- Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.
- A fix for a race between request completion and the timeout
handling from Jeff Moyer. This is what caused the merge conflict
with blk-mq/core, in case you are looking at that.
- A dm stacking fix from Mike Snitzer.
- A code consolidation fix and duplicated code removal from Kent
Overstreet.
- A handful of block bug fixes from Mikulas Patocka, fixing a loop
crash and memory corruption on blk cg.
- Elevator switch bug fix from Tomoki Sekiyama.
A heads-up that I had to rebase this branch. Initially the immutable
bio_vecs had been queued up for inclusion, but a week later, it became
clear that it wasn't fully cooked yet. So the decision was made to
pull this out and postpone it until 3.14. It was a straight forward
rebase, just pruning out the immutable series and the later fixes of
problems with it. The rest of the patches applied directly and no
further changes were made"
* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: Do not call sector_div() with a 64-bit divisor
kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
block: Consolidate duplicated bio_trim() implementations
block: Use rw_copy_check_uvector()
block: Enable sysfs nomerge control for I/O requests in the plug list
block: properly stack underlying max_segment_size to DM device
elevator: acquire q->sysfs_lock in elevator_change()
elevator: Fix a race in elevator switching and md device initialization
block: Replace __get_cpu_var uses
bdi: test bdi_init failure
block: fix a probe argument to blk_register_region
loop: fix crash if blk_alloc_queue fails
blk-core: Fix memory corruption if blkcg_init_queue fails
block: fix race between request completion and timeout handling
blktrace: Send BLK_TN_PROCESS events to all running traces
blk-mq: don't disallow request merges for req->special being set
blk-mq: mq plug list breakage
blk-mq: fix for flush deadlock
...
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle are:
- (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
van Riel, Peter Zijlstra et al. Yay!
- optimize preemption counter handling: merge the NEED_RESCHED flag
into the preempt_count variable, by Peter Zijlstra.
- wait.h fixes and code reorganization from Peter Zijlstra
- cfs_bandwidth fixes from Ben Segall
- SMP load-balancer cleanups from Peter Zijstra
- idle balancer improvements from Jason Low
- other fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
stop_machine: Fix race between stop_two_cpus() and stop_cpus()
sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
sched: Fix asymmetric scheduling for POWER7
sched: Move completion code from core.c to completion.c
sched: Move wait code from core.c to wait.c
sched: Move wait.c into kernel/sched/
sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
sched: Avoid throttle_cfs_rq() racing with period_timer stopping
sched: Guarantee new group-entities always have weight
sched: Fix hrtimer_cancel()/rq->lock deadlock
sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
sched: Fix race on toggling cfs_bandwidth_used
sched: Remove extra put_online_cpus() inside sched_setaffinity()
sched/rt: Fix task_tick_rt() comment
sched/wait: Fix build breakage
sched/wait: Introduce prepare_to_wait_event()
sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
sched: Remove get_online_cpus() usage
sched: Fix race in migrate_swap_stop()
...
sparse complains about the enter/exit_sysycall_files[] variables being
dereferenced with rcu_dereference_sched(). The fields need to be
annotated with __rcu.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the introduction of PREEMPT_NEED_RESCHED in:
f27dde8dee ("sched: Add NEED_RESCHED to the preempt_count")
we need to be able to look at both TIF_NEED_RESCHED and
PREEMPT_NEED_RESCHED to understand the full preemption behaviour.
Add it to the trace output.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/20131004152826.GP3081@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
do_blk_trace_setup() will fully initialize 'buts.name', so can remove
the related memcpy(). And also use BLKTRACE_BDEV_SIZE and ARRAY_SIZE
instead of hard code number '32'.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently each task sends BLK_TN_PROCESS event to the first traced
device it interacts with after a new trace is started. When there are
several traced devices and the task accesses more devices, this logic
can result in BLK_TN_PROCESS being sent several times to some devices
while it is never sent to other devices. Thus blkparse doesn't display
command name when parsing some blktrace files.
Fix the problem by sending BLK_TN_PROCESS event to all traced devices
when a task interacts with any of them.
Signed-off-by: Jan Kara <jack@suse.cz>
Review-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The duration field of print_graph_duration() can also be used
to do the space filling by passing an enum in it:
DURATION_FILL_FULL
DURATION_FILL_START
DURATION_FILL_END
The problem is that these are enums and defined as negative,
but the duration field is unsigned long long. Most archs are
fine with this but blackfin fails to compile because of it:
kernel/built-in.o: In function `print_graph_duration':
kernel/trace/trace_functions_graph.c:782: undefined reference to `__ucmpdi2'
Overloading a unsigned long long with an signed enum is just
bad in principle. We can accomplish the same thing by using
part of the flags field instead.
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the past, ftrace_off_permanent() was called if something
strange was detected. But the ftrace_bug() now handles all the
anomolies that can happen with ftrace (function tracing), and there
are no uses of ftrace_off_permanent(). Get rid of it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In system_tr_open(), the filp->private_data can be assigned the 'dir'
variable even if it was freed. This is on the error path, and is
harmless because the error return code will prevent filp->private_data
from being used. But for correctness, we should not assign it to
a recently freed variable, as that can cause static tools to give
false warnings.
Also have both subsystem_open() and system_tr_open() return -ENODEV
if tracing has been disabled.
Link: http://lkml.kernel.org/r/1383764571-7318-1-git-send-email-geyslan@gmail.com
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current default perf paranoid level is "1" which has
"perf_paranoid_kernel()" return false, and giving any operations that
use it, access to normal users. Unfortunately, this includes function
tracing and normal users should not be allowed to enable function
tracing by default.
The proper level is defined at "-1" (full perf access), which
"perf_paranoid_tracepoint_raw()" will only give access to. Use that
check instead for enabling function tracing.
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org # 3.4+
CVE: CVE-2013-2930
Fixes: ced39002f5 ("ftrace, perf: Add support to use function tracepoint in perf")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With ftrace_dump_on_oops, we previously did not open the tracer in
question, sometimes causing the trace output to be useless.
For example, the function_graph tracer with tracing_thresh set dumped via
ftrace_dump_on_oops would show a series of '}' indented at different levels,
but no function names.
call trace->open() (and do a few other fixups copied from the normal dump
path) to make the output more intelligible.
Link: http://lkml.kernel.org/r/1382554197-16961-1-git-send-email-cody@linux.vnet.ibm.com
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The original SOFT_DISABLE patches didn't add support for soft disable
of syscall events; this adds it.
Add an array of ftrace_event_file pointers indexed by syscall number
to the trace array and remove the existing enabled bitmaps, which as a
result are now redundant. The ftrace_event_file structs in turn
contain the soft disable flags we need for per-syscall soft disable
accounting.
Adding ftrace_event_files also means we can remove the USE_CALL_FILTER
bit, thus enabling multibuffer filter support for syscall events.
Link: http://lkml.kernel.org/r/6e72b566e85d8df8042f133efbc6c30e21fb017e.1382620672.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace event filters are still tied to event calls rather than
event files, which means you don't get what you'd expect when using
filters in the multibuffer case:
Before:
# echo 'bytes_alloc > 8192' > /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# mkdir /sys/kernel/debug/tracing/instances/test1
# echo 'bytes_alloc > 2048' > /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 2048
# cat /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
bytes_alloc > 2048
Setting the filter in tracing/instances/test1/events shouldn't affect
the same event in tracing/events as it does above.
After:
# echo 'bytes_alloc > 8192' > /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# mkdir /sys/kernel/debug/tracing/instances/test1
# echo 'bytes_alloc > 2048' > /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# cat /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
bytes_alloc > 2048
We'd like to just move the filter directly from ftrace_event_call to
ftrace_event_file, but there are a couple cases that don't yet have
multibuffer support and therefore have to continue using the current
event_call-based filters. For those cases, a new USE_CALL_FILTER bit
is added to the event_call flags, whose main purpose is to keep the
old behavior for those cases until they can be updated with
multibuffer support; at that point, the USE_CALL_FILTER flag (and the
new associated call_filter_check_discard() function) can go away.
The multibuffer support also made filter_current_check_discard()
redundant, so this change removes that function as well and replaces
it with filter_check_discard() (or call_filter_check_discard() as
appropriate).
Link: http://lkml.kernel.org/r/f16e9ce4270c62f46b2e966119225e1c3cca7e60.1382620672.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Dave Jones reported that trinity would be able to trigger the following
back trace:
===============================
[ INFO: suspicious RCU usage. ]
3.10.0-rc2+ #38 Not tainted
-------------------------------
include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!
other info that might help us debug this:
RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
1 lock held by trinity-child1/18786:
#0: (rcu_read_lock){.+.+..}, at: [<ffffffff8113dd48>] __perf_event_overflow+0x108/0x310
stack backtrace:
CPU: 3 PID: 18786 Comm: trinity-child1 Not tainted 3.10.0-rc2+ #38
0000000000000000 ffff88020767bac8 ffffffff816e2f6b ffff88020767baf8
ffffffff810b5897 ffff88021de92520 0000000000000000 ffff88020767bbf8
0000000000000000 ffff88020767bb78 ffffffff8113ded4 ffffffff8113dd48
Call Trace:
[<ffffffff816e2f6b>] dump_stack+0x19/0x1b
[<ffffffff810b5897>] lockdep_rcu_suspicious+0xe7/0x120
[<ffffffff8113ded4>] __perf_event_overflow+0x294/0x310
[<ffffffff8113dd48>] ? __perf_event_overflow+0x108/0x310
[<ffffffff81309289>] ? __const_udelay+0x29/0x30
[<ffffffff81076054>] ? __rcu_read_unlock+0x54/0xa0
[<ffffffff816f4000>] ? ftrace_call+0x5/0x2f
[<ffffffff8113dfa1>] perf_swevent_overflow+0x51/0xe0
[<ffffffff8113e08f>] perf_swevent_event+0x5f/0x90
[<ffffffff8113e1c9>] perf_tp_event+0x109/0x4f0
[<ffffffff8113e36f>] ? perf_tp_event+0x2af/0x4f0
[<ffffffff81074630>] ? __rcu_read_lock+0x20/0x20
[<ffffffff8112d79f>] perf_ftrace_function_call+0xbf/0xd0
[<ffffffff8110e1e1>] ? ftrace_ops_control_func+0x181/0x210
[<ffffffff81074630>] ? __rcu_read_lock+0x20/0x20
[<ffffffff81100cae>] ? rcu_eqs_enter_common+0x5e/0x470
[<ffffffff8110e1e1>] ftrace_ops_control_func+0x181/0x210
[<ffffffff816f4000>] ftrace_call+0x5/0x2f
[<ffffffff8110e229>] ? ftrace_ops_control_func+0x1c9/0x210
[<ffffffff816f4000>] ? ftrace_call+0x5/0x2f
[<ffffffff81074635>] ? debug_lockdep_rcu_enabled+0x5/0x40
[<ffffffff81074635>] ? debug_lockdep_rcu_enabled+0x5/0x40
[<ffffffff81100cae>] ? rcu_eqs_enter_common+0x5e/0x470
[<ffffffff8110112a>] rcu_eqs_enter+0x6a/0xb0
[<ffffffff81103673>] rcu_user_enter+0x13/0x20
[<ffffffff8114541a>] user_enter+0x6a/0xd0
[<ffffffff8100f6d8>] syscall_trace_leave+0x78/0x140
[<ffffffff816f46af>] int_check_syscall_exit_work+0x34/0x3d
------------[ cut here ]------------
Perf uses rcu_read_lock() but as the function tracer can trace functions
even when RCU is not currently active, this makes the rcu_read_lock()
used by perf ineffective.
As perf is currently the only user of the ftrace_ops_control_func() and
perf is also the only function callback that actively uses rcu_read_lock(),
the quick fix is to prevent the ftrace_ops_control_func() from calling
its callbacks if RCU is not active.
With Paul's new "rcu_is_watching()" we can tell if RCU is active or not.
Reported-by: Dave Jones <davej@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead
of opencoding an alternate postorder iteration that modifies the tree
Link: http://lkml.kernel.org/r/1383345566-25087-2-git-send-email-cody@linux.vnet.ibm.com
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The parser set up is just a generic utility that uses local variables
allocated by the function. There's no need to hold the graph_lock for
this set up.
This also makes the code simpler.
Link: http://lkml.kernel.org/r/1381739066-7531-4-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The struct ftrace_graph_data is for generalizing the access to
set_graph_function file. This is a preparation for adding support to
set_graph_notrace.
Link: http://lkml.kernel.org/r/1381739066-7531-3-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_graph_filter_enabled means that user sets function filter
and it always has same meaning of ftrace_graph_count > 0.
Link: http://lkml.kernel.org/r/1381739066-7531-2-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Andrey reported the following report:
ERROR: AddressSanitizer: heap-buffer-overflow on address ffff8800359c99f3
ffff8800359c99f3 is located 0 bytes to the right of 243-byte region [ffff8800359c9900, ffff8800359c99f3)
Accessed by thread T13003:
#0 ffffffff810dd2da (asan_report_error+0x32a/0x440)
#1 ffffffff810dc6b0 (asan_check_region+0x30/0x40)
#2 ffffffff810dd4d3 (__tsan_write1+0x13/0x20)
#3 ffffffff811cd19e (ftrace_regex_release+0x1be/0x260)
#4 ffffffff812a1065 (__fput+0x155/0x360)
#5 ffffffff812a12de (____fput+0x1e/0x30)
#6 ffffffff8111708d (task_work_run+0x10d/0x140)
#7 ffffffff810ea043 (do_exit+0x433/0x11f0)
#8 ffffffff810eaee4 (do_group_exit+0x84/0x130)
#9 ffffffff810eafb1 (SyS_exit_group+0x21/0x30)
#10 ffffffff81928782 (system_call_fastpath+0x16/0x1b)
Allocated by thread T5167:
#0 ffffffff810dc778 (asan_slab_alloc+0x48/0xc0)
#1 ffffffff8128337c (__kmalloc+0xbc/0x500)
#2 ffffffff811d9d54 (trace_parser_get_init+0x34/0x90)
#3 ffffffff811cd7b3 (ftrace_regex_open+0x83/0x2e0)
#4 ffffffff811cda7d (ftrace_filter_open+0x2d/0x40)
#5 ffffffff8129b4ff (do_dentry_open+0x32f/0x430)
#6 ffffffff8129b668 (finish_open+0x68/0xa0)
#7 ffffffff812b66ac (do_last+0xb8c/0x1710)
#8 ffffffff812b7350 (path_openat+0x120/0xb50)
#9 ffffffff812b8884 (do_filp_open+0x54/0xb0)
#10 ffffffff8129d36c (do_sys_open+0x1ac/0x2c0)
#11 ffffffff8129d4b7 (SyS_open+0x37/0x50)
#12 ffffffff81928782 (system_call_fastpath+0x16/0x1b)
Shadow bytes around the buggy address:
ffff8800359c9700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
ffff8800359c9780: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
ffff8800359c9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>ffff8800359c9980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[03]fb
ffff8800359c9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
ffff8800359c9b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8800359c9c00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap redzone: fa
Heap kmalloc redzone: fb
Freed heap region: fd
Shadow gap: fe
The out-of-bounds access happens on 'parser->buffer[parser->idx] = 0;'
Although the crash happened in ftrace_regex_open() the real bug
occurred in trace_get_user() where there's an incrementation to
parser->idx without a check against the size. The way it is triggered
is if userspace sends in 128 characters (EVENT_BUF_SIZE + 1), the loop
that reads the last character stores it and then breaks out because
there is no more characters. Then the last character is read to determine
what to do next, and the index is incremented without checking size.
Then the caller of trace_get_user() usually nulls out the last character
with a zero, but since the index is equal to the size, it writes a nul
character after the allocated space, which can corrupt memory.
Luckily, only root user has write access to this file.
Link: http://lkml.kernel.org/r/20131009222323.04fd1a0d@gandalf.local.home
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current "help" that comes out of the snapshot file when it is
not allocated looks like this:
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
Echo 2 says that it does not allocate the buffer, which is correct,
but to be more consistent with "echo 0" it should also state
that it does not free.
Link: http://lkml.kernel.org/r/20130914045916.GA4243@udknight
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
are still in flux, and will have to wait for 3.13.
The changes for 3.12 are mostly clean ups and minor fixes.
H. Peter Anvin added a check to x86_32 static function tracing that
helps a small segment of the kernel community.
Oleg Nesterov had a few changes from 3.11, but were mostly clean ups
and not worth pushing in the -rc time frame.
Li Zefan had small clean up with annotating a raw_init with __init.
I fixed a slight race in updating function callbacks, but the race
is so small and the bug that happens when it occurs is so minor it's
not even worth pushing to stable.
The only real enhancement is from Alexander Z Lam that made the
tracing_cpumask work for trace buffer instances, instead of them all
sharing a global cpumask.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJSLJm1AAoJEOdOSU1xswtMSu0H/0/Uuh0D5VhANZRcTATY4gUO
n3WH6sm3atOxH+cbeYQcFXxOcvRcR2n90tvCMpiFlPiC0NiNR1yjro3VLS4zWb77
twq7gABdJf+Tdq7sOBmSzmY5vRKQVHIXvAfC27mBez38nCWZz0BjJGEsPBwoly25
ZaiCbKlusw/QKIEy40tuKUL/rXF6yEWnQrMujhBbyNm0w7sJVdfnd+HHmCvy15H2
IQE1g83d/dAMBjFY2BYg77J+oV6qmJxql2itvDivQWXHqFb52Jw3ZTwHwWLZlPYU
AZcHtYGs2lSUscQLF56LejB7zZyE8taUufExFEVexXxZS5u7nNPXsPrA2LOOK70=
=JWO6
-----END PGP SIGNATURE-----
Merge tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Not much changes for the 3.12 merge window. The major tracing changes
are still in flux, and will have to wait for 3.13.
The changes for 3.12 are mostly clean ups and minor fixes.
H Peter Anvin added a check to x86_32 static function tracing that
helps a small segment of the kernel community.
Oleg Nesterov had a few changes from 3.11, but were mostly clean ups
and not worth pushing in the -rc time frame.
Li Zefan had small clean up with annotating a raw_init with __init.
I fixed a slight race in updating function callbacks, but the race is
so small and the bug that happens when it occurs is so minor it's not
even worth pushing to stable.
The only real enhancement is from Alexander Z Lam that made the
tracing_cpumask work for trace buffer instances, instead of them all
sharing a global cpumask"
* tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/rcu: Do not trace debug_lockdep_rcu_enabled()
x86-32, ftrace: Fix static ftrace when early microcode is enabled
ftrace: Fix a slight race in modifying what function callback gets traced
tracing: Make tracing_cpumask available for all instances
tracing: Kill the !CONFIG_MODULES code in trace_events.c
tracing: Don't pass file_operations array to event_create_dir()
tracing: Kill trace_create_file_ops() and friends
tracing/syscalls: Annotate raw_init function with __init
There's a slight race when going from a list function to a non list
function. That is, when only one callback is registered to the function
tracer, it gets called directly by the mcount trampoline. But if this
function has filters, it may be called by the wrong functions.
As the list ops callback that handles multiple callbacks that are
registered to ftrace, it also handles what functions they call. While
the transaction is taking place, use the list function always, and
after all the updates are finished (only the functions that should be
traced are being traced), then we can update the trampoline to call
the function directly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull RCU updates from Paul E. McKenney:
"
* Update RCU documentation. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/611.
* Miscellaneous fixes. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/619.
* Full-system idle detection. This is for use by Frederic
Weisbecker's adaptive-ticks mechanism. Its purpose is
to allow the timekeeping CPU to shut off its tick when
all other CPUs are idle. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/648.
* Improve rcutorture test coverage. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/675.
"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow tracer instances to disable tracing by cpu by moving
the static global tracing_cpumask into trace_array.
Link: http://lkml.kernel.org/r/921622317f239bfc2283cac2242647801ef584f2.1375980149.git.azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move trace_module_nb under CONFIG_MODULES and kill the dummy
trace_module_notify(). Imho it doesn't make sense to define
"struct notifier_block" and its .notifier_call just to avoid
"ifdef" in event_trace_init(), and all other !CONFIG_MODULES
code has already gone away.
Link: http://lkml.kernel.org/r/20130731173137.GA31043@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that event_create_dir() and __trace_add_new_event() always
use the same file_operations we can kill these arguments and
simplify the code.
Link: http://lkml.kernel.org/r/20130731173135.GA31040@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_create_file_ops() allocates the copy of id/filter/format/enable
file_operations to set "f_op->owner = mod" for fops_get().
However after the recent changes there is no reason to prevent rmmod
even if one of these files is opened. A file operation can do nothing
but fail after remove_event_file_dir() clears ->i_private for every
file removed by trace_module_remove_events().
Kill "struct ftrace_module_file_ops" and fix the compilation errors.
Link: http://lkml.kernel.org/r/20130731173132.GA31033@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
init_syscall_trace() can only be called during kernel bootup only, so we can
mark it and the functions it calls as __init.
Link: http://lkml.kernel.org/r/51528E89.6080508@huawei.com
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fixed two issues with changing the timestamp clock with trace_clock:
- The global buffer was reset on instance clock changes. Change this to pass
the correct per-instance buffer
- ftrace_now() is used to set buf->time_start in tracing_reset_online_cpus().
This was incorrect because ftrace_now() used the global buffer's clock to
return the current time. Change this to use buffer_ftrace_now() which
returns the current time for the correct per-instance buffer.
Also removed tracing_reset_current() because it is not used anywhere
Link: http://lkml.kernel.org/r/1375493777-17261-2-git-send-email-azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Releasing the free_buffer file in an instance causes the global buffer
to be stopped when TRACE_ITER_STOP_ON_FREE is enabled. Operate on the
correct buffer.
Link: http://lkml.kernel.org/r/1375493777-17261-1-git-send-email-azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_read_pipe zeros all fields bellow "seq". The declaration contains
a comment about that, but it doesn't help.
The first field is "snapshot", it's true when current open file is
snapshot. Looks obvious, that it should not be zeroed.
The second field is "started". It was converted from cpumask_t to
cpumask_var_t (v2.6.28-4983-g4462344), in other words it was
converted from cpumask to pointer on cpumask.
Currently the reference on "started" memory is lost after the first read
from tracing_read_pipe and a proper object will never be freed.
The "started" is never dereferenced for trace_pipe, because trace_pipe
can't have the TRACE_FILE_ANNOTATE options.
Link: http://lkml.kernel.org/r/1375463803-3085183-1-git-send-email-avagin@openvz.org
Cc: stable@vger.kernel.org # 2.6.30
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Uprobes suffer the same problem that kprobes have. There's a race between
writing to the "enable" file and removing the probe. The probe checks for
it being in use and if it is not, goes about deleting the probe and the
event that represents it. But the problem with that is, after it checks
if it is in use it can be enabled, and the deletion of the event (access
to the probe) will fail, as it is in use. But the uprobe will still be
deleted. This is a problem as the event can reference the uprobe that
was deleted.
The fix is to remove the event first, and check to make sure the event
removal succeeds. Then it is safe to remove the probe.
When the event exists, either ftrace or perf can enable the probe and
prevent the event from being removed.
Link: http://lkml.kernel.org/r/20130704034038.991525256@goodmis.org
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a probe is being removed, it cleans up the event files that correspond
to the probe. But there is a race between writing to one of these files
and deleting the probe. This is especially true for the "enable" file.
CPU 0 CPU 1
----- -----
fd = open("enable",O_WRONLY);
probes_open()
release_all_trace_probes()
unregister_trace_probe()
if (trace_probe_is_enabled(tp))
return -EBUSY
write(fd, "1", 1)
__ftrace_set_clr_event()
call->class->reg()
(kprobe_register)
enable_trace_probe(tp)
__unregister_trace_probe(tp);
list_del(&tp->list)
unregister_probe_event(tp) <-- fails!
free_trace_probe(tp)
write(fd, "0", 1)
__ftrace_set_clr_event()
call->class->unreg
(kprobe_register)
disable_trace_probe(tp) <-- BOOM!
A test program was written that used two threads to simulate the
above scenario adding a nanosleep() interval to change the timings
and after several thousand runs, it was able to trigger this bug
and crash:
BUG: unable to handle kernel paging request at 00000005000000f9
IP: [<ffffffff810dee70>] probes_open+0x3b/0xa7
PGD 7808a067 PUD 0
Oops: 0000 [#1] PREEMPT SMP
Dumping ftrace buffer:
---------------------------------
Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6
CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 3.11.0-rc3-test+ #47
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
task: ffff880077756440 ti: ffff880076e52000 task.ti: ffff880076e52000
RIP: 0010:[<ffffffff810dee70>] [<ffffffff810dee70>] probes_open+0x3b/0xa7
RSP: 0018:ffff880076e53c38 EFLAGS: 00010203
RAX: 0000000500000001 RBX: ffff88007844f440 RCX: 0000000000000003
RDX: 0000000000000003 RSI: 0000000000000003 RDI: ffff880076e52000
RBP: ffff880076e53c58 R08: ffff880076e53bd8 R09: 0000000000000000
R10: ffff880077756440 R11: 0000000000000006 R12: ffffffff810dee35
R13: ffff880079250418 R14: 0000000000000000 R15: ffff88007844f450
FS: 00007f87a276f700(0000) GS:ffff88007d480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000005000000f9 CR3: 0000000077262000 CR4: 00000000000007e0
Stack:
ffff880076e53c58 ffffffff81219ea0 ffff88007844f440 ffffffff810dee35
ffff880076e53ca8 ffffffff81130f78 ffff8800772986c0 ffff8800796f93a0
ffffffff81d1b5d8 ffff880076e53e04 0000000000000000 ffff88007844f440
Call Trace:
[<ffffffff81219ea0>] ? security_file_open+0x2c/0x30
[<ffffffff810dee35>] ? unregister_trace_probe+0x4b/0x4b
[<ffffffff81130f78>] do_dentry_open+0x162/0x226
[<ffffffff81131186>] finish_open+0x46/0x54
[<ffffffff8113f30b>] do_last+0x7f6/0x996
[<ffffffff8113cc6f>] ? inode_permission+0x42/0x44
[<ffffffff8113f6dd>] path_openat+0x232/0x496
[<ffffffff8113fc30>] do_filp_open+0x3a/0x8a
[<ffffffff8114ab32>] ? __alloc_fd+0x168/0x17a
[<ffffffff81131f4e>] do_sys_open+0x70/0x102
[<ffffffff8108f06e>] ? trace_hardirqs_on_caller+0x160/0x197
[<ffffffff81131ffe>] SyS_open+0x1e/0x20
[<ffffffff81522742>] system_call_fastpath+0x16/0x1b
Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 6c 31 f6 48 c7
RIP [<ffffffff810dee70>] probes_open+0x3b/0xa7
RSP <ffff880076e53c38>
CR2: 00000005000000f9
---[ end trace 35f17d68fc569897 ]---
The unregister_trace_probe() must be done first, and if it fails it must
fail the removal of the kprobe.
Several changes have already been made by Oleg Nesterov and Masami Hiramatsu
to allow moving the unregister_probe_event() before the removal of
the probe and exit the function if it fails. This prevents the tp
structure from being used after it is freed.
Link: http://lkml.kernel.org/r/20130704034038.819592356@goodmis.org
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "break" used in the do_for_each_event_file() is used as an optimization
as the loop is really a double loop. The loop searches all event files
for each trace_array. There's only one matching event file per trace_array
and after we find the event file for the trace_array, the break is used
to jump to the next trace_array and start the search there.
As this is not a standard way of using "break" in C code, it requires
a comment right before the break to let people know what is going on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change trace_remove_event_call(call) to return the error if this
call is active. This is what the callers assume but can't verify
outside of the tracing locks. Both trace_kprobe.c/trace_uprobe.c
need the additional changes, unregister_trace_probe() should abort
if trace_remove_event_call() fails.
The caller is going to free this call/file so we must ensure that
nobody can use them after trace_remove_event_call() succeeds.
debugfs should be fine after the previous changes and event_remove()
does TRACE_REG_UNREGISTER, but still there are 2 reasons why we need
the additional checks:
- There could be a perf_event(s) attached to this tp_event, so the
patch checks ->perf_refcount.
- TRACE_REG_UNREGISTER can be suppressed by FTRACE_EVENT_FL_SOFT_MODE,
so we simply check FTRACE_EVENT_FL_ENABLED protected by event_mutex.
Link: http://lkml.kernel.org/r/20130729175033.GB26284@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's been a nasty bug that would show up and not give much info.
The bug displayed the following warning:
WARNING: at kernel/trace/ftrace.c:1529 __ftrace_hash_rec_update+0x1e3/0x230()
Pid: 20903, comm: bash Tainted: G O 3.6.11+ #38405.trunk
Call Trace:
[<ffffffff8103e5ff>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8103e65a>] warn_slowpath_null+0x1a/0x20
[<ffffffff810c2ee3>] __ftrace_hash_rec_update+0x1e3/0x230
[<ffffffff810c4f28>] ftrace_hash_move+0x28/0x1d0
[<ffffffff811401cc>] ? kfree+0x2c/0x110
[<ffffffff810c68ee>] ftrace_regex_release+0x8e/0x150
[<ffffffff81149f1e>] __fput+0xae/0x220
[<ffffffff8114a09e>] ____fput+0xe/0x10
[<ffffffff8105fa22>] task_work_run+0x72/0x90
[<ffffffff810028ec>] do_notify_resume+0x6c/0xc0
[<ffffffff8126596e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[<ffffffff815c0f88>] int_signal+0x12/0x17
---[ end trace 793179526ee09b2c ]---
It was finally narrowed down to unloading a module that was being traced.
It was actually more than that. When functions are being traced, there's
a table of all functions that have a ref count of the number of active
tracers attached to that function. When a function trace callback is
registered to a function, the function's record ref count is incremented.
When it is unregistered, the function's record ref count is decremented.
If an inconsistency is detected (ref count goes below zero) the above
warning is shown and the function tracing is permanently disabled until
reboot.
The ftrace callback ops holds a hash of functions that it filters on
(and/or filters off). If the hash is empty, the default means to filter
all functions (for the filter_hash) or to disable no functions (for the
notrace_hash).
When a module is unloaded, it frees the function records that represent
the module functions. These records exist on their own pages, that is
function records for one module will not exist on the same page as
function records for other modules or even the core kernel.
Now when a module unloads, the records that represents its functions are
freed. When the module is loaded again, the records are recreated with
a default ref count of zero (unless there's a callback that traces all
functions, then they will also be traced, and the ref count will be
incremented).
The problem is that if an ftrace callback hash includes functions of the
module being unloaded, those hash entries will not be removed. If the
module is reloaded in the same location, the hash entries still point
to the functions of the module but the module's ref counts do not reflect
that.
With the help of Steve and Joern, we found a reproducer:
Using uinput module and uinput_release function.
cd /sys/kernel/debug/tracing
modprobe uinput
echo uinput_release > set_ftrace_filter
echo function > current_tracer
rmmod uinput
modprobe uinput
# check /proc/modules to see if loaded in same addr, otherwise try again
echo nop > current_tracer
[BOOM]
The above loads the uinput module, which creates a table of functions that
can be traced within the module.
We add uinput_release to the filter_hash to trace just that function.
Enable function tracincg, which increments the ref count of the record
associated to uinput_release.
Remove uinput, which frees the records including the one that represents
uinput_release.
Load the uinput module again (and make sure it's at the same address).
This recreates the function records all with a ref count of zero,
including uinput_release.
Disable function tracing, which will decrement the ref count for uinput_release
which is now zero because of the module removal and reload, and we have
a mismatch (below zero ref count).
The solution is to check all currently tracing ftrace callbacks to see if any
are tracing any of the module's functions when a module is loaded (it already does
that with callbacks that trace all functions). If a callback happens to have
a module function being traced, it increments that records ref count and starts
tracing that function.
There may be a strange side effect with this, where tracing module functions
on unload and then reloading a new module may have that new module's functions
being traced. This may be something that confuses the user, but it's not
a big deal. Another approach is to disable all callback hashes on module unload,
but this leaves some ftrace callbacks that may not be registered, but can
still have hashes tracing the module's function where ftrace doesn't know about
it. That situation can cause the same bug. This solution solves that case too.
Another benefit of this solution, is it is possible to trace a module's
function on unload and load.
Link: http://lkml.kernel.org/r/20130705142629.GA325@redhat.com
Reported-by: Jörn Engel <joern@logfs.org>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Steve Hodgson <steve@purestorage.com>
Tested-by: Steve Hodgson <steve@purestorage.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When ftrace ops modifies the functions that it will trace, the update
to the function mcount callers may need to be modified. Consolidate
the two places that do the checks to see if an update is required
with a wrapper function for those checks.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change remove_event_file_dir() to clear ->i_private for every
file we are going to remove.
We need to check file->dir != NULL because event_create_dir()
can fail. debugfs_remove_recursive(NULL) is fine but the patch
moves it under the same check anyway for readability.
spin_lock(d_lock) and "d_inode != NULL" check are not needed
afaics, but I do not understand this code enough.
tracing_open_generic_file() and tracing_release_generic_file()
can go away, ftrace_enable_fops and ftrace_event_filter_fops()
use tracing_open_generic() but only to check tracing_disabled.
This fixes all races with event_remove() or instance_delete().
f_op->read/write/whatever can never use the freed file/call,
all event/* files were changed to check and use ->i_private
under event_mutex.
Note: this doesn't not fix other problems, event_remove() can
destroy the active ftrace_event_call, we need more changes but
those changes are completely orthogonal.
Link: http://lkml.kernel.org/r/20130728183527.GB16723@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Preparation for the next patch. Extract the common code from
remove_event_from_tracers() and __trace_remove_event_dirs()
into the new helper, remove_event_file_dir().
The patch looks more complicated than it actually is, it also
moves remove_subsystem() up to avoid the forward declaration.
Link: http://lkml.kernel.org/r/20130726172547.GA3629@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_format_open() and trace_format_seq_ops are racy, nothing
protects ftrace_event_call from trace_remove_event_call().
Change f_start() to take event_mutex and verify i_private != NULL,
change f_stop() to drop this lock.
This fixes nothing, but now we can change debugfs_remove("format")
callers to nullify ->i_private and fix the the problem.
Note: the usage of event_mutex is sub-optimal but simple, we can
change this later.
Link: http://lkml.kernel.org/r/20130726172543.GA3622@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
event_filter_read/write() are racy, ftrace_event_call can be already
freed by trace_remove_event_call() callers.
1. Shift mutex_lock(event_mutex) from print/apply_event_filter to
the callers.
2. Change the callers, event_filter_read() and event_filter_write()
to read i_private under this mutex and abort if it is NULL.
This fixes nothing, but now we can change debugfs_remove("filter")
callers to nullify ->i_private and fix the the problem.
Link: http://lkml.kernel.org/r/20130726172540.GA3619@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_generic_file() is racy, ftrace_event_file can be
already freed by rmdir or trace_remove_event_call().
Change event_enable_read() and event_disable_read() to read and
verify "file = i_private" under event_mutex.
This fixes nothing, but now we can change debugfs_remove("enable")
callers to nullify ->i_private and fix the the problem.
Link: http://lkml.kernel.org/r/20130726172536.GA3612@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
event_id_read() is racy, ftrace_event_call can be already freed
by trace_remove_event_call() callers.
Change event_create_dir() to pass "data = call->event.type", this
is all event_id_read() needs. ftrace_event_id_fops no longer needs
tracing_open_generic().
We add the new helper, event_file_data(), to read ->i_private, it
will have more users.
Note: currently ACCESS_ONCE() and "id != 0" check are not needed,
but we are going to change event_remove/rmdir to clear ->i_private.
Link: http://lkml.kernel.org/r/20130726172532.GA3605@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are several tracepoints (mostly in RCU), that reference a string
pointer and uses the print format of "%s" to display the string that
exists in the kernel, instead of copying the actual string to the
ring buffer (saves time and ring buffer space).
But this has an issue with userspace tools that read the binary buffers
that has the address of the string but has no access to what the string
itself is. The end result is just output that looks like:
rcu_dyntick: ffffffff818adeaa 1 0
rcu_dyntick: ffffffff818adeb5 0 140000000000000
rcu_dyntick: ffffffff818adeb5 0 140000000000000
rcu_utilization: ffffffff8184333b
rcu_utilization: ffffffff8184333b
The above is pretty useless when read by the userspace tools. Ideally
we would want something that looks like this:
rcu_dyntick: Start 1 0
rcu_dyntick: End 0 140000000000000
rcu_dyntick: Start 140000000000000 0
rcu_callback: rcu_preempt rhp=0xffff880037aff710 func=put_cred_rcu 0/4
rcu_callback: rcu_preempt rhp=0xffff880078961980 func=file_free_rcu 0/5
rcu_dyntick: End 0 1
The trace_printk() which also only stores the address of the string
format instead of recording the string into the buffer itself, exports
the mapping of kernel addresses to format strings via the printk_format
file in the debugfs tracing directory.
The tracepoint strings can use this same method and output the format
to the same file and the userspace tools will be able to decipher
the address without any modification.
The tracepoint strings need its own section to save the strings because
the trace_printk section will cause the trace_printk() buffers to be
allocated if anything exists within the section. trace_printk() is only
used for debugging and should never exist in the kernel, we can not use
the trace_printk sections.
Add a new tracepoint_str section that will also be examined by the output
of the printk_format file.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit a82274151a "tracing: Protect ftrace_trace_arrays list in trace_events.c"
added taking the trace_types_lock mutex in trace_events.c as there were
several locations that needed it for protection. Unfortunately, it also
encapsulated a call to tracing_reset_all_online_cpus() which also takes
the trace_types_lock, causing a deadlock.
This happens when a module has tracepoints and has been traced. When the
module is removed, the trace events module notifier will grab the
trace_types_lock, do a bunch of clean ups, and also clears the buffer
by calling tracing_reset_all_online_cpus. This doesn't happen often
which explains why it wasn't caught right away.
Commit a82274151a was marked for stable, which means this must be
sent to stable too.
Link: http://lkml.kernel.org/r/51EEC646.7070306@broadcom.com
Reported-by: Arend van Spril <arend@broadcom.com>
Tested-by: Arend van Spriel <arend@broadcom.com>
Cc: Alexander Z Lam <azl@google.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a ftrace ops is registered with the SAVE_REGS flag set, and there's
already a ops registered to one of its functions but without the
SAVE_REGS flag, there's a small race window where the SAVE_REGS ops gets
added to the list of callbacks to call for that function before the
callback trampoline gets set to save the regs.
The problem is, the function is not currently saving regs, which opens
a small race window where the ops that is expecting regs to be passed
to it, wont. This can cause a crash if the callback were to reference
the regs, as the SAVE_REGS guarantees that regs will be set.
To fix this, we add a check in the loop case where it checks if the ops
has the SAVE_REGS flag set, and if so, it will ignore it if regs is
not set.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After the previous changes trace_array_cpu->trace_cpu and
trace_array->trace_cpu becomes write-only. Remove these members
and kill "struct trace_cpu" as well.
As a side effect this also removes memset(per_cpu_memory, 0).
It was not needed, alloc_percpu() returns zero-filled memory.
Link: http://lkml.kernel.org/r/20130723152613.GA23741@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open() and tracing_snapshot_open() are racy, the memory
inode->i_private points to can be already freed.
Convert these last users of "inode->i_private == trace_cpu" to
use "i_private = trace_array" and rely on tracing_get_cpu().
v2: incorporate the fix from Steven, tracing_release() must not
blindly dereference file->private_data unless we know that
the file was opened for reading.
Link: http://lkml.kernel.org/r/20130723152610.GA23737@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_generic_tc() is racy, the memory inode->i_private
points to can be already freed.
1. Change its last user, tracing_entries_fops, to use
tracing_*_generic_tr() instead.
2. Change debugfs_create_file("buffer_size_kb", data) callers
to pass "data = tr".
3. Change tracing_entries_read() and tracing_entries_write() to
use tracing_get_cpu().
4. Kill the no longer used tracing_open_generic_tc() and
tracing_release_generic_tc().
Link: http://lkml.kernel.org/r/20130723152606.GA23730@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_generic_tc() is racy, the memory inode->i_private
points to can be already freed.
1. Change one of its users, tracing_stats_fops, to use
tracing_*_generic_tr() instead.
2. Change trace_create_cpu_file("stats", data) to pass "data = tr".
3. Change tracing_stats_read() to use tracing_get_cpu().
Link: http://lkml.kernel.org/r/20130723152603.GA23727@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_buffers_open() is racy, the memory inode->i_private points
to can be already freed.
Change debugfs_create_file("trace_pipe_raw", data) caller to pass
"data = tr", tracing_buffers_open() can use tracing_get_cpu().
Change debugfs_create_file("snapshot_raw_fops", data) caller too,
this file uses tracing_buffers_open/release.
Link: http://lkml.kernel.org/r/20130723152600.GA23720@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_pipe() is racy, the memory inode->i_private points to
can be already freed.
Change debugfs_create_file("trace_pipe", data) callers to to pass
"data = tr", tracing_open_pipe() can use tracing_get_cpu().
Link: http://lkml.kernel.org/r/20130723152557.GA23717@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every "file_operations" used by tracing_init_debugfs_percpu is buggy.
f_op->open/etc does:
1. struct trace_cpu *tc = inode->i_private;
struct trace_array *tr = tc->tr;
2. trace_array_get(tr) or fail;
3. do_something(tc);
But tc (and tr) can be already freed before trace_array_get() is called.
And it doesn't matter whether this file is per-cpu or it was created by
init_tracer_debugfs(), free_percpu() or kfree() are equally bad.
Note that even 1. is not safe, the freed memory can be unmapped. But even
if it was safe trace_array_get() can wrongly succeed if we also race with
the next new_instance_create() which can re-allocate the same tr, or tc
was overwritten and ->tr points to the valid tr. In this case 3. uses the
freed/reused memory.
Add the new trivial helper, trace_create_cpu_file() which simply calls
trace_create_file() and encodes "cpu" in "struct inode". Another helper,
tracing_get_cpu() will be used to read cpu_nr-or-RING_BUFFER_ALL_CPUS.
The patch abuses ->i_cdev to encode the number, it is never used unless
the file is S_ISCHR(). But we could use something else, say, i_bytes or
even ->d_fsdata. In any case this hack is hidden inside these 2 helpers,
it would be trivial to change them if needed.
This patch only changes tracing_init_debugfs_percpu() to use the new
trace_create_cpu_file(), the next patches will change file_operations.
Note: tracing_get_cpu(inode) is always safe but you can't trust the
result unless trace_array_get() was called, without trace_types_lock
which acts as a barrier it can wrongly return RING_BUFFER_ALL_CPUS.
Link: http://lkml.kernel.org/r/20130723152554.GA23710@redhat.com
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_buffers_open() does trace_array_get() and then it wrongly
inrcements tr->ref again under trace_types_lock. This means that
every caller leaks trace_array:
# cd /sys/kernel/debug/tracing/
# mkdir instances/X
# true < instances/X/per_cpu/cpu0/trace_pipe_raw
# rmdir instances/X
rmdir: failed to remove `instances/X': Device or resource busy
Link: http://lkml.kernel.org/r/20130719153644.GA18899@redhat.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Trivial. trace_array->waiter has no users since 6eaaa5d5
"tracing/core: use appropriate waiting on trace_pipe".
Link: http://lkml.kernel.org/r/20130719142036.GA1594@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
event_id_read() has no reason to kmalloc "struct trace_seq"
(more than PAGE_SIZE!), it can use a small buffer instead.
Note: "if (*ppos) return 0" looks strange and even wrong,
simple_read_from_buffer() handles ppos != 0 case corrrectly.
And it seems that almost every user of trace_seq in this file
should be converted too. Unless you use seq_open(), trace_seq
buys nothing compared to the raw buffer, but it needs a bit
more memory and code.
Link: http://lkml.kernel.org/r/20130718184712.GA4786@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
f_next() looks overcomplicated, and it is not strictly correct
even if this doesn't matter.
Say, FORMAT_FIELD_SEPERATOR should not return NULL (means EOF)
if trace_get_fields() returns an empty list, we should simply
advance to FORMAT_PRINTFMT as we do when we find the end of list.
1. Change f_next() to return "struct list_head *" rather than
"ftrace_event_field *", and change f_show() to do list_entry().
This simplifies the code a bit, only f_show() needs to know
about ftrace_event_field, and f_next() can play with ->prev
directly
2. Change f_next() to not play with ->prev / return inside the
switch() statement. It can simply set node = head/common_head,
the prev-or-advance-to-the-next-magic below does all work.
While at it. f_start() looks overcomplicated too. I don't think
*pos == 0 makes sense as a separate case, just change this code
to do "while" instead of "do/while".
The patch also moves f_start() down, close to f_stop(). This is
purely cosmetic, just to make the locking added by the next patch
more clear/visible.
Link: http://lkml.kernel.org/r/20130718184710.GA4783@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The selftest for function and function graph tracers are defined as
__init, as they are only executed at boot up. The "tracer" structs
that are associated to those tracers are not setup as __init as they
are used after boot. To stop mismatch warnings, those structures
need to be annotated with __ref_data.
Currently, the tracer structures are defined to __read_mostly, as they
do not really change. But in the future they should be converted to
consts, but that will take a little work because they have a "next"
pointer that gets updated when they are registered. That will have to
wait till the next major release.
Link: http://lkml.kernel.org/r/1373596735.17876.84.camel@gandalf.local.home
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some error paths did not handle ref counting properly, and some trace files need
ref counting.
Link: http://lkml.kernel.org/r/1374171524-11948-1-git-send-email-azl@google.com
Cc: stable@vger.kernel.org # 3.10
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove debugfs directories for tracing instances during creation if an error
occurs causing the trace_array for that instance to not be added to
ftrace_trace_arrays. If the directory continues to exist after the error, it
cannot be removed because the respective trace_array is not in
ftrace_trace_arrays.
Link: http://lkml.kernel.org/r/1373502874-1706-2-git-send-email-azl@google.com
Cc: stable@vger.kernel.org # 3.10
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Wait for disabling all running kprobe handlers when a kprobe
event is disabled, since the caller, trace_remove_event_call()
supposes that a removing event is disabled completely by
disabling the event.
With this change, ftrace can ensure that there is no running
event handlers after disabling it.
Link: http://lkml.kernel.org/r/20130709093526.20138.93100.stgit@mhiramat-M0-7522
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every perf_trace_buf_prepare() caller does
WARN_ONCE(size > PERF_MAX_TRACE_SIZE, message) and "message" is
almost the same.
Shift this WARN_ONCE() into perf_trace_buf_prepare(). This changes
the meaning of _ONCE, but I think this is fine.
- 4947014 2932448 10104832 17984294 1126b26 vmlinux
+ 4948422 2932448 10104832 17985702 11270a6 vmlinux
on my build.
Link: http://lkml.kernel.org/r/20130617170211.GA19813@redhat.com
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
perf_trace_buf_prepare() + perf_trace_buf_submit(head, task => NULL)
make no sense if hlist_empty(head). Change perf_syscall_enter/exit()
to check sys_data->{enter,exit}_event->perf_events beforehand.
Link: http://lkml.kernel.org/r/20130617170207.GA19806@redhat.com
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
perf_trace_buf_prepare() + perf_trace_buf_submit(head, task => NULL)
make no sense if hlist_empty(head). Change perf_ftrace_function_call()
to check event_function.perf_events beforehand.
Link: http://lkml.kernel.org/r/20130617170204.GA19803@redhat.com
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There have some mismatch between comments with
real function name, update it.
This patch also add some missed function arguments
description.
Link: http://lkml.kernel.org/r/51E3B3B2.4080307@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For string without format specifiers, use trace_seq_puts()
or trace_seq_putc().
Link: http://lkml.kernel.org/r/51E3B3AC.1000605@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ fixed a trace_seq_putc(s, " ") to trace_seq_putc(s, ' ') ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We should use CONFIG_STACK_TRACER to guard readme text
of stack tracer related file, not CONFIG_STACKTRACE.
Link: http://lkml.kernel.org/r/51E3B3A2.8080609@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
were added to 3.10, which includes several bug fixes that have been
marked for stable.
As for new features, there were a few, but nothing to write to LWN about.
These include:
New function trigger called "dump" and "cpudump" that will cause
ftrace to dump its buffer to the console when the function is called.
The difference between "dump" and "cpudump" is that "dump" will dump
the entire contents of the ftrace buffer, where as "cpudump" will only
dump the contents of the ftrace buffer for the CPU that called the function.
Another small enhancement is a new sysctl switch called "traceoff_on_warning"
which, when enabled, will disable tracing if any WARN_ON() is triggered.
This is useful if you want to debug what caused a warning and do not
want to risk losing your trace data by the ring buffer overwriting the
data before you can disable it. There's also a kernel command line
option that will make this enabled at boot up called the same thing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJR1uF2AAoJEOdOSU1xswtMJ1IH/2LSiZAKTA2QaRgGQC/5Bb9c
XSOI1HfD/78lmUvTyb0AX8sLpkzZlvIONEQ/WaZUFo1Zjbrl45zJUwMkTE9uImEg
ZqI5x8OiiN6j4XrRbfYn3Ti060H/Jq41pZXa+shh961Vv51ilv/1yyLkoRmnjzuO
JTloPdXDV7icOqqiSdgxSdtUSv59Ef1ZdHgvvsb3aqzMC5btVQPi4kIys0ST1Tr1
pMWBY+UgvH0xYm3gvTR+W6jjDlkVZEH2alkmcinfr+uC1tm9DDqK2HA17Pd5yZ5z
HNdT76lCzf9iqRF5F8HUvUt+PIp76dNNxAt2qpB6APqAuJTojyguxXHDbY/0kzs=
=UvLi
-----END PGP SIGNATURE-----
Merge tag 'trace-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing changes from Steven Rostedt:
"The majority of the changes here are cleanups for the large changes
that were added to 3.10, which includes several bug fixes that have
been marked for stable.
As for new features, there were a few, but nothing to write to LWN
about. These include:
New function trigger called "dump" and "cpudump" that will cause
ftrace to dump its buffer to the console when the function is called.
The difference between "dump" and "cpudump" is that "dump" will dump
the entire contents of the ftrace buffer, where as "cpudump" will only
dump the contents of the ftrace buffer for the CPU that called the
function.
Another small enhancement is a new sysctl switch called
"traceoff_on_warning" which, when enabled, will disable tracing if any
WARN_ON() is triggered. This is useful if you want to debug what
caused a warning and do not want to risk losing your trace data by the
ring buffer overwriting the data before you can disable it. There's
also a kernel command line option that will make this enabled at boot
up called the same thing"
* tag 'trace-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (34 commits)
tracing: Make tracing_open_generic_{tr,tc}() static
tracing: Remove ftrace() function
tracing: Remove TRACE_EVENT_TYPE enum definition
tracing: Make tracer_tracing_{off,on,is_on}() static
tracing: Fix irqs-off tag display in syscall tracing
uprobes: Fix return value in error handling path
tracing: Fix race between deleting buffer and setting events
tracing: Add trace_array_get/put() to event handling
tracing: Get trace_array ref counts when accessing trace files
tracing: Add trace_array_get/put() to handle instance refs better
tracing: Protect ftrace_trace_arrays list in trace_events.c
tracing: Make trace_marker use the correct per-instance buffer
ftrace: Do not run selftest if command line parameter is set
tracing/kprobes: Don't pass addr=ip to perf_trace_buf_submit()
tracing: Use flag buffer_disabled for irqsoff tracer
tracing/kprobes: Turn trace_probe->files into list_head
tracing: Fix disabling of soft disable
tracing: Add missing syscall_metadata comment
tracing: Simplify code for showing of soft disabled flag
tracing/kprobes: Kill probe_enable_lock
...
I have patches that will use tracing_open_generic_tr/tc() in other
files, but as they are not ready to be merged yet, and Fengguang Wu's
sparse scripts pointed out that these functions were not declared
anywhere, I'll make them static for now.
When these functions are required to be used elsewhere, I'll remove
the static then.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I have patches that will use tracer_tracing_on/off/is_on() in other
files, but as they are not ready to be merged yet, and Fengguang Wu's
sparse scripts pointed out that these functions were not declared
anywhere, I'll make them static for now.
When these functions are required to be used elsewhere, I'll remove
the static then.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While analyzing the code, I discovered that there's a potential race between
deleting a trace instance and setting events. There are a few races that can
occur if events are being traced as the buffer is being deleted. Mostly the
problem comes with freeing the descriptor used by the trace event callback.
To prevent problems like this, the events are disabled before the buffer is
deleted. The problem with the current solution is that the event_mutex is let
go between disabling the events and freeing the files, which means that the events
could be enabled again while the freeing takes place.
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit a695cb5816 "tracing: Prevent deleting instances when they are being read"
tried to fix a race between deleting a trace instance and reading contents
of a trace file. But it wasn't good enough. The following could crash the kernel:
# cd /sys/kernel/debug/tracing/instances
# ( while :; do mkdir foo; rmdir foo; done ) &
# ( while :; do echo 1 > foo/events/sched/sched_switch 2> /dev/null; done ) &
Luckily this can only be done by root user, but it should be fixed regardless.
The problem is that a delete of the file can happen after the write to the event
is opened, but before the enabling happens.
The solution is to make sure the trace_array is available before succeeding in
opening for write, and incerment the ref counter while opened.
Now the instance can be deleted when the events are writing to the buffer,
but the deletion of the instance will disable all events before the instance
is actually deleted.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a trace file is opened that may access a trace array, it must
increment its ref count to prevent it from being deleted.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit a695cb5816 "tracing: Prevent deleting instances when they are being read"
tried to fix a race between deleting a trace instance and reading contents
of a trace file. But it wasn't good enough. The following could crash the kernel:
# cd /sys/kernel/debug/tracing/instances
# ( while :; do mkdir foo; rmdir foo; done ) &
# ( while :; do cat foo/trace &> /dev/null; done ) &
Luckily this can only be done by root user, but it should be fixed regardless.
The problem is that a delete of the file can happen after the reader starts
to open the file but before it grabs the trace_types_mutex.
The solution is to validate the trace array before using it. If the trace
array does not exist in the list of trace arrays, then it returns -ENODEV.
There's a possibility that a trace_array could be deleted and a new one
created and the open would open its file instead. But that is very minor as
it will just return the data of the new trace array, it may confuse the user
but it will not crash the system. As this can only be done by root anyway,
the race will only occur if root is deleting what its trying to read at
the same time.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are multiple places where the ftrace_trace_arrays list is accessed in
trace_events.c without the trace_types_lock held.
Link: http://lkml.kernel.org/r/1372732674-22726-1-git-send-email-azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_marker file was present for each new instance created, but it
added the trace mark to the global trace buffer instead of to
the instance's buffer.
Link: http://lkml.kernel.org/r/1372717885-4543-2-git-send-email-azl@google.com
Cc: David Sharp <dhsharp@google.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the kernel command line ftrace filter parameters are set
(ftrace_filter or ftrace_notrace), force the function self test to
pass, with a warning why it was forced.
If the user adds a filter to the kernel command line, it is assumed
that they know what they are doing, and the self test should just not
run instead of failing (which disables function tracing) or clearing
the filter, as that will probably annoy the user.
If the user wants the selftest to run, the message will tell them why
it did not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kprobe_perf_func() and kretprobe_perf_func() pass addr=ip to
perf_trace_buf_submit() for no reason.
This sets perf_sample_data->addr for PERF_SAMPLE_ADDR, we already
have perf_sample_data->ip initialized if PERF_SAMPLE_IP.
Link: http://lkml.kernel.org/r/20130620173811.GA13161@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the ring buffer is disabled and the irqsoff tracer records a trace it
will clear out its buffer and lose the data it had previously recorded.
Currently there's a callback when writing to the tracing_of file, but if
tracing is disabled via the function tracer trigger, it will not inform
the irqsoff tracer to stop recording.
By using the "mirror" flag (buffer_disabled) in the trace_array, that keeps
track of the status of the trace_array's buffer, it gives the irqsoff
tracer a fast way to know if it should record a new trace or not.
The flag may be a little behind the real state of the buffer, but it
should not affect the trace too much. It's more important for the irqsoff
tracer to be fast.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I think that "ftrace_event_file *trace_probe[]" complicates the
code for no reason, turn it into list_head to simplify the code.
enable_trace_probe() no longer needs synchronize_sched().
This needs the extra sizeof(list_head) memory for every attached
ftrace_event_file, hopefully not a problem in this case.
Link: http://lkml.kernel.org/r/20130620173814.GA13165@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The comment on the soft disable 'disable' case of
__ftrace_event_enable_disable() states that the soft disable bit
should be cleared in that case, but currently only the soft mode bit
is actually cleared.
This essentially leaves the standard non-soft-enable enable/disable
paths as the only way to clear the soft disable flag, but the soft
disable bit should also be cleared when removing a trigger with '!'.
Also, the SOFT_DISABLED bit should never be set if SOFT_MODE is
cleared.
This fixes the above discrepancies.
Link: http://lkml.kernel.org/r/b9c68dd50bc07019e6c67d3f9b29be4ef1b2badb.1372479499.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
enable_trace_probe() and disable_trace_probe() should not worry about
serialization, the caller (perf_trace_init or __ftrace_set_clr_event)
holds event_mutex.
They are also called by kprobe_trace_self_tests_init(), but this __init
function can't race with itself or trace_events.c
And note that this code depended on event_mutex even before 41a7dd420c
which introduced probe_enable_lock. In fact it assumes that the caller
kprobe_register() can never race with itself. Otherwise, say, tp->flags
manipulations are racy.
Link: http://lkml.kernel.org/r/20130620173809.GA13158@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change kprobe_perf_func()
and kretprobe_perf_func() to check call->perf_events beforehand
and return if this list is empty.
For example, "perf record -e some_probe -p1". Only /sbin/init will
report, all other threads which hit the same probe will do
perf_trace_buf_prepare/perf_trace_buf_submit just to realize that
nobody wants perf_swevent_event().
Link: http://lkml.kernel.org/r/20130620173806.GA13151@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Running the following:
# cd /sys/kernel/debug/tracing
# echo p:i do_sys_open > kprobe_events
# echo p:j schedule >> kprobe_events
# cat kprobe_events
p:kprobes/i do_sys_open
p:kprobes/j schedule
# echo p:i do_sys_open >> kprobe_events
# cat kprobe_events
p:kprobes/j schedule
p:kprobes/i do_sys_open
# ls /sys/kernel/debug/tracing/events/kprobes/
enable filter j
Notice that the 'i' is missing from the kprobes directory.
The console produces:
"Failed to create system directory kprobes"
This is because kprobes passes in a allocated name for the system
and the ftrace event subsystem saves off that name instead of creating
a duplicate for it. But the kprobes may free the system name making
the pointer to it invalid.
This bug was introduced by 92edca073c "tracing: Use direct field, type
and system names" which switched from using kstrdup() on the system name
in favor of just keeping apointer to it, as the internal ftrace event
system names are static and exist for the life of the computer being booted.
Instead of reverting back to duplicating system names again, we can use
core_kernel_data() to determine if the passed in name was allocated or
static. Then use the MSB of the ref_count to be a flag to keep track if
the name was allocated or not. Then we can still save from having to duplicate
strings that will always exist, but still copy the ones that may be freed.
Cc: stable@vger.kernel.org # 3.10
Reported-by: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since tp->flags assignment was moved into function enable_trace_probe(),
there is no need to use trace_probe_is_enabled to check flags
in the same function.
Remove the unnecessary checking.
Link: http://lkml.kernel.org/r/51BA7B9E.3040807@huawei.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a traceoff_on_warning option in both the kernel command line as well
as a sysctl option. When set, any WARN*() function that is hit will cause
the tracing_on variable to be cleared, which disables writing to the
ring buffer.
This is useful especially when tracing a bug with function tracing. When
a warning is hit, the print caused by the warning can flood the trace with
the functions that producing the output for the warning. This can make the
resulting trace useless by either hiding where the bug happened, or worse,
by overflowing the buffer and losing the trace of the bug totally.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are some cases when filtering on a set flag of a field of a tracepoint
is useful. But currently the only filtering commands for numbered fields
is ==, !=, <, <=, >, >=. This does not help when you just want to trace if
a specific flag is set. For example:
> # sudo trace-cmd record -e brcmfmac:brcmf_dbg -f 'level & 0x40000'
> disable all
> enable brcmfmac:brcmf_dbg
> path = /sys/kernel/debug/tracing/events/brcmfmac/brcmf_dbg/enable
> (level & 0x40000)
> ^
> parse_error: Invalid operator
>
When trying to trace brcmf_dbg when level has its 1 << 18 bit set, the
filter fails to perform.
By allowing a binary '&' operation, this gives the user the ability to
test a bit.
Note, a binary '|' is not added, as it doesn't make sense as fields must
be compared to constants (for now), and ORing a constant will always return
true.
Link: http://lkml.kernel.org/r/1371057385.9844.261.camel@gandalf.local.home
Suggested-by: Arend van Spriel <arend@broadcom.com>
Tested-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracer uses preempt_disable/enable_notrace() for
synchronization between reading registered ftrace_ops and unregistering
them.
Most of the ftrace_ops are global permanent structures that do not
require this synchronization. That is, ops may be added and removed from
the hlist but are never freed, and wont hurt if a synchronization is
missed.
But this is not true for dynamically created ftrace_ops or control_ops,
which are used by the perf function tracing.
The problem here is that the function tracer can be used to trace
kernel/user context switches as well as going to and from idle.
Basically, it can be used to trace blind spots of the RCU subsystem.
This means that even though preempt_disable() is done, a
synchronize_sched() will ignore CPUs that haven't made it out of user
space or idle. These can include functions that are being traced just
before entering or exiting the kernel sections.
To implement the RCU synchronization, instead of using
synchronize_sched() the use of schedule_on_each_cpu() is performed. This
means that when a dynamically allocated ftrace_ops, or a control ops is
being unregistered, all CPUs must be touched and execute a ftrace_sync()
stub function via the work queues. This will rip CPUs out from idle or
in dynamic tick mode. This only happens when a user disables perf
function tracing or other dynamically allocated function tracers, but it
allows us to continue to debug RCU and context tracking with function
tracing.
Link: http://lkml.kernel.org/r/1369785676.15552.55.camel@gandalf.local.home
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 4f271a2a60
(tracing: Add a proc file to stop tracing and free buffer)
implement a method to free up ring buffer in kernel memory
in the release code path of free_buffer's fd.
Then we don't need read/write support for free_buffer,
indeed we just have a dummy write fop, and don't implement read fop.
So the 0200 is more reasonable file mode for free_buffer than
the current file mode 0644.
Link: http://lkml.kernel.org/r/20130526085201.GA3183@udknight
Acked-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Acked-by: David Sharp <dhsharp@google.com>
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the "cpudump" command to have the current CPU ftrace buffer dumped
to console if a function is hit. This is useful when debugging a
tripple fault, where you have an idea of a function that is called
just before the tripple fault occurs, and can tell ftrace to dump its
content out to the console before it continues.
This differs from the "dump" command as it only dumps the content of
the ring buffer for the currently executing CPU, and does not show
the contents of the other CPUs.
Format is:
<function>:cpudump
echo 'bad_address:cpudump' > /debug/tracing/set_ftrace_filter
To remove this:
echo '!bad_address:cpudump' > /debug/tracing/set_ftrace_filter
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the "dump" command to have the ftrace buffer dumped to console if
a function is hit. This is useful when debugging a tripple fault,
where you have an idea of a function that is called just before the
tripple fault occurs, and can tell ftrace to dump its content out
to the console before it continues.
Format is:
<function>:dump
echo 'bad_address:dump' > /debug/tracing/set_ftrace_filter
To remove this:
echo '!bad_address:dump' > /debug/tracing/set_ftrace_filter
Requested-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Outputting formats of x86-tsc and counter should be a raw format, but after
applying the patch(2b6080f28c), the format was
changed to nanosec. This is because the global variable trace_clock_id was used.
When we use multiple buffers, clock_id of each sub-buffer should be used. Then,
this patch uses tr->clock_id instead of the global variable trace_clock_id.
[ Basically, this fixes a regression where the multibuffer code changed the
trace_clock file to update tr->clock_id but the traces still use the old
global trace_clock_id variable, negating the file's effect. The global
trace_clock_id variable is obsolete and removed. - SR ]
Link: http://lkml.kernel.org/r/20130423013239.22334.7394.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The irqsoff tracer records the max time that interrupts are disabled.
There are hooks in the assembly code that calls back into the tracer when
interrupts are disabled or enabled.
When they are enabled, the tracer checks if the amount of time they
were disabled is larger than the previous recorded max interrupts off
time. If it is, it creates a snapshot of the currently running trace
to store where the last largest interrupts off time was held and how
it happened.
During testing, this RCU lockdep dump appeared:
[ 1257.829021] ===============================
[ 1257.829021] [ INFO: suspicious RCU usage. ]
[ 1257.829021] 3.10.0-rc1-test+ #171 Tainted: G W
[ 1257.829021] -------------------------------
[ 1257.829021] /home/rostedt/work/git/linux-trace.git/include/linux/rcupdate.h:780 rcu_read_lock() used illegally while idle!
[ 1257.829021]
[ 1257.829021] other info that might help us debug this:
[ 1257.829021]
[ 1257.829021]
[ 1257.829021] RCU used illegally from idle CPU!
[ 1257.829021] rcu_scheduler_active = 1, debug_locks = 0
[ 1257.829021] RCU used illegally from extended quiescent state!
[ 1257.829021] 2 locks held by trace-cmd/4831:
[ 1257.829021] #0: (max_trace_lock){......}, at: [<ffffffff810e2b77>] stop_critical_timing+0x1a3/0x209
[ 1257.829021] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff810dae5a>] __update_max_tr+0x88/0x1ee
[ 1257.829021]
[ 1257.829021] stack backtrace:
[ 1257.829021] CPU: 3 PID: 4831 Comm: trace-cmd Tainted: G W 3.10.0-rc1-test+ #171
[ 1257.829021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[ 1257.829021] 0000000000000001 ffff880065f49da8 ffffffff8153dd2b ffff880065f49dd8
[ 1257.829021] ffffffff81092a00 ffff88006bd78680 ffff88007add7500 0000000000000003
[ 1257.829021] ffff88006bd78680 ffff880065f49e18 ffffffff810daebf ffffffff810dae5a
[ 1257.829021] Call Trace:
[ 1257.829021] [<ffffffff8153dd2b>] dump_stack+0x19/0x1b
[ 1257.829021] [<ffffffff81092a00>] lockdep_rcu_suspicious+0x109/0x112
[ 1257.829021] [<ffffffff810daebf>] __update_max_tr+0xed/0x1ee
[ 1257.829021] [<ffffffff810dae5a>] ? __update_max_tr+0x88/0x1ee
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810dbf85>] update_max_tr_single+0x11d/0x12d
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810e2b15>] stop_critical_timing+0x141/0x209
[ 1257.829021] [<ffffffff8109569a>] ? trace_hardirqs_on+0xd/0xf
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810e3057>] time_hardirqs_on+0x2a/0x2f
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff8109550c>] trace_hardirqs_on_caller+0x16/0x197
[ 1257.829021] [<ffffffff8109569a>] trace_hardirqs_on+0xd/0xf
[ 1257.829021] [<ffffffff811002b9>] user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810029b4>] do_notify_resume+0x92/0x97
[ 1257.829021] [<ffffffff8154bdca>] int_signal+0x12/0x17
What happened was entering into the user code, the interrupts were enabled
and a max interrupts off was recorded. The trace buffer was saved along with
various information about the task: comm, pid, uid, priority, etc.
The uid is recorded with task_uid(tsk). But this is a macro that uses rcu_read_lock()
to retrieve the data, and this happened to happen where RCU is blind (user_enter).
As only the preempt and irqs off tracers can have this happen, and they both
only have the tsk == current, if tsk == current, use current_uid() instead of
task_uid(), as current_uid() does not use RCU as only current can change its uid.
This fixes the RCU suspicious splat.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The branch selftest calls trace_test_buffer(), but with the new code
it expects the first parameter to be a pointer to a struct trace_buffer.
All self tests were changed but the branch selftest was missed.
This caused either a crash or failed test when the branch selftest was
enabled.
Link: http://lkml.kernel.org/r/20130529141333.GA24064@localhost
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As rcu_dereference_raw() under RCU debug config options can add quite a
bit of checks, and that tracing uses rcu_dereference_raw(), these checks
happen with the function tracer. The function tracer also happens to trace
these debug checks too. This added overhead can livelock the system.
Have the function tracer use the new RCU _notrace equivalents that do
not do the debug checks for RCU.
Link: http://lkml.kernel.org/r/20130528184209.467603904@goodmis.org
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing infrastructure sets up for possible CPUs, but it uses
the ring buffer polling, it is possible to call the ring buffer
polling code with a CPU that hasn't been allocated. This will cause
a kernel oops when it access a ring buffer cpu buffer that is part
of the possible cpus but hasn't been allocated yet as the CPU has never
been online.
Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If ftrace=<tracer> is on the kernel command line, when that tracer is
registered, it will be initiated by tracing_set_tracer() to execute that
tracer.
The nop tracer is just a stub tracer that is used to have no tracer
enabled. It is assigned at early bootup as it is the default tracer.
But if ftrace=nop is on the kernel command line, the registering of the
nop tracer will call tracing_set_tracer() which will try to execute
the nop tracer. But it expects tr->current_trace to be assigned something
as it usually is assigned to the nop tracer. As it hasn't been assigned
to anything yet, it causes the system to crash.
The simple fix is to move the tr->current_trace = nop before registering
the nop tracer. The functionality is still the same as the nop tracer
doesn't do anything anyway.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since try_module_get() returns false( = 0) when it fails to
pindown a module, event_enable_func() returns 0 which means
"succeed". This can cause a kernel panic when the entry
is removed, because the event is already released.
This fixes the bug by returning -EBUSY, because the reason
why it fails is that the module is being removed at that time.
Link: http://lkml.kernel.org/r/20130516114848.13508.97899.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to sparse warning, print_*probe_event static because
those functions are not directly called from outside.
Link: http://lkml.kernel.org/r/20130513115839.6545.83067.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use rcu_dereference_raw() for accessing tp->files. Because the
write-side uses rcu_assign_pointer() for memory barrier,
the read-side also has to use rcu_dereference_raw() with
read memory barrier.
Link: http://lkml.kernel.org/r/20130513115834.6545.17022.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Special preds are created when folding a series of preds that
can be done in serial. These are allocated in an ops field of
the pred structure. But they were never freed, causing memory
leaks.
This was discovered using the kmemleak checker:
unreferenced object 0xffff8800797fd5e0 (size 32):
comm "swapper/0", pid 1, jiffies 4294690605 (age 104.608s)
hex dump (first 32 bytes):
00 00 01 00 03 00 05 00 07 00 09 00 0b 00 0d 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff814b52af>] kmemleak_alloc+0x73/0x98
[<ffffffff8111ff84>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
[<ffffffff81120e68>] __kmalloc+0xd7/0x125
[<ffffffff810d47eb>] kcalloc.constprop.24+0x2d/0x2f
[<ffffffff810d4896>] fold_pred_tree_cb+0xa9/0xf4
[<ffffffff810d3781>] walk_pred_tree+0x47/0xcc
[<ffffffff810d5030>] replace_preds.isra.20+0x6f8/0x72f
[<ffffffff810d50b5>] create_filter+0x4e/0x8b
[<ffffffff81b1c30d>] ftrace_test_event_filter+0x5a/0x155
[<ffffffff8100028d>] do_one_initcall+0xa0/0x137
[<ffffffff81afbedf>] kernel_init_freeable+0x14d/0x1dc
[<ffffffff814b24b7>] kernel_init+0xe/0xdb
[<ffffffff814d539c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kprobes up to par with the latest changes to ftrace (multi buffering
and the new function probes).
He also discovered and fixed some bugs in doing so. When pulling in his
patches, I also found a few minor bugs as well and fixed them.
This also includes a compile fix for some archs that select the ring buffer
but not tracing.
I based this off of the last patch you took from me that fixed the merge
conflict error, as that was the commit that had all the changes I needed
for this set of changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRjYnJAAoJEOdOSU1xswtMg9EH/iFs438FgrNMk2ZdQftmqcqA
cqcactHo1mmoHjAoLZT/oDBjEThhVUuqzMXrFRutSYcTh4PsQEC3arX0mpsC+T12
UEEV/tZS3TXH+GXEyrOit/O3kzntQcDHwJDV4+0n80IrJmw4IDZbnV3R8DWjS6wp
so+dq0A1pwehcG/upgpw1oTKsGv1G/p6vyf968B6W44icHEClLiph4JE2kzE6D3r
fzSpOLaQoBEvwIRf6xRKxi240VqIItXwfG7pwNpPpSC37gRLzm74zGr+Sj93/k1y
pARbZ/5XO7/pcVYQYupErRAoV5in+QMZ67k5G1vQIvyOS9r039catbQf/7PkzcI=
=EZCE
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing/kprobes update from Steven Rostedt:
"The majority of these changes are from Masami Hiramatsu bringing
kprobes up to par with the latest changes to ftrace (multi buffering
and the new function probes).
He also discovered and fixed some bugs in doing so. When pulling in
his patches, I also found a few minor bugs as well and fixed them.
This also includes a compile fix for some archs that select the ring
buffer but not tracing.
I based this off of the last patch you took from me that fixed the
merge conflict error, as that was the commit that had all the changes
I needed for this set of changes."
* tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Support soft-mode disabling
tracing/kprobes: Support ftrace_event_file base multibuffer
tracing/kprobes: Pass trace_probe directly from dispatcher
tracing/kprobes: Increment probe hit-count even if it is used by perf
tracing/kprobes: Use bool for retprobe checker
ftrace: Fix function probe when more than one probe is added
ftrace: Fix the output of enabled_functions debug file
ftrace: Fix locking in register_ftrace_function_probe()
tracing: Add helper function trace_create_new_event() to remove duplicate code
tracing: Modify soft-mode only if there's no other referrer
tracing: Indicate enabled soft-mode in enable file
tracing/kprobes: Fix to increment return event probe hit-count
ftrace: Cleanup regex_lock and ftrace_lock around hash updating
ftrace, kprobes: Fix a deadlock on ftrace_regex_lock
ftrace: Have ftrace_regex_write() return either read or error
tracing: Return error if register_ftrace_function_probe() fails for event_enable_func()
tracing: Don't succeed if event_enable_func did not register anything
ring-buffer: Select IRQ_WORK
Support soft-mode disabling on kprobe-based dynamic events.
Soft-disabling is just ignoring recording if the soft disabled
flag is set.
Link: http://lkml.kernel.org/r/20130509054454.30398.7237.stgit@mhiramat-M0-7522
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Increment probe hit-count for profiling even if it is used
by perf tool. Same thing has already done in trace_uprobe.
Link: http://lkml.kernel.org/r/20130509054436.30398.21133.stgit@mhiramat-M0-7522
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enabled_functions debugfs file was created to be able to see
what functions have been modified from nops to calling a tracer.
The current method uses the counter in the function record.
As when a ftrace_ops is registered to a function, its count
increases. But that doesn't mean that the function is actively
being traced. /proc/sys/kernel/ftrace_enabled can be set to zero
which would disable it, as well as something can go wrong and
we can think its enabled when only the counter is set.
The record's FTRACE_FL_ENABLED flag is set or cleared when its
function is modified. That is a much more accurate way of knowing
what function is enabled or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The iteration of the ftrace function list and the call to
ftrace_match_record() need to be protected by the ftrace_lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both __trace_add_new_event() and __trace_early_add_new_event() do
basically the same thing, except that __trace_add_new_event() does
a little more.
Instead of having duplicate code between the two functions, add
a helper function trace_create_new_event() that both can use.
This will help against having bugs fixed in one function but not
the other.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Modify soft-mode flag only if no other soft-mode referrer
(currently only the ftrace triggers) by using a reference
counter in each ftrace_event_file.
Without this fix, adding and removing several different
enable/disable_event triggers on the same event clear
soft-mode bit from the ftrace_event_file. This also
happens with a typo of glob on setting triggers.
e.g.
# echo vfs_symlink:enable_event:net:netif_rx > set_ftrace_filter
# cat events/net/netif_rx/enable
0*
# echo typo_func:enable_event:net:netif_rx > set_ftrace_filter
# cat events/net/netif_rx/enable
0
# cat set_ftrace_filter
#### all functions enabled ####
vfs_symlink:enable_event:net:netif_rx:unlimited
As above, we still have a trigger, but soft-mode is gone.
Link: http://lkml.kernel.org/r/20130509054429.30398.7464.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Indicate enabled soft-mode event as "1*" in "enable" file
for each event, because it can be soft-disabled when disable_event
trigger is hit.
Link: http://lkml.kernel.org/r/20130509054426.30398.28202.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cleanup regex_lock and ftrace_lock locking points around
ftrace_ops hash update code.
The new rule is that regex_lock protects ops->*_hash
read-update-write code for each ftrace_ops. Usually,
hash update is done by following sequence.
1. allocate a new local hash and copy the original hash.
2. update the local hash.
3. move(actually, copy) back the local hash to ftrace_ops.
4. update ftrace entries if needed.
5. release the local hash.
This makes regex_lock protect #1-#4, and ftrace_lock
to protect #3, #4 and adding and removing ftrace_ops from the
ftrace_ops_list. The ftrace_lock protects #3 as well because
the move functions update the entries too.
Link: http://lkml.kernel.org/r/20130509054421.30398.83411.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix a deadlock on ftrace_regex_lock which happens when setting
an enable_event trigger on dynamic kprobe event as below.
----
sh-2.05b# echo p vfs_symlink > kprobe_events
sh-2.05b# echo vfs_symlink:enable_event:kprobes:p_vfs_symlink_0 > set_ftrace_filter
=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #35 Not tainted
---------------------------------------------
sh/72 is trying to acquire lock:
(ftrace_regex_lock){+.+.+.}, at: [<ffffffff810ba6c1>] ftrace_set_hash+0x81/0x1f0
but task is already holding lock:
(ftrace_regex_lock){+.+.+.}, at: [<ffffffff810b7cbd>] ftrace_regex_write.isra.29.part.30+0x3d/0x220
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(ftrace_regex_lock);
lock(ftrace_regex_lock);
*** DEADLOCK ***
----
To fix that, this introduces a finer regex_lock for each ftrace_ops.
ftrace_regex_lock is too big of a lock which protects all
filter/notrace_hash operations, but it doesn't need to be a global
lock after supporting multiple ftrace_ops because each ftrace_ops
has its own filter/notrace_hash.
Link: http://lkml.kernel.org/r/20130509054417.30398.84254.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Added initialization flag and automate mutex initialization for
non ftrace.c ftrace_probes. ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_regex_write() reads the result of ftrace_process_regex()
which can sometimes return a positive number, only consider a
failure if the return is negative. Otherwise, it will skip possible
other registered probes and by returning a positive number that
wasn't read, it will confuse the user processes doing the writing.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
register_ftrace_function_probe() returns the number of functions
it registered, which can be zero, it can also return a negative number
if something went wrong. But event_enable_func() only checks for
the case that it didn't register anything, it needs to also check
for the case that something went wrong and return that error code
as well.
Added some comments about the code as well, to make it more
understandable.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Return 0 instead of the number of activated ftrace function probes if
event_enable_func succeeded and return an error code if it failed or
did not register any functions. But it currently returns the number
of registered functions and if it didn't register anything, it returns 0,
but that is considered success.
This also fixes the return value. As if it succeeds, it returns the
number of functions that were enabled, which is returned back to
the user in ftrace_regex_write (the write() return code). If only
one function is enabled, then the return code of the write is one,
and this can confuse the user program in thinking it only wrote 1
byte.
Link: http://lkml.kernel.org/r/20130509054413.30398.55650.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Rewrote change log to reflect that this fixes two bugs - SR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull block driver updates from Jens Axboe:
"It might look big in volume, but when categorized, not a lot of
drivers are touched. The pull request contains:
- mtip32xx fixes from Micron.
- A slew of drbd updates, this time in a nicer series.
- bcache, a flash/ssd caching framework from Kent.
- Fixes for cciss"
* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
bcache: Use bd_link_disk_holder()
bcache: Allocator cleanup/fixes
cciss: bug fix to prevent cciss from loading in kdump crash kernel
cciss: add cciss_allow_hpsa module parameter
drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
mtip32xx: Workaround for unaligned writes
bcache: Make sure blocksize isn't smaller than device blocksize
bcache: Fix merge_bvec_fn usage for when it modifies the bvm
bcache: Correctly check against BIO_MAX_PAGES
bcache: Hack around stuff that clones up to bi_max_vecs
bcache: Set ra_pages based on backing device's ra_pages
bcache: Take data offset from the bdev superblock.
mtip32xx: mtip32xx: Disable TRIM support
mtip32xx: fix a smatch warning
bcache: Disable broken btree fuzz tester
bcache: Fix a format string overflow
bcache: Fix a minor memory leak on device teardown
bcache: Documentation updates
bcache: Use WARN_ONCE() instead of __WARN()
bcache: Add missing #include <linux/prefetch.h>
...
As the wake up logic for waiters on the buffer has been moved
from the tracing code to the ring buffer, it requires also adding
IRQ_WORK as the wake up code is performed via irq_work.
This fixes compile breakage when a user of the ring buffer is selected
but tracing and irq_work are not.
Link http://lkml.kernel.org/r/20130503115332.GT8356@rric.localhost
Cc: Arnd Bergmann <arnd@arndb.de>
Reported-by: Robert Richter <rric@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf updates from Ingo Molnar:
"Features:
- Add "uretprobes" - an optimization to uprobes, like kretprobes are
an optimization to kprobes. "perf probe -x file sym%return" now
works like kretprobes. By Oleg Nesterov.
- Introduce per core aggregation in 'perf stat', from Stephane
Eranian.
- Add memory profiling via PEBS, from Stephane Eranian.
- Event group view for 'annotate' in --stdio, --tui and --gtk, from
Namhyung Kim.
- Add support for AMD NB and L2I "uncore" counters, by Jacob Shin.
- Add Ivy Bridge-EP uncore support, by Zheng Yan
- IBM zEnterprise EC12 oprofile support patchlet from Robert Richter.
- Add perf test entries for checking breakpoint overflow signal
handler issues, from Jiri Olsa.
- Add perf test entry for for checking number of EXIT events, from
Namhyung Kim.
- Add perf test entries for checking --cpu in record and stat, from
Jiri Olsa.
- Introduce perf stat --repeat forever, from Frederik Deweerdt.
- Add --no-demangle to report/top, from Namhyung Kim.
- PowerPC fixes plus a couple of cleanups/optimizations in uprobes
and trace_uprobes, by Oleg Nesterov.
Various fixes and refactorings:
- Fix dependency of the python binding wrt libtraceevent, from
Naohiro Aota.
- Simplify some perf_evlist methods and to allow 'stat' to share code
with 'record' and 'trace', by Arnaldo Carvalho de Melo.
- Remove dead code in related to libtraceevent integration, from
Namhyung Kim.
- Revert "perf sched: Handle PERF_RECORD_EXIT events" to get 'perf
sched lat' back working, by Arnaldo Carvalho de Melo
- We don't use Newt anymore, just plain libslang, by Arnaldo Carvalho
de Melo.
- Kill a bunch of die() calls, from Namhyung Kim.
- Fix build on non-glibc systems due to libio.h absence, from Cody P
Schafer.
- Remove some perf_session and tracing dead code, from David Ahern.
- Honor parallel jobs, fix from Borislav Petkov
- Introduce tools/lib/lk library, initially just removing duplication
among tools/perf and tools/vm. from Borislav Petkov
... and many more I missed to list, see the shortlog and git log for
more details."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (136 commits)
perf/x86/intel/P4: Robistify P4 PMU types
perf/x86/amd: Fix AMD NB and L2I "uncore" support
perf/x86/amd: Remove old-style NB counter support from perf_event_amd.c
perf/x86: Check all MSRs before passing hw check
perf/x86/amd: Add support for AMD NB and L2I "uncore" counters
perf/x86/intel: Add Ivy Bridge-EP uncore support
perf/x86/intel: Fix SNB-EP CBO and PCU uncore PMU filter management
perf/x86: Avoid kfree() in CPU_{STARTING,DYING}
uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
uprobes/tracing: Change create_trace_uprobe() to support uretprobes
uprobes/tracing: Make seq_printf() code uretprobe-friendly
uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
uprobes/tracing: Generalize struct uprobe_trace_entry_head
uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
uprobes/tracing: Kill the pointless seq_print_ip_sym() call
uprobes/tracing: Kill the pointless task_pt_regs() calls
...
During the 3.10 merge, a conflict happened and the resolution was
almost, but not quite, correct. An if statement was reversed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[ Duh. That was just silly of me - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Along with the usual minor fixes and clean ups there are a few major
changes with this pull request.
1) Multiple buffers for the ftrace facility
This feature has been requested by many people over the last few years.
I even heard that Google was about to implement it themselves. I finally
had time and cleaned up the code such that you can now create multiple
instances of the ftrace buffer and have different events go to different
buffers. This way, a low frequency event will not be lost in the noise
of a high frequency event.
Note, currently only events can go to different buffers, the tracers
(ie. function, function_graph and the latency tracers) still can only
be written to the main buffer.
2) The function tracer triggers have now been extended.
The function tracer had two triggers. One to enable tracing when a
function is hit, and one to disable tracing. Now you can record a
stack trace on a single (or many) function(s), take a snapshot of the
buffer (copy it to the snapshot buffer), and you can enable or disable
an event to be traced when a function is hit.
3) A perf clock has been added.
A "perf" clock can be chosen to be used when tracing. This will cause
ftrace to use the same clock as perf uses, and hopefully this will make
it easier to interleave the perf and ftrace data for analysis.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRfnTPAAoJEOdOSU1xswtMqYYH/1WIdrwXmxHflErnYkCIr3sU
QtYae2K5A1HcgiqOvRJrdWMOt016iMx5CaQQyBFM1vvMiPY0sTWRmwNxDfZzz9LN
10jRvWEzZSLtzl+a9mkFWLEpr5nR/QODOxkWFCnRWscp46sp04LSTxGDYsOnPQZB
sam/AQ1h4xA+DqDBChm9BDEUEPorGleTlN54LBaCGgSFGvrbF+eAg2s4vHNAQAvQ
8d5xjSE9zC7J+FqbVxvJTbKI3+EqKL6hMsJKsKfi0SI+FuxBaFMSltXck5zKyTI4
HpNJzXCmw+v90Tju7oMkPHh6RTbESPCHoGU+wqE52fM6m7oScVeuI/kfc6USwU4=
=W1n+
-----END PGP SIGNATURE-----
Merge tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Along with the usual minor fixes and clean ups there are a few major
changes with this pull request.
1) Multiple buffers for the ftrace facility
This feature has been requested by many people over the last few
years. I even heard that Google was about to implement it themselves.
I finally had time and cleaned up the code such that you can now
create multiple instances of the ftrace buffer and have different
events go to different buffers. This way, a low frequency event will
not be lost in the noise of a high frequency event.
Note, currently only events can go to different buffers, the tracers
(ie function, function_graph and the latency tracers) still can only
be written to the main buffer.
2) The function tracer triggers have now been extended.
The function tracer had two triggers. One to enable tracing when a
function is hit, and one to disable tracing. Now you can record a
stack trace on a single (or many) function(s), take a snapshot of the
buffer (copy it to the snapshot buffer), and you can enable or disable
an event to be traced when a function is hit.
3) A perf clock has been added.
A "perf" clock can be chosen to be used when tracing. This will cause
ftrace to use the same clock as perf uses, and hopefully this will
make it easier to interleave the perf and ftrace data for analysis."
* tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (82 commits)
tracepoints: Prevent null probe from being added
tracing: Compare to 1 instead of zero for is_signed_type()
tracing: Remove obsolete macro guard _TRACE_PROFILE_INIT
ftrace: Get rid of ftrace_profile_bits
tracing: Check return value of tracing_init_dentry()
tracing: Get rid of unneeded key calculation in ftrace_hash_move()
tracing: Reset ftrace_graph_filter_enabled if count is zero
tracing: Fix off-by-one on allocating stat->pages
kernel: tracing: Use strlcpy instead of strncpy
tracing: Update debugfs README file
tracing: Fix ftrace_dump()
tracing: Rename trace_event_mutex to trace_event_sem
tracing: Fix comment about prefix in arch_syscall_match_sym_name()
tracing: Convert trace_destroy_fields() to static
tracing: Move find_event_field() into trace_events.c
tracing: Use TRACE_MAX_PRINT instead of constant
tracing: Use pr_warn_once instead of open coded implementation
ring-buffer: Add ring buffer startup selftest
tracing: Bring Documentation/trace/ftrace.txt up to date
tracing: Add "perf" trace_clock
...
Conflicts:
kernel/trace/ftrace.c
kernel/trace/trace.c
Conflicts:
arch/x86/kernel/cpu/perf_event_intel.c
Merge in the latest fixes before applying new patches, resolve the conflict.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts commit 3a366e614d.
Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.
Jens says:
"It's not quite clear why that is yet, so I think we should just revert
the commit for 3.9 final (which I'm assuming is pretty close).
The wifi is crap at the LSF hotel, so sending this email instead of
queueing up a revert and pull request."
Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change uprobe_perf_print()
to return if hlist_empty(call->perf_events).
Note: this is not uprobe-specific, we can change other users too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
uprobe_perf_print() passes addr=ip to perf_trace_buf_submit() for
no reason. This sets perf_sample_data->addr for PERF_SAMPLE_ADDR,
we already have perf_sample_data->ip initialized if PERF_SAMPLE_IP.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Finally change create_trace_uprobe() to check if argv[0][0] == 'r'
and pass the correct "is_ret" to alloc_trace_uprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change probes_seq_show() and print_uprobe_event() to check
is_ret_probe() and print the correct data.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change uprobe_event_define_fields(), and __set_print_fmt() to check
is_ret_probe() and use the appropriate format/fields.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change uprobe_trace_print() and uprobe_perf_print() to check
is_ret_probe() and fill ring_buffer_event accordingly.
Also change uprobe_trace_func() and uprobe_perf_func() to not
_print() if is_ret_probe() is true. Note that we keep ->handler()
nontrivial even for uretprobe, we need this for filtering and for
other potential extensions.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Create the new functions we need to support uretprobes, and change
alloc_trace_uprobe() to initialize consumer.ret_handler if the new
"is_ret" argument is true. Curently this argument is always false,
so the new code is never called and is_ret_probe(tu) is false too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Extract the output code from uprobe_trace_func() and uprobe_perf_func()
into the new helpers, they will be used by ->ret_handler() too. We also
add the unused "unsigned long func" argument in advance, to simplify the
next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
struct uprobe_trace_entry_head has a single member for reporting,
"unsigned long ip". If we want to support uretprobes we need to
create another struct which has "func" and "ret_ip" and duplicate
a lot of functions, like trace_kprobe.c does.
To avoid this copy-and-paste horror we turn ->ip into ->vaddr[]
and add couple of trivial helpers to calculate sizeof/data. This
uglifies the code a bit, but this allows us to avoid a lot more
complications later, when we add the support for ret-probes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
uprobe_trace_func() is never called with irqs or preemption
disabled, no need to ask preempt_count() or local_save_flags().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Anton Arapov <anton@redhat.com>
seq_print_ip_sym(ip) in print_uprobe_event() is pointless,
kallsyms_lookup(ip) can not resolve a user-space address.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
uprobe_trace_func() and uprobe_perf_func() do not need task_pt_regs(),
we already have "struct pt_regs *regs".
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
It seems that function profiler's hash size is fixed at 1024. Add and
use FTRACE_PROFILE_HASH_BITS instead and update hash size macro.
Link: http://lkml.kernel.org/r/1365551750-4504-1-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_graph_count can be decreased with a "!" pattern, so that
the enabled flag should be updated too.
Link: http://lkml.kernel.org/r/1365663698-2413-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_filter_lseek is now used with ftrace_pid_fops, it needs to
be moved out of the #ifdef CONFIG_DYNAMIC_FTRACE section as the
ftrace_pid_fops is defined when DYNAMIC_FTRACE is not.
Cc: stable@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently set_ftrace_pid and set_graph_function files use seq_lseek
for their fops. However seq_open() is called only for FMODE_READ in
the fops->open() so that if an user tries to seek one of those file
when she open it for writing, it sees NULL seq_file and then panic.
It can be easily reproduced with following command:
$ cd /sys/kernel/debug/tracing
$ echo 1234 | sudo tee -a set_ftrace_pid
In this example, GNU coreutils' tee opens the file with fopen(, "a")
and then the fopen() internally calls lseek().
Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On the failure path, stat->start and stat->pages will refer same page.
So it'll attempt to free the same page again and get kernel panic.
Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use strlcpy() instead of strncpy() as it will always add a '\0'
to the end of the string even if the buffer is smaller than what
is being copied.
Link: http://lkml.kernel.org/r/51624254.30301@asianux.com
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing control loop used by perf spits out a warning
if the called function is not a control function. This is because
the control function references a per cpu allocated data structure
on struct ftrace_ops that is not allocated for other types of
functions.
commit 0a016409e4 "ftrace: Optimize the function tracer list loop"
Had an optimization done to all function tracing loops to optimize
for a single registered ops. Unfortunately, this allows for a slight
race when tracing starts or ends, where the stub function might be
called after the current registered ops is removed. In this case we
get the following dump:
root# perf stat -e ftrace:function sleep 1
[ 74.339105] WARNING: at include/linux/ftrace.h:209 ftrace_ops_control_func+0xde/0xf0()
[ 74.349522] Hardware name: PRIMERGY RX200 S6
[ 74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspk
r microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm li
bahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[ 74.446233] Pid: 1377, comm: perf Tainted: G W 3.9.0-rc1 #1
[ 74.453458] Call Trace:
[ 74.456233] [<ffffffff81062e3f>] warn_slowpath_common+0x7f/0xc0
[ 74.462997] [<ffffffff810fbc60>] ? rcu_note_context_switch+0xa0/0xa0
[ 74.470272] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
[ 74.478117] [<ffffffff81062e9a>] warn_slowpath_null+0x1a/0x20
[ 74.484681] [<ffffffff81102ede>] ftrace_ops_control_func+0xde/0xf0
[ 74.491760] [<ffffffff8162f400>] ftrace_call+0x5/0x2f
[ 74.497511] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
[ 74.503486] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
[ 74.509500] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
[ 74.516088] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.522268] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
[ 74.528837] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
[ 74.536696] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.542878] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
[ 74.548869] [<ffffffff81105c67>] unregister_ftrace_function+0x27/0x50
[ 74.556243] [<ffffffff8111eadf>] perf_ftrace_event_register+0x9f/0x140
[ 74.563709] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.569887] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
[ 74.575898] [<ffffffff8111e94e>] perf_trace_destroy+0x2e/0x50
[ 74.582505] [<ffffffff81127ba9>] tp_perf_event_destroy+0x9/0x10
[ 74.589298] [<ffffffff811295d0>] free_event+0x70/0x1a0
[ 74.595208] [<ffffffff8112a579>] perf_event_release_kernel+0x69/0xa0
[ 74.602460] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.608667] [<ffffffff8112a640>] put_event+0x90/0xc0
[ 74.614373] [<ffffffff8112a740>] perf_release+0x10/0x20
[ 74.620367] [<ffffffff811a3044>] __fput+0xf4/0x280
[ 74.625894] [<ffffffff811a31de>] ____fput+0xe/0x10
[ 74.631387] [<ffffffff81083697>] task_work_run+0xa7/0xe0
[ 74.637452] [<ffffffff81014981>] do_notify_resume+0x71/0xb0
[ 74.643843] [<ffffffff8162fa92>] int_signal+0x12/0x17
To fix this a new ftrace_ops flag is added that denotes the ftrace_list_end
ftrace_ops stub as just that, a stub. This flag is now checked in the
control loop and the function is not called if the flag is set.
Thanks to Jovi for not just reporting the bug, but also pointing out
where the bug was in the code.
Link: http://lkml.kernel.org/r/514A8855.7090402@redhat.com
Link: http://lkml.kernel.org/r/1364377499-1900-15-git-send-email-jovi.zhangwei@huawei.com
Tested-by: WANG Chao <chaowang@redhat.com>
Reported-by: WANG Chao <chaowang@redhat.com>
Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we reenable ftrace via syctl, we currently set ftrace_trace_function
based on the previous simplistic algorithm. This is inconsistent with
what update_ftrace_function does. So better call that helper instead.
Link: http://lkml.kernel.org/r/5151D26F.1070702@siemens.com
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The commit 34600f0e9 "tracing: Fix race with max_tr and changing tracers"
fixed the updating of the main buffers with the race of changing
tracers, but left out the fix to the updating of just a per cpu buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For NUL terminated string we always need to set '\0' at the end.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: rostedt@goodmis.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/516243B7.9020405@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For NUL terminated string we always need to set '\0' at the end.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: rostedt@goodmis.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/51624254.30301@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Exported so it can be used by bcache's tracepoints
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
Update the README file in debugfs/tracing to something more useful.
What's currently in the file is very old and what it shows doesn't
have much use. Heck, it tells you how to mount debugfs! But to read
this file you would have already needed to mount it.
Replace the file with current up-to-date information. It's rather
limited, but what do you expect from a pseudo README file.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_dump() had a lot of issues. What ftrace_dump() does, is when
ftrace_dump_on_oops is set (via a kernel parameter or sysctl), it
will dump out the ftrace buffers to the console when either a oops,
panic, or a sysrq-z occurs.
This was written a long time ago when ftrace was fragile to recursion.
But it wasn't written well even for that.
There's a possible deadlock that can occur if a ftrace_dump() is happening
and an NMI triggers another dump. This is because it grabs a lock
before checking if the dump ran.
It also totally disables ftrace, and tracing for no good reasons.
As the ring_buffer now checks if it is read via a oops or NMI, where
there's a chance that the buffer gets corrupted, it will disable
itself. No need to have ftrace_dump() do the same.
ftrace_dump() is now cleaned up where it uses an atomic counter to
make sure only one dump happens at a time. A simple atomic_inc_return()
is enough that is needed for both other CPUs and NMIs. No need for
a spinlock, as if one CPU is running the dump, no other CPU needs
to do it too.
The tracing_on variable is turned off and not turned on. The original
code did this, but it wasn't pretty. By just disabling this variable
we get the result of not seeing traces that happen between crashes.
For sysrq-z, it doesn't get turned on, but the user can always write
a '1' to the tracing_on file. If they are using sysrq-z, then they should
know about tracing_on.
The new code is much easier to read and less error prone. No more
deadlock possibility when an NMI triggers here.
Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_event_mutex is an rw semaphore now, not a mutex, change the name.
Link: http://lkml.kernel.org/r/513D843B.40109@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ Forward ported to my new code ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ppc64 has its own syscall prefix like ".SyS" or ".sys". Make the
comment in arch_syscall_match_sym_name() more understandable.
Link: http://lkml.kernel.org/r/513D842F.40205@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_destroy_fields() is not used outside of the file. It can be
a static function.
Link: http://lkml.kernel.org/r/513D842A.2000907@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
By moving find_event_field() and trace_find_field() into trace_events.c,
the ftrace_common_fields list and trace_get_fields() can become local to
the trace_events.c file.
find_event_field() is renamed to trace_find_event_field() to conform to
the tracing global function names.
Link: http://lkml.kernel.org/r/513D8426.9070109@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ rostedt: Modified trace_find_field() to trace_find_event_field() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
TRACE_MAX_PRINT macro is defined, but is not used.
Link: http://lkml.kernel.org/r/513D8421.4070404@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use pr_warn_once, instead of making an open coded implementation.
Link: http://lkml.kernel.org/r/513D8419.20400@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When testing my large changes to the ftrace system, there was
a bug that looked like the ring buffer was dropping events.
I wrote up a quick integrity checker of the ring buffer to
see if it was.
Although the bug ended up being something stupid I did in ftrace,
and had nothing to do with the ring buffer, I figured if I spent
the time to write up this test, I might as well include it in the
kernel.
I cleaned it up a bit, as the original version was rather ugly.
Not saying this version is pretty, but it's a beauty queen
compared to what I original wrote.
To enable the start up test, set CONFIG_RING_BUFFER_STARTUP_TEST.
Note, it runs for 10 seconds, so it will slow your boot time
by at least 10 more seconds.
What it does is documented in both the comments and the Kconfig
help.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function trace_clock() calls "local_clock()" which is exactly
the same clock that perf uses. I'm not sure why perf doesn't call
trace_clock(), as trace_clock() doesn't have any users.
But now it does. As trace_clock() calls local_clock() like perf does,
I added the trace_clock "perf" option that uses trace_clock().
Now the ftrace buffers can use the same clock as perf uses. This
will be useful when perf starts reading the ftrace buffers, and will
be able to interleave them with the same clock data.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a simple trace clock called "uptime" for those that are
interested in the uptime of the trace. It uses jiffies as that's
the safest method, as other uptime clocks grab seq locks, which could
cause a deadlock if taken from an event or function tracer.
Requested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the only way to stop the latency tracers from doing function
tracing is to fully disable the function tracer from the proc file
system:
echo 0 > /proc/sys/kernel/ftrace_enabled
This is a big hammer approach as it disables function tracing for
all users. This includes kprobes, perf, stack tracer, etc.
Instead, create a function-trace option that the latency tracers can
check to determine if it should enable function tracing or not.
This option can be set or cleared even while the tracer is active
and the tracers will disable or enable function tracing depending
on how the option was set.
Instead of using the proc file, disable latency function tracing with
echo 0 > /debug/tracing/options/function-trace
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the depth reported in the stack tracer stack_trace file
does not match the stack_max_size file. This is because the stack_max_size
includes the overhead of stack tracer itself while the depth does not.
The first time a max is triggered, a calculation is not performed that
figures out the overhead of the stack tracer and subtracts it from
the stack_max_size variable. The overhead is stored and is subtracted
from the reported stack size for comparing for a new max.
Now the stack_max_size corresponds to the reported depth:
# cat stack_max_size
4640
# cat stack_trace
Depth Size Location (48 entries)
----- ---- --------
0) 4640 32 _raw_spin_lock+0x18/0x24
1) 4608 112 ____cache_alloc+0xb7/0x22d
2) 4496 80 kmem_cache_alloc+0x63/0x12f
3) 4416 16 mempool_alloc_slab+0x15/0x17
[...]
While testing against and older gcc on x86 that uses mcount instead
of fentry, I found that pasing in ip + MCOUNT_INSN_SIZE let the
stack trace show one more function deep which was missing before.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When gcc 4.6 on x86 is used, the function tracer will use the new
option -mfentry which does a call to "fentry" at every function
instead of "mcount". The significance of this is that fentry is
called as the first operation of the function instead of the mcount
usage of being called after the stack.
This causes the stack tracer to show some bogus results for the size
of the last function traced, as well as showing "ftrace_call" instead
of the function. This is due to the stack frame not being set up
by the function that is about to be traced.
# cat stack_trace
Depth Size Location (48 entries)
----- ---- --------
0) 4824 216 ftrace_call+0x5/0x2f
1) 4608 112 ____cache_alloc+0xb7/0x22d
2) 4496 80 kmem_cache_alloc+0x63/0x12f
The 216 size for ftrace_call includes both the ftrace_call stack
(which includes the saving of registers it does), as well as the
stack size of the parent.
To fix this, if CC_USING_FENTRY is defined, then the stack_tracer
will reserve the first item in stack_dump_trace[] array when
calling save_stack_trace(), and it will fill it in with the parent ip.
Then the code will look for the parent pointer on the stack and
give the real size of the parent's stack pointer:
# cat stack_trace
Depth Size Location (14 entries)
----- ---- --------
0) 2640 48 update_group_power+0x26/0x187
1) 2592 224 update_sd_lb_stats+0x2a5/0x4ac
2) 2368 160 find_busiest_group+0x31/0x1f1
3) 2208 256 load_balance+0xd9/0x662
I'm Cc'ing stable, although it's not urgent, as it only shows bogus
size for item #0, the rest of the trace is legit. It should still be
corrected in previous stable releases.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use the stack of stack_trace_call() instead of check_stack() as
the test pointer for max stack size. It makes it a bit cleaner
and a little more accurate.
Adding stable, as a later fix depends on this patch.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Altough the trace_dump_stack() already skips three functions in
the call to stack trace, which gets the stack trace to start
at the caller of the function, the caller may want to skip some
more too (as it may have helper functions).
Add a skip argument to the trace_dump_stack() that lets the caller
skip back tracing functions that it doesn't care about.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add triggers to function tracer that lets an event get enabled or
disabled when a function is called:
format is:
<function>:enable_event:<system>:<event>[:<count>]
<function>:disable_event:<system>:<event>[:<count>]
echo 'schedule:enable_event:sched:sched_switch' > /debug/tracing/set_ftrace_filter
Every time schedule is called, it will enable the sched_switch event.
echo 'schedule:disable_event:sched:sched_switch:2' > /debug/tracing/set_ftrace_filter
The first two times schedule is called while the sched_switch
event is enabled, it will disable it. It will not count for a time
that the event is already disabled (or enabled for enable_event).
[ fixed return without mutex_unlock() - thanks to Dan Carpenter and smatch ]
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to let triggers enable or disable events, we need a 'soft'
method for doing so. For example, if a function probe is added that
lets a user enable or disable events when a function is called, that
change must be done without taking locks or a mutex, and definitely
it can't sleep. But the full enabling of a tracepoint is expensive.
By adding a 'SOFT_DISABLE' flag, and converting the flags to be updated
without the protection of a mutex (using set/clear_bit()), this soft
disable flag can be used to allow critical sections to enable or disable
events from being traced (after the event has been placed into "SOFT_MODE").
Some caveats though: The comm recorder (to map pids with a comm) can not
be soft disabled (yet). If you disable an event with with a "soft"
disable and wait a while before reading the trace, the comm cache may be
replaced and you'll get a bunch of <...> for comms in the trace.
Reading the "enable" file for an event that is disabled will now give
you "0*" where the '*' denotes that the tracepoint is still active but
the event itself is "disabled".
[ fixed _BIT used in & operation : thanks to Dan Carpenter and smatch ]
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The entries to the probe hash must be freed after a synchronize_sched()
after the entry has been removed from the hash.
As the entries are registered with ops that may have their own callbacks,
and these callbacks may sleep, we can not use call_rcu_sched() because
the rcu callbacks registered with that are called from a softirq context.
Instead of using call_rcu_sched(), manually save the entries on a free_list
and at the end of the loop that removes the entries, do a synchronize_sched()
and then go through the free_list, freeing the entries.
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a function probe is created, each function that the probe is
attached to, a "callback" method is called. On release of the probe,
each function entry calls the "free" method.
First, "callback" is a confusing name and does not really match what
it does. Callback sounds like it will be called when the probe
triggers. But that's not the case. This is really an "init" function,
so lets rename it as such.
Secondly, both "init" and "free" do not pass enough information back
to the handlers. Pass back the ops, ip and data for each time the
method is called. We have the information, might as well use it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
echo 'schedule:snapshot:1' > /debug/tracing/set_ftrace_filter
This will cause the scheduler to trigger a snapshot the next time
it's called (you can use any function that's not called by NMI).
Even though it triggers only once, you still need to remove it with:
echo '!schedule:snapshot:0' > /debug/tracing/set_ftrace_filter
The :1 can be left off for the first command:
echo 'schedule:snapshot' > /debug/tracing/set_ftrace_filter
But this will cause all calls to schedule to trigger a snapshot.
This must be removed without the ':0'
echo '!schedule:snapshot' > /debug/tracing/set_ftrace_filter
As adding a "count" is a different operation (internally).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add alloc_snapshot() and free_snapshot() to allocate and free the
snapshot buffer respectively, and use these to remove duplicate
code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the function probe enables all functions and runs a "hash"
against every function call to see if it should call a probe. This
is extremely wasteful.
Note, a probe is something like:
echo schedule:traceoff > /debug/tracing/set_ftrace_filter
When schedule is called, the probe will disable tracing. But currently,
it has a call back for *all* functions, and checks to see if the
called function is the probe that is needed.
The probe function has been created before ftrace was rewritten to
allow for more than one "op" to be registered by the function tracer.
When probes were created, it couldn't limit the functions without also
limiting normal function calls. But now we can, it's about time
to update the probe code.
Todo, have separate ops for different entries. That is, assign
a ftrace_ops per probe, instead of one op for all probes. But
as there's not many probes assigned, this may not be that urgent.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing probes that trigger traceon or traceoff can be
set to unlimited, or given a count of # of times to execute.
By separating these two types of probes, we can then use the dynamic
ftrace function filtering directly, and remove the brute force
"check if this function called is my probe" routines in ftrace.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only thing ftrace_trace_onoff_unreg() does is to do a strcmp()
against the cmd parameter to determine what op to unregister. But
this compare is also done after the location that this function is
called (and returns). By moving the check for '!' to unregister after
the strcmp(), the callback function itself can just do the unregister
and we can get rid of the helper function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove some duplicate code and replace it with a helper function.
This makes the code a it cleaner.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add EXPORT_SYMBOL_GPL() to let the tracing_snapshot() functions be
called from modules.
Also add a test to see if the snapshot was called from NMI context
and just warn in the tracing buffer if so, and return.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a few places that ftrace uses trace_printk() for internal
use, but this requires context (normal, softirq, irq, NMI) buffers
to keep things lockless. But the trace_puts() does not, as it can
write the string directly into the ring buffer. Make a internal helper
for trace_puts() and have the internal functions use that.
This way the extra context buffers are not used.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_printk() is extremely fast and is very handy as it can be
used in any context (including NMIs!). But it still requires scanning
the fmt string for parsing the args. Even the trace_bprintk() requires
a scan to know what args will be saved, although it doesn't copy the
format string itself.
Several times trace_printk() has no args, and wastes cpu cycles scanning
the fmt string.
Adding trace_puts() allows the developer to use an even faster
tracing method that only saves the pointer to the string in the
ring buffer without doing any format parsing at all. This will
help remove even more of the "Heisenbug" effect, when debugging.
Also fixed up the F_printk()s for the ftrace internal bprint and print events.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The changce to add the trace_buffer struct to have the trace array
have both the main buffer and max buffer broke the branch tracer
because the change did not update that code. As the branch tracer
adds a significant amount of overhead, and must be selected via
a selection (not a allyesconfig) it was missed in testing.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If debugging the kernel, and the developer wants to use
tracing_snapshot() in places where tracing_snapshot_alloc() may
be difficult (or more likely, the developer is lazy and doesn't
want to bother with tracing_snapshot_alloc() at all), then adding
alloc_snapshot
to the kernel command line parameter will tell ftrace to allocate
the snapshot buffer (if configured) when it allocates the main
tracing buffer.
I also noticed that ring_buffer_expanded and tracing_selftest_disabled
had inconsistent use of boolean "true" and "false" with "0" and "1".
I cleaned that up too.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the tracing startup selftest code into its own function and
when not enabled, always have that function succeed.
This makes the register_tracer() function much more readable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring buffer updates when done while the ring buffer is active,
needs to be completed on the CPU that is used for the ring buffer
per_cpu buffer. To accomplish this, schedule_work_on() is used to
schedule work on the given CPU.
Now there's no reason to use schedule_work_on() if the process
doing the update happens to be on the CPU that it is processing.
It has already filled the requirement. Instead, just do the work
and continue.
This is needed for tracing_snapshot_alloc() where it may be called
really early in boot, where the work queues have not been set up yet.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new snapshot feature is quite handy. It's a way for the user
to take advantage of the spare buffer that, until then, only
the latency tracers used to "snapshot" the buffer when it hit
a max latency. Now users can trigger a "snapshot" manually when
some condition is hit in a program. But a snapshot currently can
not be triggered by a condition inside the kernel.
With the addition of tracing_snapshot() and tracing_snapshot_alloc(),
snapshots can now be taking when a condition is hit, and the
developer wants to snapshot the case without stopping the trace.
Note, any snapshot will overwrite the old one, so take care
in how this is done.
These new functions are to be used like tracing_on(), tracing_off()
and trace_printk() are. That is, they should never be called
in the mainline Linux kernel. They are solely for the purpose
of debugging.
The tracing_snapshot() will not allocate a buffer, but it is
safe to be called from any context (except NMIs). But if a
snapshot buffer isn't allocated when it is called, it will write
to the live buffer, complaining about the lack of a snapshot
buffer, and then stop tracing (giving you the "permanent snapshot").
tracing_snapshot_alloc() will allocate the snapshot buffer if
it was not already allocated and then take the snapshot. This routine
*may sleep*, and must be called from context that can sleep.
The allocation is done with GFP_KERNEL and not atomic.
If you need a snapshot in an atomic context, say in early boot,
then it is best to call the tracing_snapshot_alloc() before then,
where it will allocate the buffer, and then you can use the
tracing_snapshot() anywhere you want and still get snapshots.
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a ref count to the trace_array structure and prevent removal
of instances that have open descriptors.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the per_cpu directory to the created tracing instances:
cd /sys/kernel/debug/tracing/instances
mkdir foo
ls foo/per_cpu/cpu0
buffer_size_kb snapshot_raw trace trace_pipe_raw
snapshot stats trace_pipe
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the "snapshot" file to the the multi-buffer instances.
cd /sys/kernel/debug/tracing/instances
mkdir foo
ls foo
buffer_size_kb buffer_total_size_kb events free_buffer set_event
snapshot trace trace_clock trace_marker trace_options trace_pipe
tracing_on
cat foo/snapshot
# tracer: nop
#
#
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a bit of duplicate code in creating the trace buffers for
the normal trace buffer and the max trace buffer among the instances
and the main global_trace. This code can be consolidated and cleaned
up a bit making the code cleaner and more readable as well as less
duplication.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The snapshot buffer belongs to the trace array not the tracer that is
running. The trace array should be the data structure that keeps track
of whether or not the snapshot buffer is allocated, not the tracer
desciptor. Having the trace array keep track of it makes modifications
so much easier.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a 'snapshot_raw' per_cpu file that allows tools to read the raw
binary data of the snapshot buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the preempt or irq latency tracers are enabled, they require
the ring buffer to be able to swap the per cpu sub buffers between
two main buffers. This adds a slight overhead to tracing as the
trace recording needs to perform some checks to synchronize
between recording and swaps that might be happening on other CPUs.
The config RING_BUFFER_ALLOW_SWAP is set when a user of the ring
buffer needs the "swap cpu" feature, otherwise the extra checks
are not implemented and removed from the tracing overhead.
The snapshot feature will swap per CPU if the RING_BUFFER_ALLOW_SWAP
config is set. But that only gets set by things like OPROFILE
and the irqs and preempt latency tracers.
This config is added to let the user decide to include this feature
with the snapshot agnostic from whether or not another user of
the ring buffer sets this config.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the snapshot file into the per_cpu tracing directories to allow
them to be read for an individual cpu. This also allows to clear
an individual cpu from the snapshot buffer.
If the kernel allows it (CONFIG_RING_BUFFER_ALLOW_SWAP is set), then
echoing in '1' into one of the per_cpu snapshot files will do an
individual cpu buffer swap instead of the entire file.
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the way the latency tracers and snapshot feature works
is to have a separate trace_array called "max_tr" that holds the
snapshot buffer. For latency tracers, this snapshot buffer is used
to swap the running buffer with this buffer to save the current max
latency.
The only items needed for the max_tr is really just a copy of the buffer
itself, the per_cpu data pointers, the time_start timestamp that states
when the max latency was triggered, and the cpu that the max latency
was triggered on. All other fields in trace_array are unused by the
max_tr, making the max_tr mostly bloat.
This change removes the max_tr completely, and adds a new structure
called trace_buffer, that holds the buffer pointer, the per_cpu data
pointers, the time_start timestamp, and the cpu where the latency occurred.
The trace_array, now has two trace_buffers, one for the normal trace and
one for the max trace or snapshot. By doing this, not only do we remove
the bloat from the max_trace but the instances of traces can now use
their own snapshot feature and not have just the top level global_trace have
the snapshot feature and latency tracers for itself.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The snapshot utility is extremely useful, and does not add any more
overhead in memory when another latency tracer is enabled. They use
the snapshot underneath. There's no reason to hide the snapshot file
when a latency tracer has been enabled in the kernel.
If any of the latency tracers (irq, preempt or wakeup) is enabled
then also select the snapshot facility.
Note, snapshot can be enabled without the latency tracers enabled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently we do not know what buffer a module event was enabled in.
On unload, it is safest to clear all buffer instances, not just the
top level buffer.
Todo: Clear only the buffer that the event was used in. The
infrastructure is there to do this, but it makes the code a bit
more complex. Lets get the current code vetted before we add that.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, when a module with events is unloaded, the trace buffer is
cleared. This is just a safety net in case the module might have some
strange callback when its event is outputted. But there's no reason
to reset the buffer if the module didn't have any of its events traced.
Add a flag to the event "call" structure called WAS_ENABLED and gets set
when the event is ever enabled, and this flag never gets cleared. When a
module gets unloaded, if any of its events have this flag set, then the
trace buffer will get cleared.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The move of blocked readers to the ring buffer left out the
init of the wait queue that is used. Tests missed this due to running
stress tests against the buffers, which didn't allow for any
readers to end up waiting. Running a simple read and wait triggered
a bug.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As we've added __init annotation to field-defining functions, we should
add __refdata annotation to event_call variables, which reference those
functions.
Link: http://lkml.kernel.org/r/51343C1F.2050502@huawei.com
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new multi-buffers added a descriptor that kept track of module
events, and the directories they use, with struct ftace_module_file_ops.
This is used to add a ref count to keep modules from unloading while
their files are being accessed.
As the descriptor is only needed when CONFIG_MODULES is enabled, it
is only declared when the config is enabled. But that struct is
dereferenced in a few areas outside the #ifdef CONFIG_MODULES.
By adding some helper routines and moving code around a little,
events can be compiled again without modules.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the conversion of the data array to per cpu, sparse now complains
about the use of per_cpu_ptr() on the variable. But The variable is
allocated with alloc_percpu() and is fine to use. But since the structure
that contains the data variable does not annotate it as such, sparse
gives out a lot of false warnings.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
These two functions are called during kernel boot only.
Link: http://lkml.kernel.org/r/51258796.7020704@huawei.com
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move duplicate code in event print functions to a helper function.
This shrinks the size of the kernel by ~13K.
text data bss dec hex filename
6596137 1743966 10138672 18478775 119f6b7 vmlinux.o.old
6583002 1743849 10138672 18465523 119c2f3 vmlinux.o.new
Link: http://lkml.kernel.org/r/51258746.2060304@huawei.com
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the logic to wake up on ring buffer data into the ring buffer
code itself. This simplifies the tracing code a lot and also has the
added benefit that waiters on one of the instance buffers can be woken
only when data is added to that instance instead of data added to
any instance.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the ring buffer is empty, a read to trace_pipe_raw wont block.
The tracing code has the infrastructure to wake up waiting readers,
but the trace_pipe_raw doesn't take advantage of that.
When a read is done to trace_pipe_raw without the O_NONBLOCK flag
set, have the read block until there's data in the requested buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_pipe_raw never implemented polling and this was casing
issues for several utilities. This is now implemented.
Blocked reads still are on the TODO list.
Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently only the splice NONBLOCK flag is checked to determine if
the splice read should block or not. But the file descriptor NONBLOCK
flag also needs to be checked.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The names used to display the field and type in the event format
files are copied, as well as the system name that is displayed.
All these names are created by constant values passed in.
If one of theses values were to be removed by a module, the module
would also be required to remove any event it created.
By using the strings directly, we can save over 100K of memory.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The event structures used by the trace events are mostly persistent,
but they are also allocated by kmalloc, which is not the best at
allocating space for what is used. By converting these kmallocs
into kmem_cache_allocs, we can save over 50K of space that is
permanently allocated.
After boot we have:
slab name active allocated size
--------- ------ --------- ----
ftrace_event_file 979 1005 56 67 1
ftrace_event_field 2301 2310 48 77 1
The ftrace_event_file has at boot up 979 active objects out of
1005 allocated in the slabs. Each object is 56 bytes. In a normal
kmalloc, that would allocate 64 bytes for each object.
1005 - 979 = 26 objects not used
26 * 56 = 1456 bytes wasted
But if we used kmalloc:
64 - 56 = 8 bytes unused per allocation
8 * 979 = 7832 bytes wasted
7832 - 1456 = 6376 bytes in savings
Doing the same for ftrace_event_field where there's 2301 objects
allocated in a slab that can hold 2310 with 48 bytes each we have:
2310 - 2301 = 9 objects not used
9 * 48 = 432 bytes wasted
A kmalloc would also use 64 bytes per object:
64 - 48 = 16 bytes unused per allocation
16 * 2301 = 36816 bytes wasted!
36816 - 432 = 36384 bytes in savings
This change gives us a total of 42760 bytes in savings. At least
on my machine, but as there's a lot of these persistent objects
for all configurations that use trace points, this is a net win.
Thanks to Ezequiel Garcia for his trace_analyze presentation which
pointed out the wasted space in my code.
Cc: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new descriptors used to allow multiple buffers in the
tracing directory added, the kernel command line parameter
trace_events=... no longer works. This is because the top level
(global) trace array now has a list of descriptors associated
with the events and the files in the debugfs directory. But in
early bootup, when the command line is processed and the events
enabled, the trace array list of events has not been set up yet.
Without the list of events in the trace array, the setting of
events to record will fail because it would not match any events.
The solution is to set up the top level array in two stages.
The first is to just add the ftrace file descriptors that just point
to the events. This will allow events to be enabled and start tracing.
The second stage is called after the filesystem is set up, and this
stage will create the debugfs event files and directories associated
with the trace array events.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a method to the hijacked dentry descriptor of the
"instances" directory to allow for rmdir to remove an
instance of a multibuffer.
Example:
cd /debug/tracing/instances
mkdir hello
ls
hello/
rmdir hello
ls
Like the mkdir method, the i_mutex is dropped for the instances
directory. The instances directory is created at boot up and can
not be renamed or removed. The trace_types_lock mutex is used to
synchronize adding and removing of instances.
I've run several stress tests with different threads trying to
create and delete directories of the same name, and it has stood
up fine.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the interface ("instances" directory) to add multiple buffers
to ftrace. To create a new instance, simply do a mkdir in the
instances directory:
This will create a directory with the following:
# cd instances
# mkdir foo
# ls foo
buffer_size_kb free_buffer trace_clock trace_pipe
buffer_total_size_kb set_event trace_marker tracing_enabled
events/ trace trace_options tracing_on
Currently only events are able to be set, and there isn't a way
to delete a buffer when one is created (yet).
Note, the i_mutex lock is dropped from the parent "instances"
directory during the mkdir operation. As the "instances" directory
can not be renamed or deleted (created on boot), I do not see
any harm in dropping the lock. The creation of the sub directories
is protected by trace_types_lock mutex, which only lets one
instance get into the code path at a time. If two tasks try to
create or delete directories of the same name, only one will occur
and the other will fail with -EEXIST.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the syscall events record into the global buffer. But if
multiple buffers are in place, then we need to have syscall events
record in the proper buffers.
By adding descriptors to pass to the syscall event functions, the
syscall events can now record into the buffers that have been assigned
to them (one event may be applied to mulitple buffers).
This will allow tracing high volume syscalls along with seldom occurring
syscalls without losing the seldom syscall events.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The global and max-tr currently use static per_cpu arrays for the CPU data
descriptors. But in order to get new allocated trace_arrays, they need to
be allocated per_cpu arrays. Instead of using the static arrays, switch
the global and max-tr to use allocated data.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pass the struct ftrace_event_file *ftrace_file to the
trace_event_buffer_lock_reserve() (new function that replaces the
trace_current_buffer_lock_reserver()).
The ftrace_file holds a pointer to the trace_array that is in use.
In the case of multiple buffers with different trace_arrays, this
allows different events to be recorded into different buffers.
Also fixed some of the stale comments in include/trace/ftrace.h
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The global_trace variable in kernel/trace/trace.c has been kept 'static' and
local to that file so that it would not be used too much outside of that
file. This has paid off, even though there were lots of changes to make
the trace_array structure more generic (not depending on global_trace).
Removal of a lot of direct usages of global_trace is needed to be able to
create more trace_arrays such that we can add multiple buffers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both RING_BUFFER_ALL_CPUS and TRACE_PIPE_ALL_CPU are defined as
-1 and used to say that all the ring buffers are to be modified
or read (instead of just a single cpu, which would be >= 0).
There's no reason to keep TRACE_PIPE_ALL_CPU as it is also started
to be used for more than what it was created for, and now that
the ring buffer code added a generic RING_BUFFER_ALL_CPUS define,
we can clean up the trace code to use that instead and remove
the TRACE_PIPE_ALL_CPU macro.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace events for ftrace are all defined via global variables.
The arrays of events and event systems are linked to a global list.
This prevents multiple users of the event system (what to enable and
what not to).
By adding descriptors to represent the event/file relation, as well
as to which trace_array descriptor they are associated with, allows
for more than one set of events to be defined. Once the trace events
files have a link between the trace event and the trace_array they
are associated with, we can create multiple trace_arrays that can
record separate events in separate buffers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The latency tracers require the buffers to be in overwrite mode,
otherwise they get screwed up. Force the buffers to stay in overwrite
mode when latency tracers are enabled.
Added a flag_changed() method to the tracer structure to allow
the tracers to see what flags are being changed, and also be able
to prevent the change from happing.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Changing the overwrite mode for the ring buffer via the trace
option only sets the normal buffer. But the snapshot buffer could
swap with it, and then the snapshot would be in non overwrite mode
and the normal buffer would be in overwrite mode, even though the
option flag states otherwise.
Keep the two buffers overwrite modes in sync.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Seems that the tracer flags have never been protected from
synchronous writes. Luckily, admins don't usually modify the
tracing flags via two different tasks. But if scripts were to
be used to modify them, then they could get corrupted.
Move the trace_types_lock that protects against tracers changing
to also protect the flags being set.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Because function tracing is very invasive, and can even trace
calls to rcu_read_lock(), RCU access in function tracing is done
with preempt_disable_notrace(). This requires a synchronize_sched()
for updates and not a synchronize_rcu().
Function probes (traceon, traceoff, etc) must be freed after
a synchronize_sched() after its entry has been removed from the
hash. But call_rcu() is used. Fix this by using call_rcu_sched().
Also fix the usage to use hlist_del_rcu() instead of hlist_del().
Cc: stable@vger.kernel.org
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Although the swap is wrapped with a spin_lock, the assignment
of the temp buffer used to swap is not within that lock.
It needs to be moved into that lock, otherwise two swaps
happening on two different CPUs, can end up using the wrong
temp buffer to assign in the swap.
Luckily, all current callers of the swap function appear to have
their own locks. But in case something is added that allows two
different callers to call the swap, then there's a chance that
this race can trigger and corrupt the buffers.
New code is coming soon that will allow for this race to trigger.
I've Cc'd stable, so this bug will not show up if someone backports
one of the changes that can trigger this bug.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"Misc minor fixes mostly related to tracing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
s390: Fix a header dependencies related build error
tracing: update documentation of snapshot utility
tracing: Do not return EINVAL in snapshot when not allocated
tracing: Add help of snapshot feature when snapshot is empty
ftrace: Update the kconfig for DYNAMIC_FTRACE
To use the tracing snapshot feature, writing a '1' into the snapshot
file causes the snapshot buffer to be allocated if it has not already
been allocated and dose a 'swap' with the main buffer, so that the
snapshot now contains what was in the main buffer, and the main buffer
now writes to what was the snapshot buffer.
To free the snapshot buffer, a '0' is written into the snapshot file.
To clear the snapshot buffer, any number but a '0' or '1' is written
into the snapshot file. But if the file is not allocated it returns
-EINVAL error code. This is rather pointless. It is better just to
do nothing and return success.
Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When cat'ing the snapshot file, instead of showing an empty trace
header like the trace file does, show how to use the snapshot
feature.
Also, this is a good place to show if the snapshot has been allocated
or not. Users may want to "pre allocate" the snapshot to have a fast
"swap" of the current buffer. Otherwise, a swap would be slow and might
fail as it would need to allocate the snapshot buffer, and that might
fail under tight memory constraints.
Here's what it looked like before:
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
Here's what it looks like now:
# tracer: nop
#
#
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This adds core architecture support for Imagination's Meta processor
cores, followed by some later miscellaneous arch/metag cleanups and
fixes which I kept separate to ease review:
- Support for basic Meta 1 (ATP) and Meta 2 (HTP) core architecture
- A few fixes all over, particularly for symbol prefixes
- A few privilege protection fixes
- Several cleanups (setup.c includes, split out a lot of metag_ksyms.c)
- Fix some missing exports
- Convert hugetlb to use vm_unmapped_area()
- Copy device tree to non-init memory
- Provide dma_get_sgtable()
Signed-off-by: James Hogan <james.hogan@imgtec.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRMmVXAAoJEKHZs+irPybfivgP/inEXqJyfw59omQdjwvYcU/a
/u0MJ3UKSNS3U+HknfaFCy/Nwk1dqPLjqqyVC1V6AbUPBXlaEwGcimlNRx2uRjdq
Uh96upMLHsNuF/xiiR477g3RwY0egIJdM1R1bGi3mZ3vVrNQGF+wbni6f61xCWGz
M/4rDglpQvE79oLhYdgj6tidZtHQT0YWtERA9W90zkQWXGYmpFPKBKbfZAi5+rKQ
U6Gpg26orUugzXNaxltJEYKE8gjLTppEabx8DARnItZ4zCMy4dw5RBJ35RFvQw6e
eSmfgTy9w9WqBMY2+QMSgU0KQt1IITCzX7OlOXC0jALQJXoU0WWbOELlBVQLCwF1
T0OcR/5ZP/hIlOk5Dh+e9U3AtbASXdMtqA0ZUe78woH1CBf7Nc/0c0vRg23EdMh8
lnHDJxT/UqskoOYLI4kgWbEdLDy4uTh19U2pVi7VCo7ksLB9Bj9Xc8VSKgscSXTl
OwzN+c4Jgtu8FDFTp+Af4AT8pYGJ08j8L2ErsV2sOv3Q44U5WXdrMz3GSgwXj8+4
wZk3HvdkQVkMD5sJCUZgAswaN6BnbB0pHdCz4wMQ8jR/Ogs015Ipk64Ecym9S/4n
uES7PnDtt/4lb5EyX2ScbvdnZTAFTaaP7OOhC77BOQvbQjIW1tkAcxWJqRry86uS
iM0BFgK6Ohx3geqa5Ft0
=65cR
-----END PGP SIGNATURE-----
Merge tag 'metag-v3.9-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag
Pull new ImgTec Meta architecture from James Hogan:
"This adds core architecture support for Imagination's Meta processor
cores, followed by some later miscellaneous arch/metag cleanups and
fixes which I kept separate to ease review:
- Support for basic Meta 1 (ATP) and Meta 2 (HTP) core architecture
- A few fixes all over, particularly for symbol prefixes
- A few privilege protection fixes
- Several cleanups (setup.c includes, split out a lot of
metag_ksyms.c)
- Fix some missing exports
- Convert hugetlb to use vm_unmapped_area()
- Copy device tree to non-init memory
- Provide dma_get_sgtable()"
* tag 'metag-v3.9-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag: (61 commits)
metag: Provide dma_get_sgtable()
metag: prom.h: remove declaration of metag_dt_memblock_reserve()
metag: copy devicetree to non-init memory
metag: cleanup metag_ksyms.c includes
metag: move mm/init.c exports out of metag_ksyms.c
metag: move usercopy.c exports out of metag_ksyms.c
metag: move setup.c exports out of metag_ksyms.c
metag: move kick.c exports out of metag_ksyms.c
metag: move traps.c exports out of metag_ksyms.c
metag: move irq enable out of irqflags.h on SMP
genksyms: fix metag symbol prefix on crc symbols
metag: hugetlb: convert to vm_unmapped_area()
metag: export clear_page and copy_page
metag: export metag_code_cache_flush_all
metag: protect more non-MMU memory regions
metag: make TXPRIVEXT bits explicit
metag: kernel/setup.c: sort includes
perf: Enable building perf tools for Meta
metag: add boot time LNKGET/LNKSET check
metag: add __init to metag_cache_probe()
...
Some 32 bit architectures require 64 bit values to be aligned (for
example Meta which has 64 bit read/write instructions). These require 8
byte alignment of event data too, so use
!CONFIG_HAVE_64BIT_ALIGNED_ACCESS instead of !CONFIG_64BIT ||
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to decide alignment, and align
buffer_data_page::data accordingly.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org> (previous version subtly different)
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:
- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.
- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.
- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"
Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42: "hlist: drop the node parameter from iterators").
* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The prompt to enable DYNAMIC_FTRACE (the ability to nop and
enable function tracing at run time) had a confusing statement:
"enable/disable ftrace tracepoints dynamically"
This was written before tracepoints were added to the kernel,
but now that tracepoints have been added, this is very confusing
and has confused people enough to give wrong information during
presentations.
Not only that, I looked at the help text, and it still references
that dreaded daemon that use to wake up once a second to update
the nop locations and brick NICs, that hasn't been around for over
five years.
Time to bring the text up to the current decade.
Cc: stable@vger.kernel.org
Reported-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
Toshi Kani, and Yinghai Lu.
- ACPI power resources handling and ACPI device PM update from
Rafael J. Wysocki.
- ACPICA update to version 20130117 from Bob Moore and Lv Zheng
with contributions from Aaron Lu, Chao Guan, Jesper Juhl, and
Tim Gardner.
- Support for Intel Lynxpoint LPSS from Mika Westerberg.
- cpuidle update from Len Brown including Intel Haswell support, C1
state for intel_idle, removal of global pm_idle.
- cpuidle fixes and cleanups from Daniel Lezcano.
- cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri
with contributions from Stratos Karafotis and Rickard Andersson.
- Intel P-states driver for Sandy Bridge processors from
Dirk Brandewie.
- cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
- cpufreq fixes related to ordering issues between acpi-cpufreq and
powernow-k8 from Borislav Petkov and Matthew Garrett.
- cpufreq support for Calxeda Highbank processors from Mark Langsdorf
and Rob Herring.
- cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
from Shawn Guo.
- cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
and Inderpal Singh.
- Support for "lightweight suspend" from Zhang Rui.
- Removal of the deprecated power trace API from Paul Gortmaker.
- Assorted updates from Andreas Fleig, Colin Ian King,
Davidlohr Bueso, Joseph Salisbury, Kees Cook, Li Fei,
Nishanth Menon, ShuoX Liu, Srinivas Pandruvada, Tejun Heo,
Thomas Renninger, and Yasuaki Ishimatsu.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJRIsArAAoJEKhOf7ml8uNsD6MP/j7C4NA+GTq6RdwoJt+Yki0K
9Ep8I4pEuRFoN/oskv24EyQhpGJIk6UxWcJ/DWFBc+1VhmKORta7k2Idv/wlJA77
s7AcDveA9xcDh+TVfbh87TeuiMSXiSdDZbiaQO+wMizWJAF3F84AnjiAqqqyQcSK
bA5/Siz/vWlt9PyYDaQtHTVE4lpvPuVcQdYewsdaH2PsmUjvIg/TUzg28CTrdyvv
eHOdBK9R0/OLQLhzRbL0VOGJ//wEl+HJRO0QEhTKPgdQ1e/VH/4Zu5WSzF8P/x4C
s2f8U4IKQqulDuDHXtpMpelFm7hRWgsOqZLkcyXLs+0dvSM9CTPO6P0ZaImxUctk
5daHWEsXUnCErDQawt1mcZP8l6qnxofMQIfLXyPVzvlSnHyToTmrtXa1v2u4AuL/
hOo4MYWsFNUmRdtGFFGlExGgEDZ4G5NwiYjRBl/6XJ3v4nhnnMbuzxP8scpoe5m1
8tjroJHZFUUs/mFU/H+oRbHzSzXPmp1sddNaTg4OpVmTn3DDh6ljnFhiItd1Ndw0
5ldVbSe6ETq5RoK0TbzvQOeVpa9F3JfqbrXLQPqfd2iz/No41LQYG1uShRYuXKuA
wfEcc+c9VMd3FILu05pGwBnU8VS9VbxTYMz7xDxg6b29Ywnb7u+Q1ycCk2gFYtkS
E2oZDuyewTJxaskzYsNr
=wijn
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
Toshi Kani, and Yinghai Lu.
- ACPI power resources handling and ACPI device PM update from Rafael
J Wysocki.
- ACPICA update to version 20130117 from Bob Moore and Lv Zheng with
contributions from Aaron Lu, Chao Guan, Jesper Juhl, and Tim Gardner.
- Support for Intel Lynxpoint LPSS from Mika Westerberg.
- cpuidle update from Len Brown including Intel Haswell support, C1
state for intel_idle, removal of global pm_idle.
- cpuidle fixes and cleanups from Daniel Lezcano.
- cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri with
contributions from Stratos Karafotis and Rickard Andersson.
- Intel P-states driver for Sandy Bridge processors from Dirk
Brandewie.
- cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
- cpufreq fixes related to ordering issues between acpi-cpufreq and
powernow-k8 from Borislav Petkov and Matthew Garrett.
- cpufreq support for Calxeda Highbank processors from Mark Langsdorf
and Rob Herring.
- cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
from Shawn Guo.
- cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
and Inderpal Singh.
- Support for "lightweight suspend" from Zhang Rui.
- Removal of the deprecated power trace API from Paul Gortmaker.
- Assorted updates from Andreas Fleig, Colin Ian King, Davidlohr Bueso,
Joseph Salisbury, Kees Cook, Li Fei, Nishanth Menon, ShuoX Liu,
Srinivas Pandruvada, Tejun Heo, Thomas Renninger, and Yasuaki
Ishimatsu.
* tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (267 commits)
PM idle: remove global declaration of pm_idle
unicore32 idle: delete stray pm_idle comment
openrisc idle: delete pm_idle
mn10300 idle: delete pm_idle
microblaze idle: delete pm_idle
m32r idle: delete pm_idle, and other dead idle code
ia64 idle: delete pm_idle
cris idle: delete idle and pm_idle
ARM64 idle: delete pm_idle
ARM idle: delete pm_idle
blackfin idle: delete pm_idle
sparc idle: rename pm_idle to sparc_idle
sh idle: rename global pm_idle to static sh_idle
x86 idle: rename global pm_idle to static x86_idle
APM idle: register apm_cpu_idle via cpuidle
cpufreq / intel_pstate: Add kernel command line option disable intel_pstate.
cpufreq / intel_pstate: Change to disallow module build
tools/power turbostat: display SMI count by default
intel_idle: export both C1 and C1E
ACPI / hotplug: Fix concurrency issues and memory leaks
...
Pull scheduler changes from Ingo Molnar:
"Main changes:
- scheduler side full-dynticks (user-space execution is undisturbed
and receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready, from Frederic
Weisbecker.
- Initial sched.h split-up changes, by Clark Williams
- select_idle_sibling() performance improvement by Mike Galbraith:
" 1 tbench pair (worst case) in a 10 core + SMT package:
pre 15.22 MB/sec 1 procs
post 252.01 MB/sec 1 procs "
- sched_rr_get_interval() ABI fix/change. We think this detail is not
used by apps (so it's not an ABI in practice), but lets keep it
under observation.
- misc RT scheduling cleanups, optimizations"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
cputime: Remove irqsave from seqlock readers
sched, powerpc: Fix sched.h split-up build failure
cputime: Restore CPU_ACCOUNTING config defaults for PPC64
sched/rt: Move rt specific bits into new header file
sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
sched: Move sched.h sysctl bits into separate header
sched: Fix signedness bug in yield_to()
sched: Fix select_idle_sibling() bouncing cow syndrome
sched/rt: Further simplify pick_rt_task()
sched/rt: Do not account zero delta_exec in update_curr_rt()
cputime: Safely read cputime of full dynticks CPUs
kvm: Prepare to add generic guest entry/exit callbacks
cputime: Use accessors to read task cputime stats
cputime: Allow dynamic switch between tick/virtual based cputime accounting
cputime: Generic on-demand virtual cputime accounting
cputime: Move default nsecs_to_cputime() to jiffies based cputime file
cputime: Librarize per nsecs resolution cputime definitions
cputime: Avoid multiplication overflow on utime scaling
context_tracking: Export context state for generic vtime
...
Fix up conflict in kernel/context_tracking.c due to comment additions.
Pull perf changes from Ingo Molnar:
"There are lots of improvements, the biggest changes are:
Main kernel side changes:
- Improve uprobes performance by adding 'pre-filtering' support, by
Oleg Nesterov.
- Make some POWER7 events available in sysfs, equivalent to what was
done on x86, from Sukadev Bhattiprolu.
- tracing updates by Steve Rostedt - mostly misc fixes and smaller
improvements.
- Use perf/event tracing to report PCI Express advanced errors, by
Tony Luck.
- Enable northbridge performance counters on AMD family 15h, by Jacob
Shin.
- This tracing commit:
tracing: Remove the extra 4 bytes of padding in events
changes the ABI. All involved parties (PowerTop in particular)
seem to agree that it's safe to do now with the introduction of
libtraceevent, but the devil is in the details ...
Main tooling side changes:
- Add 'event group view', from Namyung Kim:
To use it, 'perf record' should group events when recording. And
then perf report parses the saved group relation from file header
and prints them together if --group option is provided. You can
use the 'perf evlist' command to see event group information:
$ perf record -e '{ref-cycles,cycles}' noploop 1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.385 MB perf.data (~16807 samples) ]
$ perf evlist --group
{ref-cycles,cycles}
With this example, default perf report will show you each event
separately.
You can use --group option to enable event group view:
$ perf report --group
...
# group: {ref-cycles,cycles}
# ========
# Samples: 7K of event 'anon group { ref-cycles, cycles }'
# Event count (approx.): 6876107743
#
# Overhead Command Shared Object Symbol
# ................ ....... ................. ..........................
99.84% 99.76% noploop noploop [.] main
0.07% 0.00% noploop ld-2.15.so [.] strcmp
0.03% 0.00% noploop [kernel.kallsyms] [k] timerqueue_del
0.03% 0.03% noploop [kernel.kallsyms] [k] sched_clock_cpu
0.02% 0.00% noploop [kernel.kallsyms] [k] account_user_time
0.01% 0.00% noploop [kernel.kallsyms] [k] __alloc_pages_nodemask
0.00% 0.00% noploop [kernel.kallsyms] [k] native_write_msr_safe
0.00% 0.11% noploop [kernel.kallsyms] [k] _raw_spin_lock
0.00% 0.06% noploop [kernel.kallsyms] [k] find_get_page
0.00% 0.02% noploop [kernel.kallsyms] [k] rcu_check_callbacks
0.00% 0.02% noploop [kernel.kallsyms] [k] __current_kernel_time
As you can see the Overhead column now contains both of ref-cycles
and cycles and header line shows group information also - 'anon
group { ref-cycles, cycles }'. The output is sorted by period of
group leader first.
- Initial GTK+ annotate browser, from Namhyung Kim.
- Add option for runtime switching perf data file in perf report,
just press 's' and a menu with the valid files found in the current
directory will be presented, from Feng Tang.
- Add support to display whole group data for raw columns, from Jiri
Olsa.
- Add per processor socket count aggregation in perf stat, from
Stephane Eranian.
- Add interval printing in 'perf stat', from Stephane Eranian.
- 'perf test' improvements
- Add support for wildcards in tracepoint system name, from Jiri
Olsa.
- Add anonymous huge page recognition, from Joshua Zhu.
- perf build-id cache now can show DSOs present in a perf.data file
that are not in the cache, to integrate with build-id servers being
put in place by organizations such as Fedora.
- perf top now shares more of the evsel config/creation routines with
'record', paving the way for further integration like 'top'
snapshots, etc.
- perf top now supports DWARF callchains.
- Fix mmap limitations on 32-bit, fix from David Miller.
- 'perf bench numa mem' NUMA performance measurement suite
- ... and lots of fixes, performance improvements, cleanups and other
improvements I failed to list - see the shortlog and git log for
details."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (270 commits)
perf/x86/amd: Enable northbridge performance counters on AMD family 15h
perf/hwbp: Fix cleanup in case of kzalloc failure
perf tools: Fix build with bison 2.3 and older.
perf tools: Limit unwind support to x86 archs
perf annotate: Make it to be able to skip unannotatable symbols
perf gtk/annotate: Fail early if it can't annotate
perf gtk/annotate: Show source lines with gray color
perf gtk/annotate: Support multiple event annotation
perf ui/gtk: Implement basic GTK2 annotation browser
perf annotate: Fix warning message on a missing vmlinux
perf buildid-cache: Add --update option
uprobes/perf: Avoid uprobe_apply() whenever possible
uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
uprobes/perf: Teach trace_uprobe/perf code to pre-filter
uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
uprobes: Introduce uprobe_apply()
perf: Introduce hw_perf_event->tp_target and ->tp_list
uprobes/perf: Always increment trace_uprobe->nhit
uprobes/tracing: Kill uprobe_trace_consumer, embed uprobe_consumer into trace_uprobe
uprobes/tracing: Introduce is_trace_uprobe_enabled()
...
Commit: c1bf08ac "ftrace: Be first to run code modification on modules"
changed ftrace module notifier's priority to INT_MAX in order to
process the ftrace nops before anything else could touch them
(namely kprobes). This was the correct thing to do.
Unfortunately, the ftrace module notifier also contains the ftrace
clean up code. As opposed to the set up code, this code should be
run *after* all the module notifiers have run in case a module is doing
correct clean-up and unregisters its ftrace hooks. Basically, ftrace
needs to do clean up on module removal, as it needs to know about code
being removed so that it doesn't try to modify that code. But after it
removes the module from its records, if a ftrace user tries to remove
a probe, that removal will fail due as the record of that code segment
no longer exists.
Nothing really bad happens if the probe removal is called after ftrace
did the clean up, but the ftrace removal function will return an error.
Correct code (such as kprobes) will produce a WARN_ON() if it fails
to remove the probe. As people get annoyed by frivolous warnings, it's
best to do the ftrace clean up after everything else.
By splitting the ftrace_module_notifier into two notifiers, one that
does the module load setup that is run at high priority, and the other
that is called for module clean up that is run at low priority, the
problem is solved.
Cc: stable@vger.kernel.org
Reported-by: Frank Ch. Eigler <fche@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing of ia32 compat system calls has been a bit of a pain as they
use different system call numbers than the 64bit equivalents.
I wrote a simple 'lls' program that lists files. I compiled it as a i686
ELF binary and ran it under a x86_64 box. This is the result:
echo 0 > /debug/tracing/tracing_on
echo 1 > /debug/tracing/events/syscalls/enable
echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on
grep lls /debug/tracing/trace
[.. skipping calls before TS_COMPAT is set ...]
lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420)
lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000
lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22)
lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000
lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4)
lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe
lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000)
lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3
lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060)
lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0
lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2)
lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000
lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614)
lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0
lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000)
lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3
lls-1127 [005] d... 936.409279: sys_close(fd: 3)
lls-1127 [005] d... 936.421550: sys_close -> 0x200
lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0)
lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0
lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802)
lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000
lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000)
lls-1127 [005] d... 936.421580: sys_capget -> 0x0
lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812)
lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000
lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32)
lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000
lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0)
lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0
lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22)
Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are
obviously incorrect, and confusing.
Other efforts have been made to fix this:
https://lkml.org/lkml/2012/3/26/367
But the real solution is to rewrite the syscall internals and come up
with a fixed solution. One that doesn't require all the kluge that the
current solution has.
Thus for now, instead of outputting incorrect data, simply ignore them.
With this patch the changes now have:
#> grep lls /debug/tracing/trace
#>
Compat system calls simply are not traced. If users need compat
syscalls, then they should just use the raw syscall tracepoints.
For an architecture to make their compat syscalls ignored, it must
define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also
define an arch_trace_is_compat_syscall() function that will return true
if the current task should ignore tracing the syscall.
I want to stress that this change does not affect actual syscalls in any
way, shape or form. It is only used within the tracing system and
doesn't interfere with the syscall logic at all. The changes are
consolidated nicely into trace_syscalls.c and asm/ftrace.h.
I had to make one small modification to asm/thread_info.h and that was
to remove the include of asm/ftrace.h. As asm/ftrace.h required the
current_thread_info() it was causing include hell. That include was
added back in 2008 when the function graph tracer was added:
commit caf4b323 "tracing, x86: add low level support for ftrace return tracing"
It does not need to be included there.
Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
uprobe_perf_open/close call the costly uprobe_apply() every time,
we can avoid it if:
- "nr_systemwide != 0" is not changed.
- There is another process/thread with the same ->mm.
- copy_proccess() does inherit_event(). dup_mmap() preserves the
inserted breakpoints.
- event->attr.enable_on_exec == T, we can rely on uprobe_mmap()
called by exec/mmap paths.
- tp_target is exiting. Only _close() checks PF_EXITING, I don't
think TRACE_REG_PERF_OPEN can hit the dying task too often.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Change uprobe_trace_func() and uprobe_perf_func() to return "int". Change
uprobe_dispatcher() to return "trace_ret | perf_ret" although this is not
needed, currently TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive.
The only functional change is that uprobe_perf_func() checks the filtering
too and returns UPROBE_HANDLER_REMOVE if nobody wants to trace current.
Testing:
# perf probe -x /lib/libc.so.6 syscall
# perf record -e probe_libc:syscall -i perl -e 'fork; syscall -1 for 1..10; wait'
# perf report --show-total-period
100.00% 10 perl libc-2.8.so [.] syscall
Before this patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 20
A child process doesn't have a counter, but still it hits this breakoint
"copied" by dup_mmap().
After the patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 11
The child process hits this int3 only once and does unapply_uprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Finally implement uprobe_perf_filter() which checks ->nr_systemwide or
->perf_events to figure out whether we need to insert the breakpoint.
uprobe_perf_open/close are changed to do uprobe_apply(true/false) when
the new perf event comes or goes away.
Note that currently this is very suboptimal:
- uprobe_register() called by TRACE_REG_PERF_REGISTER becomes a
heavy nop, consumer->filter() always returns F at this stage.
As it was already discussed we need uprobe_register_only() to
avoid the costly register_for_each_vma() when possible.
- uprobe_apply() is oftenly overkill. Unless "nr_systemwide != 0"
changes we need uprobe_apply_mm(), unapply_uprobe() is almost
what we need.
- uprobe_apply() can be simply avoided sometimes, see the next
changes.
Testing:
# perf probe -x /lib/libc.so.6 syscall
# perl -e 'syscall -1 while 1' &
[1] 530
# perf record -e probe_libc:syscall perl -e 'syscall -1 for 1..10; sleep 1'
# perf report --show-total-period
100.00% 10 perl libc-2.8.so [.] syscall
Before this patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 79291
A huge ->nrhit == 79291 reflects the fact that the background process
530 constantly hits this breakpoint too, even if doesn't contribute to
the output.
After the patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 10
This shows that only the target process was punished by int3.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Introduce "struct trace_uprobe_filter" which records the "active"
perf_event's attached to ftrace_event_call. For the start we simply
use list_head, we can optimize this later if needed. For example, we
do not really need to record an event with ->parent != NULL, we can
rely on parent->child_list. And we can certainly do some optimizations
for the case when 2 events have the same ->tp_target or tp_target->mm.
Change trace_uprobe_register() to process TRACE_REG_PERF_OPEN/CLOSE
and add/del this perf_event to the list.
We can probably avoid any locking, but lets start with the "obvioulsy
correct" trace_uprobe_filter->rwlock which protects everything.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Move tu->nhit++ from uprobe_trace_func() to uprobe_dispatcher().
->nhit counts how many time we hit the breakpoint inserted by this
uprobe, we do not want to loose this info if uprobe was enabled by
sys_perf_event_open().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
trace_uprobe->consumer and "struct uprobe_trace_consumer" add the
unnecessary indirection and complicate the code for no reason.
This patch simply embeds uprobe_consumer into "struct trace_uprobe",
all other changes only fix the compilation errors.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
probe_event_enable/disable() check tu->consumer != NULL to avoid the
wrong uprobe_register/unregister().
We are going to kill this pointer and "struct uprobe_trace_consumer",
so we add the new helper, is_trace_uprobe_enabled(), which can rely
on TP_FLAG_TRACE/TP_FLAG_PROFILE instead.
Note: the current logic doesn't look optimal, it is not clear why
TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive, we will probably
change this later.
Also kill the unused TP_FLAG_UPROBE.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
probe_event_enable/disable() check tu->inode != NULL at the start.
This is ugly, if igrab() can fail create_trace_uprobe() should not
succeed and "postpone" the failure.
And S_ISREG(inode->i_mode) check added by d24d7dbf is not safe.
Note: alloc_uprobe() should probably check igrab() != NULL as well.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
probe_event_enable() does uprobe_register() and only after that sets
utc->tu and tu->consumer/flags. This can race with uprobe_dispatcher()
which can miss these assignments or see them out of order. Nothing
really bad can happen, but this doesn't look clean/safe.
And this does not allow to use uprobe_consumer->filter() we are going
to add, it is called by uprobe_register() and it needs utc->tu.
Change this code to initialize everything before uprobe_register(), and
reset tu->consumer/flags if it fails. We can't race with event_disable(),
the caller holds event_mutex, and if we could the code would be wrong
anyway.
In fact I think uprobe_trace_consumer should die, it buys nothing but
complicates the code. We can simply add uprobe_consumer into trace_uprobe.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
create_trace_uprobe() does kern_path() to find ->d_inode, but forgets
to do path_put(). We can do this right after igrab().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Change handle_swbp() to set regs->ip = bp_vaddr in advance, this is
what consumer->handler() needs but uprobe_get_swbp_addr() is not
exported.
This also simplifies the code and makes it more consistent across
the supported architectures. handle_swbp() becomes the only caller
of uprobe_get_swbp_addr().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
uprobe_consumer->filter() is pointless in its current form, kill it.
We will add it back, but with the different signature/semantics. Perhaps
we will even re-introduce the callsite in handler_chain(), but not to
just skip uc->handler().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On early boot up, when the ftrace ring buffer is initialized, the
static variable current_trace is initialized to &nop_trace.
Before this initialization, current_trace is NULL and will never
become NULL again. It is always reassigned to a ftrace tracer.
Several places check if current_trace is NULL before it uses
it, and this check is frivolous, because at the point in time
when the checks are made the only way current_trace could be
NULL is if ftrace failed its allocations at boot up, and the
paths to these locations would probably not be possible.
By initializing current_trace to &nop_trace where it is declared,
current_trace will never be NULL, and we can remove all these
checks of current_trace being NULL which never needed to be
checked in the first place.
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Ftrace has a snapshot feature available from kernel space and
latency tracers (e.g. irqsoff) are using it. This patch enables
user applictions to take a snapshot via debugfs.
Add "snapshot" debugfs file in "tracing" directory.
snapshot:
This is used to take a snapshot and to read the output of the
snapshot.
# echo 1 > snapshot
This will allocate the spare buffer for snapshot (if it is
not allocated), and take a snapshot.
# cat snapshot
This will show contents of the snapshot.
# echo 0 > snapshot
This will free the snapshot if it is allocated.
Any other positive values will clear the snapshot contents if
the snapshot is allocated, or return EINVAL if it is not allocated.
Link: http://lkml.kernel.org/r/20121226025300.3252.86850.stgit@liselsia
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
[
Fixed irqsoff selftest and also a conflict with a change
that fixes the update_max_tr.
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the trace buffer read functions use a static variable
"old_tracer" for detecting if the current tracer changes. This
was suitable for a single trace file ("trace"), but to add a
snapshot feature that will use the same function for its file,
a check against a static variable is not sufficient.
To use the output functions for two different files, instead of
storing the current tracer in a static variable, as the trace
iterator descriptor contains a pointer to the original current
tracer's name, that pointer can now be used to check if the
current tracer has changed between different reads of the trace
file.
Link: http://lkml.kernel.org/r/20121226025252.3252.9276.stgit@liselsia
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For systems with an unstable sched_clock, all cpu_clock() does is enable/
disable local irq during the call to sched_clock_cpu(). And for stable
systems they are same.
trace_clock_global() already disables interrupts, so it can call
sched_clock_cpu() directly.
Link: http://lkml.kernel.org/r/1356576585-28782-2-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a stat about the number of events read from the ring buffer:
# cat /debug/tracing/per_cpu/cpu0/stats
entries: 39869
overrun: 870512
commit overrun: 0
bytes: 1449912
oldest event ts: 6561.368690
now ts: 6565.246426
dropped events: 0
read events: 112 <-- Added
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While debugging the virtual cputime with the function graph tracer
with a max_depth of 1 (most common use of the max_depth so far),
I found that I was missing kernel execution because of a race condition.
The code for the return side of the function has a slight race:
ftrace_pop_return_trace(&trace, &ret, frame_pointer);
trace.rettime = trace_clock_local();
ftrace_graph_return(&trace);
barrier();
current->curr_ret_stack--;
The ftrace_pop_return_trace() initializes the trace structure for
the callback. The ftrace_graph_return() uses the trace structure
for its own use as that structure is on the stack and is local
to this function. Then the curr_ret_stack is decremented which
is what the trace.depth is set to.
If an interrupt comes in after the ftrace_graph_return() but
before the curr_ret_stack, then the called function will get
a depth of 2. If max_depth is set to 1 this function will be
ignored.
The problem is that the trace has already been called, and the
timestamp for that trace will not reflect the time the function
was about to re-enter userspace. Calls to the interrupt will not
be traced because the max_depth has prevented this.
To solve this issue, the ftrace_graph_return() can safely be
moved after the current->curr_ret_stack has been updated.
This way the timestamp for the return callback will reflect
the actual time.
If an interrupt comes in after the curr_ret_stack update and
ftrace_graph_return(), it will be traced. It may look a little
confusing to see it within the other function, but at least
it will not be lost.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
__this_cpu_inc_return() or __this_cpu_dec generates a single instruction,
which is faster than __get_cpu_var operation.
Link: http://lkml.kernel.org/r/50A9C1BD.1060308@gmail.com
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The text in Documentation said it would be removed in 2.6.41;
the text in the Kconfig said removal in the 3.1 release. Either
way you look at it, we are well past both, so push it off a cliff.
Note that the POWER_CSTATE and the POWER_PSTATE are part of the
legacy tracing API. Remove all tracepoints which use these flags.
As can be seen from context, most already have a trace entry via
trace_cpu_idle anyways.
Also, the cpufreq/cpufreq.c PSTATE one is actually unpaired, as
compared to the CSTATE ones which all have a clear start/stop.
As part of this, the trace_power_frequency also becomes orphaned,
so it too is deleted.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Dan's smatch found a compare bug with the result of the
trace_test_and_set_recursion() and comparing to less than
zero. If the function fails, it returns -1, but was saved in
an unsigned int, which will never be less than zero and will
ignore the result of the test if a recursion did happen.
Luckily this is the last of the recursion tests, as the
infrastructure of ftrace would catch recursions before it
got here, except for some few exceptions.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ring_buffer.c use to require declarations from trace.h, but
these have moved to the generic header files. There's nothing
in trace.h that ring_buffer.c requires.
There's some headers that trace.h included that ring_buffer.c
needs, but it's best that it includes them directly, and not
include trace.h.
Also, some things may use ring_buffer.c without having tracing
configured. This removes the dependency that may come in the
future.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using context bit recursion checking, we can help increase the
performance of the ring buffer.
Before this patch:
# echo function > /debug/tracing/current_tracer
# for i in `seq 10`; do ./hackbench 50; done
Time: 10.285
Time: 10.407
Time: 10.243
Time: 10.372
Time: 10.380
Time: 10.198
Time: 10.272
Time: 10.354
Time: 10.248
Time: 10.253
(average: 10.3012)
Now we have:
# echo function > /debug/tracing/current_tracer
# for i in `seq 10`; do ./hackbench 50; done
Time: 9.712
Time: 9.824
Time: 9.861
Time: 9.827
Time: 9.962
Time: 9.905
Time: 9.886
Time: 10.088
Time: 9.861
Time: 9.834
(average: 9.876)
a 4% savings!
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracer had two different versions of function tracing.
The disabling of irqs version and the preempt disable version.
As function tracing in very intrusive and can cause nasty recursion
issues, it has its own recursion protection. But the old method to
do this was a flat layer. If it detected that a recursion was happening
then it would just return without recording.
This made the preempt version (much faster than the irq disabling one)
not very useful, because if an interrupt were to occur after the
recursion flag was set, the interrupt would not be traced at all,
because every function that was traced would think it recursed on
itself (due to the context it preempted setting the recursive flag).
Now that we have a recursion flag for every context level, we
no longer need to worry about that. We can disable preemption,
set the current context recursion check bit, and go on. If an
interrupt were to come along, it would check its own context bit
and happily continue to trace.
As the preempt version is faster than the irq disable version,
there's no more reason to keep the preempt version around.
And the irq disable version still had an issue with missing
out on tracing NMI code.
Remove the irq disable function tracer version and have the
preempt disable version be the default (and only version).
Before this patch we had from running:
# echo function > /debug/tracing/current_tracer
# for i in `seq 10`; do ./hackbench 50; done
Time: 12.028
Time: 11.945
Time: 11.925
Time: 11.964
Time: 12.002
Time: 11.910
Time: 11.944
Time: 11.929
Time: 11.941
Time: 11.924
(average: 11.9512)
Now we have:
# echo function > /debug/tracing/current_tracer
# for i in `seq 10`; do ./hackbench 50; done
Time: 10.285
Time: 10.407
Time: 10.243
Time: 10.372
Time: 10.380
Time: 10.198
Time: 10.272
Time: 10.354
Time: 10.248
Time: 10.253
(average: 10.3012)
a 13.8% savings!
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When function tracing occurs, the following steps are made:
If arch does not support a ftrace feature:
call internal function (uses INTERNAL bits) which calls...
If callback is registered to the "global" list, the list
function is called and recursion checks the GLOBAL bits.
then this function calls...
The function callback, which can use the FTRACE bits to
check for recursion.
Now if the arch does not suppport a feature, and it calls
the global list function which calls the ftrace callback
all three of these steps will do a recursion protection.
There's no reason to do one if the previous caller already
did. The recursion that we are protecting against will
go through the same steps again.
To prevent the multiple recursion checks, if a recursion
bit is set that is higher than the MAX bit of the current
check, then we know that the check was made by the previous
caller, and we can skip the current check.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently for recursion checking in the function tracer, ftrace
tests a task_struct bit to determine if the function tracer had
recursed or not. If it has, then it will will return without going
further.
But this leads to races. If an interrupt came in after the bit
was set, the functions being traced would see that bit set and
think that the function tracer recursed on itself, and would return.
Instead add a bit for each context (normal, softirq, irq and nmi).
A check of which context the task is in is made before testing the
associated bit. Now if an interrupt preempts the function tracer
after the previous context has been set, the interrupt functions
can still be traced.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is lots of places that perform:
op = rcu_dereference_raw(ftrace_control_list);
while (op != &ftrace_list_end) {
Add a helper macro to do this, and also optimize for a single
entity. That is, gcc will optimize a loop for either no iterations
or more than one iteration. But usually only a single callback
is registered to the function tracer, thus the optimized case
should be a single pass. to do this we now do:
op = rcu_dereference_raw(list);
do {
[...]
} while (likely(op = rcu_dereference_raw((op)->next)) &&
unlikely((op) != &ftrace_list_end));
An op is always registered (ftrace_list_end when no callbacks is
registered), thus when a single callback is registered, the link
list looks like:
top => callback => ftrace_list_end => NULL.
The likely(op = op->next) still must be performed due to the race
of removing the callback, where the first op assignment could
equal ftrace_list_end. In that case, the op->next would be NULL.
But this is unlikely (only happens in a race condition when
removing the callback).
But it is very likely that the next op would be ftrace_list_end,
unless more than one callback has been registered. This tells
gcc what the most common case is and makes the fast path with
the least amount of branches.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing recursion self test should not crash
the machine if the resursion test fails. If it detects that
the function tracing is recursing when it should not be, then
bail, don't go into an infinite recursive loop.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If one of the function tracers set by the global ops is not recursion
safe, it can still be called directly without the added recursion
supplied by the ftrace infrastructure.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The test that checks function recursion does things differently
if the arch does not support all ftrace features. But that really
doesn't make a difference with how the test runs, and either way
the count variable should be 2 at the end.
Currently the test wrongly fails for archs that don't support all
the ftrace features.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a race condition between the setting of a new tracer and
the update of the max trace buffers (the swap). When a new tracer
is added, it sets current_trace to nop_trace before disabling
the old tracer. At this moment, if the old tracer uses update_max_tr(),
the update may trigger the warning against !current_trace->use_max-tr,
as nop_trace doesn't have that set.
As update_max_tr() requires that interrupts be disabled, we can
add a check to see if current_trace == nop_trace and bail if it
does. Then when disabling the current_trace, set it to nop_trace
and run synchronize_sched(). This will make sure all calls to
update_max_tr() have completed (it was called with interrupts disabled).
As a clean up, this commit also removes shrinking and recreating
the max_tr buffer if the old and new tracers both have use_max_tr set.
The old way use to always shrink the buffer, and then expand it
for the next tracer. This is a waste of time.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As trace_clock is used by other things besides tracing, and it
does not require anything from trace.h, it is best not to include
the header file in trace_clock.c.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Due to a userspace issue with PowerTop v2beta, which hardcoded
the offset of event fields that it was using, it broke when
we removed the Big Kernel Lock counter from the event header.
(commit e6e1e2593 "tracing: Remove lock_depth from event entry")
Because this broke userspace, it was determined that we must
keep those 4 bytes around.
(commit a3a4a5acd "Regression: partial revert "tracing: Remove lock_depth from event entry"")
This unfortunately wastes space in the ring buffer. 4 bytes per
event, where a lot of events are just 24 bytes. That's 16% of the
buffer wasted. A million events will add 4 megs of white space
into the buffer.
It was later noticed that PowerTop v2beta could not work on systems
where the kernel was 64 bit but the userspace was 32 bits.
The reason was because the offsets are different between the
two and the hard coded offset of one would not work with the other.
With PowerTop v2 final, it implemented the same interface that both
perf and trace-cmd use. That is, it reads the format file of
the event to find the offsets of the fields it needs. This fixes
the problem with running powertop on a 32 bit userspace running
on a 64 bit kernel. It also no longer requires the 4 byte padding.
As PowerTop v2 has been out for a while, and is included in all
major distributions, it is time that we can safely remove the
4 bytes of padding. Users of PowerTop v2beta should upgrade to
PowerTop v2 final.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move SAVE_REGS support flag into Kconfig and rename
it to CONFIG_DYNAMIC_FTRACE_WITH_REGS. This also introduces
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which indicates
the architecture depending part of ftrace has a code
that saves full registers.
On the other hand, CONFIG_DYNAMIC_FTRACE_WITH_REGS indicates
the code is enabled.
Link: http://lkml.kernel.org/r/20120928081516.3560.72534.stgit@ltc138.sdl.hitachi.co.jp
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the file max_graph_depth to the debug tracing directory that lets
the user define the depth of the function graph.
A very useful operation is to set the depth to 1. Then it traces only
the first function that is called when entering the kernel. This can
be used to determine what system operations interrupt a process.
For example, to work on NOHZ processes (single tasks running without
a timer tick), if any interrupt goes off and preempts that task, this
code will show it happening.
# cd /sys/kernel/debug/tracing
# echo 1 > max_graph_depth
# echo function_graph > current_tracer
# cat per_cpu/cpu/<cpu-of-process>/trace
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's now a check in tracing_reset_online_cpus() if the buffer is
allocated or NULL. No need to do a check before calling it with max_tr.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
max_tr->buffer could be NULL in the tracing_reset{_online_cpus}. In this
case, a NULL pointer dereference happens, so we should return immediately
from these functions.
Note, the current code does not call tracing_reset*() with max_tr when
its buffer is NULL, but future code will. This patch is needed to prevent
the future code from crashing.
Link: http://lkml.kernel.org/r/20121219070234.31200.93863.stgit@liselsia
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some functions in the syscall tracing is used only locally to
the file, but they are labeled global. Convert them to static functions.
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Without this patch, we can register a uprobe event for a directory.
Enabling such a uprobe event would anyway fail.
Example:
$ echo 'p /bin:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
However dirctories cannot be valid targets for uprobe.
Hence verify if the target is a regular file during the probe
registration.
Link: http://lkml.kernel.org/r/20130103004212.690763002@goodmis.org
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ cleaned up whitespace and removed redundant IS_DIR() check ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
typeof(&buffer) is a pointer to array of 1024 char, or char (*)[1024].
But, typeof(&buffer[0]) is a pointer to char which match the return type of get_trace_buf().
As well-known, the value of &buffer is equal to &buffer[0].
so return this_cpu_ptr(&percpu_buffer->buffer[0]) can avoid type cast.
Link: http://lkml.kernel.org/r/50A1A800.3020102@gmail.com
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The original ring-buffer code had special checks at the start
of rb_advance_iter() and instead of repeating them again at the
end of the function if a certain condition existed, I just did
a recursive call to rb_advance_iter() because the special condition
would cause rb_advance_iter() to return early (after the checks).
But as things have changed, the special checks no longer exist
and the only thing done for the special_condition is to call
rb_inc_iter() and return. Instead of doing a confusing recursive call,
just call rb_inc_iter instead.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If some other kernel subsystem has a module notifier, and adds a kprobe
to a ftrace mcount point (now that kprobes work on ftrace points),
when the ftrace notifier runs it will fail and disable ftrace, as well
as kprobes that are attached to ftrace points.
Here's the error:
WARNING: at kernel/trace/ftrace.c:1618 ftrace_bug+0x239/0x280()
Hardware name: Bochs
Modules linked in: fat(+) stap_56d28a51b3fe546293ca0700b10bcb29__8059(F) nfsv4 auth_rpcgss nfs dns_resolver fscache xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack lockd sunrpc ppdev parport_pc parport microcode virtio_net i2c_piix4 drm_kms_helper ttm drm i2c_core [last unloaded: bid_shared]
Pid: 8068, comm: modprobe Tainted: GF 3.7.0-0.rc8.git0.1.fc19.x86_64 #1
Call Trace:
[<ffffffff8105e70f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff81134106>] ? __probe_kernel_read+0x46/0x70
[<ffffffffa0180000>] ? 0xffffffffa017ffff
[<ffffffffa0180000>] ? 0xffffffffa017ffff
[<ffffffff8105e76a>] warn_slowpath_null+0x1a/0x20
[<ffffffff810fd189>] ftrace_bug+0x239/0x280
[<ffffffff810fd626>] ftrace_process_locs+0x376/0x520
[<ffffffff810fefb7>] ftrace_module_notify+0x47/0x50
[<ffffffff8163912d>] notifier_call_chain+0x4d/0x70
[<ffffffff810882f8>] __blocking_notifier_call_chain+0x58/0x80
[<ffffffff81088336>] blocking_notifier_call_chain+0x16/0x20
[<ffffffff810c2a23>] sys_init_module+0x73/0x220
[<ffffffff8163d719>] system_call_fastpath+0x16/0x1b
---[ end trace 9ef46351e53bbf80 ]---
ftrace failed to modify [<ffffffffa0180000>] init_once+0x0/0x20 [fat]
actual: cc:bb:d2:4b:e1
A kprobe was added to the init_once() function in the fat module on load.
But this happened before ftrace could have touched the code. As ftrace
didn't run yet, the kprobe system had no idea it was a ftrace point and
simply added a breakpoint to the code (0xcc in the cc:bb:d2:4b:e1).
Then when ftrace went to modify the location from a call to mcount/fentry
into a nop, it didn't see a call op, but instead it saw the breakpoint op
and not knowing what to do with it, ftrace shut itself down.
The solution is to simply give the ftrace module notifier the max priority.
This should have been done regardless, as the core code ftrace modification
also happens very early on in boot up. This makes the module modification
closer to core modification.
Link: http://lkml.kernel.org/r/20130107140333.593683061@goodmis.org
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reported-by: Frank Ch. Eigler <fche@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 0fb9656d "tracing: Make tracing_enabled be equal to tracing_on"
changes the behaviour of trace_pipe, ie. it makes trace_pipe return if
we've read something and tracing is enabled, and this means that we have
to 'cat trace_pipe' again and again while running tests.
IMO the right way is if tracing is enabled, we always block and wait for
ring buffer, or we may lose what we want since ring buffer's size is limited.
Link: http://lkml.kernel.org/r/1358132051-5410-1-git-send-email-bo.li.liu@oracle.com
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
bio_{front|back}_merge tracepoints report a bio merging into an
existing request but didn't specify which request the bio is being
merged into. Add @req to it. This makes it impossible to share the
event template with block_bio_queue - split it out.
@req isn't used or exported to userland at this point and there is no
userland visible behavior change. Later changes will make use of the
extra parameter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio completion didn't kick block_bio_complete TP. Only dm was
explicitly triggering the TP on IO completion. This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.
This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.
* Explicit trace_block_bio_complete() invocation removed from dm and
the trace point is unexported.
* @rq dropped from trace_block_bio_complete(). bios may fly around
w/o queue associated. Verifying and accessing the assocaited queue
belongs to TP probes.
* blktrace now gets both request and bio completions. Make it ignore
bio completions if request completion path is happening.
This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.
v2: With this change, block_bio_complete TP could be invoked on sg
commands which have bio's with %NULL bi_bdev. Update TP
assignment code to check whether bio->bi_bdev is %NULL before
dereferencing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 02404baf1b "tracing: Remove deprecated tracing_enabled file"
removed the tracing_enabled file as it never worked properly and
the tracing_on file should be used instead. But the tracing_on file
didn't call into the tracers start/stop routines like the
tracing_enabled file did. This caused trace-cmd to break when it
enabled the irqsoff tracer.
If you just did "echo irqsoff > current_tracer" then it would work
properly. But the tool trace-cmd disables tracing first by writing
"0" into the tracing_on file. Then it writes "irqsoff" into
current_tracer and then writes "1" into tracing_on. Unfortunately,
the above commit changed the irqsoff tracer to check the tracing_on
status instead of the tracing_enabled status. If it's disabled then
it does not start the tracer internals.
The problem is that writing "1" into tracing_on does not call the
tracers "start" routine like writing "1" into tracing_enabled did.
This makes the irqsoff tracer not start when using the trace-cmd
tool, and is a regression for userspace.
Simple fix is to have the tracing_on file call the tracers start()
method when being enabled (and the stop() method when disabled).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The latest change to allow trace options to be set on the command
line also broke the trace_options file.
The zeroing of the last byte of the option name that is echoed into
the trace_option file was removed with the consolidation of some
of the code. The compare between the option and what was written to
the trace_options file fails because the string holding the data
written doesn't terminate with a null character.
A zero needs to be added to the end of the string copied from
user space.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The rcutorture tests need to be able to trace the time of the
beginning of an RCU read-side critical section, and thus need access
to trace_clock_local(). This commit therefore adds a the needed
EXPORT_SYMBOL_GPL().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Pull minor tracing updates and fixes from Steven Rostedt:
"It seems that one of my old pull requests have slipped through.
The changes are contained to just the files that I maintain, and are
changes from others that I told I would get into this merge window.
They have already been in linux-next for several weeks, and should be
well tested."
* 'tip/perf/core-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove unnecessary WARN_ONCE's from tracing_buffers_splice_read
tracing: Remove unneeded checks from the stack tracer
tracing: Add a resize function to make one buffer equivalent to another buffer
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
Pull perf fixes from Ingo Molnar:
"These are late-v3.7 pending fixes for tracing."
Fix up trivial conflict in kernel/trace/ring_buffer.c: the NULL pointer
fix clashed with the change of type of the 'ret' variable.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ring-buffer: Fix race between integrity check and readers
ring-buffer: Fix NULL pointer if rb_set_head_page() fails
ftrace: Clear bits properly in reset_iter_read()
I've legally changed my name with New York State, the US Social Security
Administration, et al. This patch propagates the name change and change
in initials and login to comments in the kernel source as well.
Signed-off-by: Nadia Yvette Chambers <nyc@holomorphy.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The function rb_check_pages() was added to make sure the ring buffer's
pages were sane. This check is done when the ring buffer size is modified
as well as when the iterator is released (closing the "trace" file),
as that was considered a non fast path and a good place to do a sanity
check.
The problem is that the check does not have any locks around it.
If one process were to read the trace file, and another were to read
the raw binary file, the check could happen while the reader is reading
the file.
The issues with this is that the check requires to clear the HEAD page
before doing the full check and it restores it afterward. But readers
require the HEAD page to exist before it can read the buffer, otherwise
it gives a nasty warning and disables the buffer.
By adding the reader lock around the check, this keeps the race from
happening.
Cc: stable@vger.kernel.org # 3.6
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function rb_set_head_page() searches the list of ring buffer
pages for a the page that has the HEAD page flag set. If it does
not find it, it will do a WARN_ON(), disable the ring buffer and
return NULL, as this should never happen.
But if this bug happens to happen, not all callers of this function
can handle a NULL pointer being returned from it. That needs to be
fixed.
Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
WARN shouldn't be used as a means of communicating failure to a userspace programmer.
Link: http://lkml.kernel.org/r/20120725153908.GA25203@redhat.com
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems that 'ftrace_enabled' flag should not be used inside the tracer
functions. The ftrace core is using this flag for internal purposes, and
the flag wasn't meant to be used in tracers' runtime checks.
stack tracer is the only tracer that abusing the flag. So stop it from
serving as a bad example.
Also, there is a local 'stack_trace_disabled' flag in the stack tracer,
which is never updated; so it can be removed as well.
Link: http://lkml.kernel.org/r/1342637761-9655-1-git-send-email-anton.vorontsov@linaro.org
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Trace buffer size is now per-cpu, so that there are the following two
patterns in resizing of buffers.
(1) resize per-cpu buffers to same given size
(2) resize per-cpu buffers to another trace_array's buffer size
for each CPU (such as preparing the max_tr which is equivalent
to the global_trace's size)
__tracing_resize_ring_buffer() can be used for (1), and had
implemented (2) inside it for resetting the global_trace to the
original size.
(2) was also implemented in another place. So this patch assembles
them in a new function - resize_buffer_duplicate_size().
Link: http://lkml.kernel.org/r/20121017025616.2627.91226.stgit@falsita
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is a typo here where '&' is used instead of '|' and it turns the
statement into a noop. The original code is equivalent to:
iter->flags &= ~((1 << 2) & (1 << 4));
Link: http://lkml.kernel.org/r/20120609161027.GD6488@elgon.mountain
Cc: stable@vger.kernel.org # all of them
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Show raw time stamp values for stats per cpu if you choose counter or tsc mode
for trace_clock. Although a unit of tracing time stamp is nsec in local or global mode,
the units in counter and TSC mode are tracing counter and cycles respectively.
Link: http://lkml.kernel.org/r/1352837903-32191-3-git-send-email-dhsharp@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to promote interoperability between userspace tracers and ftrace,
add a trace_clock that reports raw TSC values which will then be recorded
in the ring buffer. Userspace tracers that also record TSCs are then on
exactly the same time base as the kernel and events can be unambiguously
interlaced.
Tested: Enabled a tracepoint and the "tsc" trace_clock and saw very large
timestamp values.
v2:
Move arch-specific bits out of generic code.
v3:
Rename "x86-tsc", cleanups
v7:
Generic arch bits in Kbuild.
Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1352837903-32191-1-git-send-email-dhsharp@google.com
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add trace_options to the kernel command line parameter to be able to
set options at early boot. For example, to enable stack dumps of
events, add the following:
trace_options=stacktrace
This along with the trace_event option, you can get not only
traces of the events but also the stack dumps with them.
Requested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Have the ring buffer commit function use the irq_work infrastructure to
wake up any waiters waiting on the ring buffer for new data. The irq_work
was created for such a purpose, where doing the actual wake up at the
time of adding data is too dangerous, as an event or function trace may
be in the midst of the work queue locks and cause deadlocks. The irq_work
will either delay the action to the next timer interrupt, or trigger an IPI
to itself forcing an interrupt to do the work (in a safe location).
With irq_work, all ring buffer commits can safely do wakeups, removing
the need for the ring buffer commit "nowake" variants, which were used
by events and function tracing. All commits can now safely use the
normal commit, and the "nowake" variants can be removed.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing_enabled file was used as a quick way to stop
tracers, and try to bring down overhead for things like
the latency tracers (irqsoff, wakeup, etc). But it didn't
work that well.
The tracing_on file was created as a really fast way to
stop recording into the ftrace ring buffer and can interact
with the kernel. That is a tracing_off() call in the kernel
can disable recording of events, and then from userspace one
could echo 1 into the tracing_on file to continue it. The
tracing_enabled function did too much to allow for this.
The tracing_on has taken over as a way to start and stop tracing
and the tracing_enabled file should not be used. But because of
its existance, it still confuses people. Over a year ago the
following commit was added:
commit 6752ab4a9c
Author: Steven Rostedt <srostedt@redhat.com>
Date: Tue Feb 8 13:54:06 2011 -0500
tracing: Deprecate tracing_enabled for tracing_on
This commit added a WARN_ON() if the tracing_enabled file's variable
was changed. After this was added, only LatencyTop complained, and
they soon fixed their tool as there was no reason that LatencyTop
should touch this file as it was using the perf ring buffers which
this file does not interact with. But since that time no one else
has complained about this WARN_ON(). Thus it is safe to assume that
this file is no longer needed. Time to get rid of it.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing_enabled file has been deprecated as it never was able
to serve its purpose well. The tracing_on file has taken over.
Instead of having code to keep tracing_enabled, have the tracing_enabled
file just set tracing_on, and remove the tracing_enabled variable.
This allows us to remove the tracing_enabled file. The reason that
the remove is in a different change set and not removed here is
in case we find some lonely userspace tool that requires the file
to exist. Then the removal patch will get reverted, but this one
will not.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function register_tracer() is only used by kernel core code,
that never needs to remove the tracer. As trace_events have become
the main way to add new tracing to the kernel, the need to
unregister a tracer has diminished. Remove the unused function
unregister_tracer(). If a need arises where we need it, then we
can always add it back.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The open function used by available_events is the same as set_event even
though it uses different seq functions. This causes a side effect of
writing into available_events clearing all events, even though
available_events is suppose to be read only.
There's no reason to keep a single function for just the open and have
both use different functions for everything else. It is a little
confusing and causes strange behavior. Just have each have their own
function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ring_buffer_oldest_event_ts() should return a value of u64 type, because
ring_buffer_per_cpu->buffer_page->buffer_data_page->time_stamp is u64 type.
Link: http://lkml.kernel.org/r/1349998076-15495-5-git-send-email-dhsharp@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Because the "tsc" clock isn't in nanoseconds, the ring buffer must be
reset when changing clocks so that incomparable timestamps don't end up
in the same trace.
Tested: Confirmed switching clocks resets the trace buffer.
Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1349998076-15495-3-git-send-email-dhsharp@google.com
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions defined in include/trace/syscalls.h are not used directly
since struct ftrace_event_class was introduced. Remove them from the
header file and rearrange the ftrace_event_class declarations in
trace_syscalls.c.
Link: http://lkml.kernel.org/r/1339112785-21806-2-git-send-email-vnagarnaik@google.com
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove ftrace_format_syscall() declaration; it is neither defined nor
used. Also update a comment and formatting.
Link: http://lkml.kernel.org/r/1339112785-21806-1-git-send-email-vnagarnaik@google.com
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Whenever an event is registered, the comm of tasks are saved at
every task switch instead of saving them at every event. But if
an event isn't executed much, the comm cache will be filled up
by tasks that did not record the event and you lose out on the comms
that did.
Here's an example, if you enable the following events:
echo 1 > /debug/tracing/events/kvm/kvm_cr/enable
echo 1 > /debug/tracing/events/net/net_dev_xmit/enable
Note, there's no kvm running on this machine so the first event will
never be triggered, but because it is enabled, the storing of comms
will continue. If we now disable the network event:
echo 0 > /debug/tracing/events/net/net_dev_xmit/enable
and look at the trace:
cat /debug/tracing/trace
sshd-2672 [001] ..s2 375.731616: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.731617: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 375.859356: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.859357: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 375.947351: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.947352: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 376.035383: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 376.035383: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 377.563806: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=226 rc=0
sshd-2672 [001] ..s1 377.563807: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=226 rc=0
sshd-2672 [001] ..s2 377.563834: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6be0 len=114 rc=0
sshd-2672 [001] ..s1 377.563842: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6be0 len=114 rc=0
We see that process 2672 which triggered the events has the comm "sshd".
But if we run hackbench for a bit and look again:
cat /debug/tracing/trace
<...>-2672 [001] ..s2 375.731616: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.731617: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 375.859356: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.859357: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 375.947351: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.947352: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 376.035383: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 376.035383: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 377.563806: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=226 rc=0
<...>-2672 [001] ..s1 377.563807: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=226 rc=0
<...>-2672 [001] ..s2 377.563834: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6be0 len=114 rc=0
<...>-2672 [001] ..s1 377.563842: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6be0 len=114 rc=0
The stored "sshd" comm has been flushed out and we get a useless "<...>".
But by only storing comms after a trace event occurred, we can run
hackbench all day and still get the same output.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functon tracing_sched_wakeup_trace() does an open coded unlock
commit and save stack. This is what the trace_nowake_buffer_unlock_commit()
is for.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If comm recording is not enabled when trace_printk() is used then
you just get this type of output:
[ adding trace_printk("hello! %d", irq); in do_IRQ ]
<...>-2843 [001] d.h. 80.812300: do_IRQ: hello! 14
<...>-2734 [002] d.h2 80.824664: do_IRQ: hello! 14
<...>-2713 [003] d.h. 80.829971: do_IRQ: hello! 14
<...>-2814 [000] d.h. 80.833026: do_IRQ: hello! 14
By enabling the comm recorder when trace_printk is enabled:
hackbench-6715 [001] d.h. 193.233776: do_IRQ: hello! 21
sshd-2659 [001] d.h. 193.665862: do_IRQ: hello! 21
<idle>-0 [001] d.h1 193.665996: do_IRQ: hello! 21
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since tracing is not used by 99% of Linux users, even though tracing
may be configured in, it does not make sense to allocate 1.4 Megs
per CPU for the ring buffers if they are not used. Thus, on boot up
the ring buffers are set to a minimal size until something needs the
and they are expanded.
This works well for events and tracers (function, etc), but for the
asynchronous use of trace_printk() which can write to the ring buffer
at any time, does not expand the buffers.
On boot up a check is made to see if any trace_printk() is used to
see if the trace_printk() temp buffer pages should be allocated. This
same code can be used to expand the buffers as well.
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The existing 'overrun' counter is incremented when the ring
buffer wraps around, with overflow on (the default). We wanted
a way to count requests lost from the buffer filling up with
overflow off, too. I decided to add a new counter instead
of retro-fitting the existing one because it seems like a
different statistic to count conceptually, and also because
of how the code was structured.
Link: http://lkml.kernel.org/r/1310765038-26399-1-git-send-email-slavapestov@google.com
Signed-off-by: Slava Pestov <slavapestov@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
print_max and use_max_tr in struct tracer are "int" variables and
used like flags. This is wasteful, so change the type to "bool".
Link: http://lkml.kernel.org/r/20121002082710.9807.86393.stgit@falsita
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's times during debugging that it is helpful to see traces of early
boot functions. But the tracers are initialized at device_initcall()
which is quite late during the boot process. Setting the kernel command
line parameter ftrace=function will not show anything until the function
tracer is initialized. This prevents being able to trace functions before
device_initcall().
There's no reason that the tracers need to be initialized so late in the
boot process. Move them up to core_initcall() as they still need to come
after early_initcall() which initializes the tracing buffers.
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There don't have any 'r' prefix in uprobe event naming, remove it.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
With a system where, num_present_cpus < num_possible_cpus, even if all
CPUs are online, non-present CPUs don't have per_cpu buffers allocated.
If per_cpu/<cpu>/buffer_size_kb is modified for such a CPU, it can cause
a panic due to NULL dereference in ring_buffer_resize().
To fix this, resize operation is allowed only if the per-cpu buffer has
been initialized.
Link: http://lkml.kernel.org/r/1349912427-6486-1-git-send-email-vnagarnaik@google.com
Cc: stable@vger.kernel.org # 3.5+
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull virtio changes from Rusty Russell:
"New workflow: same git trees pulled by linux-next get sent straight to
Linus. Git is awkward at shuffling patches compared with quilt or mq,
but that doesn't happen often once things get into my -next branch."
* 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (24 commits)
lguest: fix occasional crash in example launcher.
virtio-blk: Disable callback in virtblk_done()
virtio_mmio: Don't attempt to create empty virtqueues
virtio_mmio: fix off by one error allocating queue
drivers/virtio/virtio_pci.c: fix error return code
virtio: don't crash when device is buggy
virtio: remove CONFIG_VIRTIO_RING
virtio: add help to CONFIG_VIRTIO option.
virtio: support reserved vqs
virtio: introduce an API to set affinity for a virtqueue
virtio-ring: move queue_index to vring_virtqueue
virtio_balloon: not EXPERIMENTAL any more.
virtio-balloon: dependency fix
virtio-blk: fix NULL checking in virtblk_alloc_req()
virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
virtio-blk: Add bio-based IO path for virtio-blk
virtio: console: fix error handling in init() function
tools: Fix pthread flag for Makefile of trace-agent used by virtio-trace
tools: Add guest trace agent as a user tool
virtio/console: Allocate scatterlist according to the current pipe size
...
and no longer use its debugfs knobs. The change slightly touches
kernel/trace directory, but it got the needed ack from Steven Rostedt:
http://lkml.org/lkml/2012/8/21/688
2. Added maintainers entry;
3. A bunch of fixes, nothing special.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJQbjoFAAoJEGgI9fZJve1bZgUP/A/ZwFGfdnochDgRhK5p7ljY
baRZpgSh2B+BxIDTEPLfVh6HbOivmYJ8WF0unD9kKTzCCS71ZUMiLB25G/bV4lnZ
fawAOhGfOLG3rXmldxf6nJllHr9JpoSmVEHypvjFNcbjYZ04zhe7jM+YsaWmBw68
eHXQkOSdfPPpKXZ2B0Eef/EoGWhORW0kTD7xFlorsxYAkksSheY0PC0nYgCFhvCZ
168y9pi4T4lucr4s44x8AJ/r/5BQ1jEQAY/A2qUE/iBfRFP4XyE1Oao4OHtVDYdU
KjVPA1VYmwkKSfnkiVFrpb/94IyrKslblgR8nX0kK3L/ccFYjQix4nd9jR+n857s
xfAuj9nfhUO6fI5qoaVSOBufxKyPp1S7X8INEAJ7WQ0c9VoMv00biK9M77ifDGZg
ll/Ecq1CADtcbOnQXf6qwGwRKmpR+qgPkIzpNXcuGMuM4AEPwtckOhCyXFr37Txk
6ZoGM8IIaBJ0yXxHkfpUA7l9ZF0gXR+qHMQCwpUS8tIMx35On+IbybEaKbniKEi1
AURgQ7ZimVYAHPi0Y0L00+EKI3IPVQJvCFH7SG+wUfLWcbEtNbTv3MAer5o3DANJ
GMnWBwNw9ClTydWKI0GMNmnWpFukWhd4OXleyl2+q4qRJi3HhNacrok3s/2r+CnT
QRg8i/0SDvxGuXazrTZT
=1HAE
-----END PGP SIGNATURE-----
Merge tag 'for-v3.7' of git://git.infradead.org/users/cbou/linux-pstore
Pull pstore changes from Anton Vorontsov:
1) We no longer ad-hoc to the function tracer "high level"
infrastructure and no longer use its debugfs knobs. The change
slightly touches kernel/trace directory, but it got the needed ack
from Steven Rostedt:
http://lkml.org/lkml/2012/8/21/688
2) Added maintainers entry;
3) A bunch of fixes, nothing special.
* tag 'for-v3.7' of git://git.infradead.org/users/cbou/linux-pstore:
pstore: Avoid recursive spinlocks in the oops_in_progress case
pstore/ftrace: Convert to its own enable/disable debugfs knob
pstore/ram: Add missing platform_device_unregister
MAINTAINERS: Add pstore maintainers
pstore/ram: Mark ramoops_pstore_write_buf() as notrace
pstore/ram: Fix printk format warning
pstore/ram: Fix possible NULL dereference
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
Use generic steal operation on pipe buffer to allow stealing
ring buffer's read page from pipe buffer.
Note that this could reduce the performance of splice on the
splice_write side operation without affinity setting.
Since the ring buffer's read pages are allocated on the
tracing-node, but the splice user does not always execute
splice write side operation on the same node. In this case,
the page will be accessed from the another node.
Thus, it is strongly recommended to assign the splicing
thread to corresponding node.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch splits trace event initialization in two stages:
* ftrace enable
* sysfs event entry creation
This allows to capture trace events from an earlier point
by using 'trace_event' kernel parameter and is important
to trace boot-up allocations.
Note that, in order to enable events at core_initcall,
it's necessary to move init_ftrace_syscalls() from
core_initcall to early_initcall.
Link: http://lkml.kernel.org/r/1347461277-25302-1-git-send-email-elezegarcia@gmail.com
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In our application, we have trace markers spread through user-space.
We have markers in GL, X, etc. These are super handy for Chrome's
about:tracing feature (Chrome + system + kernel trace view), but
can be very distracting when you're trying to debug a kernel issue.
I normally, use "grep -v tracing_mark_write" but it would be nice
if I could just temporarily disable markers all together.
Link: http://lkml.kernel.org/r/1347066739-26285-1-git-send-email-msb@chromium.org
CC: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- When tracing capture the kuid.
- When displaying the data to user space convert the kuid into the
user namespace of the process that opened the report file.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Commit 56449f437 "tracing: make the trace clocks available generally",
in April 2009, made trace_clock available unconditionally, since
CONFIG_X86_DS used it too.
Commit faa4602e47 "x86, perf, bts, mm: Delete the never used BTS-ptrace code",
in March 2010, removed CONFIG_X86_DS, and now only CONFIG_RING_BUFFER (split
out from CONFIG_TRACING for general use) has a dependency on trace_clock. So,
only compile in trace_clock with CONFIG_RING_BUFFER or CONFIG_TRACING
enabled.
Link: http://lkml.kernel.org/r/20120903024513.GA19583@leaf
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With this patch we no longer reuse function tracer infrastructure, now
we register our own tracer back-end via a debugfs knob.
It's a bit more code, but that is the only downside. On the bright side we
have:
- Ability to make persistent_ram module removable (when needed, we can
move ftrace_ops struct into a module). Note that persistent_ram is still
not removable for other reasons, but with this patch it's just one
thing less to worry about;
- Pstore part is more isolated from the generic function tracer. We tried
it already by registering our own tracer in available_tracers, but that
way we're loosing ability to see the traces while we record them to
pstore. This solution is somewhere in the middle: we only register
"internal ftracer" back-end, but not the "front-end";
- When there is only pstore tracing enabled, the kernel will only write
to the pstore buffer, omitting function tracer buffer (which, of course,
still can be enabled via 'echo function > current_tracer').
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
The function graph has a test to check if the frame pointer is
corrupted, which can happen with various options of gcc with mcount.
But this is not an issue with -mfentry as -mfentry does not need nor use
frame pointers for function graph tracing.
Link: http://lkml.kernel.org/r/20120807194059.773895870@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Thanks to Andi Kleen, gcc 4.6.0 now supports -mfentry for x86
(and hopefully soon for other archs). What this does is to have
the function profiler start at the beginning of the function
instead of after the stack is set up. As plain -pg (mcount) is
called after the stack is set up, and in some cases can have issues
with the function graph tracer. It also requires frame pointers to
be enabled.
The -mfentry now calls __fentry__ at the beginning of the function.
This allows for compiling without frame pointers and even has the
ability to access parameters if needed.
If the architecture and the compiler both support -mfentry then
use that instead.
Link: http://lkml.kernel.org/r/20120807194059.392617243@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
. Fix include order for bison/flex-generated C files, from Ben Hutchings
. Build fixes and documentation corrections from David Ahern
. Group parsing support, from Jiri Olsa
. UI/gtk refactorings and improvements from Namhyung Kim
. NULL deref fix for perf script, from Namhyung Kim
. Assorted cleanups from Robert Richter
. Let O= makes handle relative paths, from Steven Rostedt
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJQMkGhAAoJENZQFvNTUqpAqjsQAJE5iD1LFogC8o/WjvRHz0TY
Y0x+sR/XfW61KYpeq5g+UaKuFU3P44ijCoyks3y5sza97DkYgUwMpEHlLXFSM8Pp
sNOapqY57s24nq3MLrhH1V9w+cSE+m2u/Gi5fGLCQekio9gkOBwYxNGk7vpKri/n
LBRsMozBu/mZjMy20uWOb7Uk8xsAToh+TFaAtjyQ9Snn9nNJj49NUAp37uN888H/
ducMLq32HN5v/6Zd3q6IWdDWgZsHLkIa3R5FIs/GNe3Dih07gtYLmDol4ktPbTFm
yoaWpP5wbtu/62EZlJwE393vMuoeqN/96394ZZQGFafhHVxN4+rcBhXbejBs0T2b
wk/0CzntW8bbUAI/cl3SB9aui//FWOxcjG9aDQ7PsmHzPw1Q4VD0F9Mcod4p+dRX
PsA9q/tST1eAiwzWYthDtj81U7iChINcXKhoZn2xn6+0+aMH+6FFNBmCH8MR5aCU
BvrXhTJjvau/Ym/sILl4Tf4wfssTq49yMsn/YKCwLJ0hg0XlTObWfQRy2MOayXH9
NJvUE+9GSXoTEKhmr1AfTYEG9vObaXZyFwAI74xvPPwUYojCb4ZjEKmG0egW+VGk
IJKFCaJZwwVsGau4aIbFAMP12/L8Qs/Ox91ddCJ0j5TIlSGMaqW5lbV1N1crzlTT
a0GsN49NvhbFttBXrcNX
=0a2X
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Fix include order for bison/flex-generated C files, from Ben Hutchings
* Build fixes and documentation corrections from David Ahern
* Group parsing support, from Jiri Olsa
* UI/gtk refactorings and improvements from Namhyung Kim
* NULL deref fix for perf script, from Namhyung Kim
* Assorted cleanups from Robert Richter
* Let O= makes handle relative paths, from Steven Rostedt
* perf script python fixes, from Feng Tang.
* Improve 'perf lock' error message when the needed tracepoints
are not present, from David Ahern.
* Initial bash completion support, from Frederic Weisbecker
* Allow building without libelf, from Namhyung Kim.
* Support DWARF CFI based unwind to have callchains when %bp
based unwinding is not possible, from Jiri Olsa.
* Symbol resolution fixes, while fixing support PPC64 files with an .opt ELF
section was the end goal, several fixes for code that handles all
architectures and cleanups are included, from Cody Schafer.
* Add a description for the JIT interface, from Andi Kleen.
* Assorted fixes for Documentation and build in 32 bit, from Robert Richter
* Add support for non-tracepoint events in perf script python, from Feng Tang
* Cache the libtraceevent event_format associated to each evsel early, so that we
avoid relookups, i.e. calling pevent_find_event repeatedly when processing
tracepoint events.
[ This is to reduce the surface contact with libtraceevents and make clear what
is that the perf tools needs from that lib: so far parsing the common and per
event fields. ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
syscall_get_nr can return -1 in the case that the task is not executing
a system call.
This patch fixes perf_syscall_{enter,exit} to check that the syscall
number is valid before using it as an index into a bitmap.
Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com
Cc: Jason Baron <jbaron@redhat.com>
Cc: Wade Farnsworth <wade_farnsworth@mentor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add missing initialization for ret variable. Its initialization
is based on the re_cnt variable, which is being set deep down
in the ftrace_function_filter_re function.
I'm not sure compilers would be smart enough to see this in near
future, so killing the warning this way.
Link: http://lkml.kernel.org/r/1340120894-9465-2-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The warkeup_rt self test used msleep() calls to wait for real time
tasks to wake up and run. On bare-metal hardware, this was enough as
the scheduler should let the RT task run way before the non-RT task
wakes up from the msleep(). If it did not, then that would mean the
scheduler was broken.
But when dealing with virtual machines, this is a different story.
If the RT task wakes up on a VCPU, it's up to the host to decide when
that task gets to schedule, which can be far behind the time that the
non-RT task wakes up. In this case, the test would fail incorrectly.
As we are not testing the scheduler, but instead the wake up tracing,
we can use completions to wait and not depend on scheduler timings
to see if events happen on time.
Link: http://lkml.kernel.org/r/1343663105.3847.7.camel@fedora
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"Fix merge window fallout and fix sleep profiling (this was always
broken, so it's not a fix for the merge window - we can skip this one
from the head of the tree)."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/trace: Add ability to set a target task for events
perf/x86: Fix USER/KERNEL tagging of samples properly
perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bit
A few events are interesting not only for a current task.
For example, sched_stat_* events are interesting for a task
which wakes up. For this reason, it will be good if such
events will be delivered to a target task too.
Now a target task can be set by using __perf_task().
The original idea and a draft patch belongs to Peter Zijlstra.
I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Inspired-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a new filter update interface ftrace_set_filter_ip()
to set ftrace filter by ip address, not only glob pattern.
Link: http://lkml.kernel.org/r/20120605102808.27845.67952.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add selftests to test the save-regs functionality of ftrace.
If the arch supports saving regs, then it will make sure that regs is
at least not NULL in the callback.
If the arch does not support saving regs, it makes sure that the
registering of the ftrace_ops that requests saving regs fails.
It then tests the registering of the ftrace_ops succeeds if the
'IF_SUPPORTED' flag is set. Then it makes sure that the regs passed to
the function is NULL.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add selftests to test the function tracing recursion protection actually
does work. It also tests if a ftrace_ops states it will perform its own
protection. Although, even if the ftrace_ops states it will protect itself,
the ftrace infrastructure may still provide protection if the arch does
not support all features or another ftrace_ops is registered.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As more users of the function tracer utility are being added, they do
not always add the necessary recursion protection. To protect from
function recursion due to tracing, if the callback ftrace_ops does not
specifically specify that it protects against recursion (by setting
the FTRACE_OPS_FL_RECURSION_SAFE flag), the list operation will be
called by the mcount trampoline which adds recursion protection.
If the flag is set, then the function will be called directly with no
extra protection.
Note, the list operation is called if more than one function callback
is registered, or if the arch does not support all of the function
tracer features.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Here's the big staging tree merge for the 3.6-rc1 merge window.
There are some patches in here outside of drivers/staging/, notibly the iio
code (which is still stradeling the staging / not staging boundry), the pstore
code, and the tracing code. All of these have gotten ackes from the various
subsystem maintainers to be included in this tree. The pstore and tracing
patches are related, and are coming here as they replace one of the android
staging drivers.
Otherwise, the normal staging mess. Lots of cleanups and a few new drivers
(some iio drivers, and the large csr wireless driver abomination.)
Note, you will get a merge issue with the following files:
drivers/staging/comedi/drivers/s626.h
drivers/staging/gdm72xx/netlink_k.c
both of which should be trivial for you to handle.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iEYEABECAAYFAlAQiD8ACgkQMUfUDdst+ykxhgCeMUjvc+1RTtSprzvkzpejgoUU
6A4AnAleWMnkaCD8vruGnRdGl/Qtz51+
=mN6M
-----END PGP SIGNATURE-----
Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging tree patches from Greg Kroah-Hartman:
"Here's the big staging tree merge for the 3.6-rc1 merge window.
There are some patches in here outside of drivers/staging/, notibly
the iio code (which is still stradeling the staging / not staging
boundry), the pstore code, and the tracing code. All of these have
gotten acks from the various subsystem maintainers to be included in
this tree. The pstore and tracing patches are related, and are coming
here as they replace one of the android staging drivers.
Otherwise, the normal staging mess. Lots of cleanups and a few new
drivers (some iio drivers, and the large csr wireless driver
abomination.)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fixed up trivial conflicts in drivers/staging/comedi/drivers/s626.h and
drivers/staging/gdm72xx/netlink_k.c
* tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1108 commits)
staging: csr: delete a bunch of unused library functions
staging: csr: remove csr_utf16.c
staging: csr: remove csr_pmem.h
staging: csr: remove CsrPmemAlloc
staging: csr: remove CsrPmemFree()
staging: csr: remove CsrMemAllocDma()
staging: csr: remove CsrMemCalloc()
staging: csr: remove CsrMemAlloc()
staging: csr: remove CsrMemFree() and CsrMemFreeDma()
staging: csr: remove csr_util.h
staging: csr: remove CsrOffSetOf()
stating: csr: remove unneeded #includes in csr_util.c
staging: csr: make CsrUInt16ToHex static
staging: csr: remove CsrMemCpy()
staging: csr: remove CsrStrLen()
staging: csr: remove CsrVsnprintf()
staging: csr: remove CsrStrDup
staging: csr: remove CsrStrChr()
staging: csr: remove CsrStrNCmp
staging: csr: remove CsrStrCmp
...
Add a way to have different functions calling different trampolines.
If a ftrace_ops wants regs saved on the return, then have only the
functions with ops registered to save regs. Functions registered by
other ops would not be affected, unless the functions overlap.
If one ftrace_ops registered functions A, B and C and another ops
registered fucntions to save regs on A, and D, then only functions
A and D would be saving regs. Function B and C would work as normal.
Although A is registered by both ops: normal and saves regs; this is fine
as saving the regs is needed to satisfy one of the ops that calls it
but the regs are ignored by the other ops function.
x86_64 implements the full regs saving, and i386 just passes a NULL
for regs to satisfy the ftrace_ops passing. Where an arch must supply
both regs and ftrace_ops parameters, even if regs is just NULL.
It is OK for an arch to pass NULL regs. All function trace users that
require regs passing must add the flag FTRACE_OPS_FL_SAVE_REGS when
registering the ftrace_ops. If the arch does not support saving regs
then the ftrace_ops will fail to register. The flag
FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED may be set that will prevent the
ftrace_ops from failing to register. In this case, the handler may
either check if regs is not NULL or check if ARCH_SUPPORTS_FTRACE_SAVE_REGS.
If the arch supports passing regs it will set this macro and pass regs
for ops that request them. All other archs will just pass NULL.
Link: Link: http://lkml.kernel.org/r/20120711195745.107705970@goodmis.org
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Return as the 4th paramater to the function tracer callback the pt_regs.
Later patches that implement regs passing for the architectures will require
having the ftrace_ops set the SAVE_REGS flag, which will tell the arch
to take the time to pass a full set of pt_regs to the ftrace_ops callback
function. If the arch does not support it then it should pass NULL.
If an arch can pass full regs, then it should define:
ARCH_SUPPORTS_FTRACE_SAVE_REGS to 1
Link: http://lkml.kernel.org/r/20120702201821.019966811@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the function tracer starts to get more features, the support for
theses features will spread out throughout the different architectures
over time. These features boil down to what each arch does in the
mcount trampoline (the ftrace_caller).
Currently there's two features that are not the same throughout the
archs.
1) Support to stop function tracing before the callback
2) passing of the ftrace ops
Both of these require placing an indirect function to support the
features if the mcount trampoline does not.
On a side note, for all architectures, when more than one callback
is registered to the function tracer, an intermediate 'list' function
is called by the mcount trampoline to iterate through the callbacks
that are registered.
Instead of making a separate function for each of these features,
and requiring several indirect calls, just use the single 'list' function
as the intermediate, to handle all cases. If an arch does not support
the 'stop function tracing' or the passing of ftrace ops, just force
it to use the list function that will handle the features required.
This makes the code cleaner and simpler and removes a lot of
#ifdefs in the code.
Link: http://lkml.kernel.org/r/20120612225424.495625483@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the function trace callback receives only the ip and parent_ip
of the function that it traced. It would be more powerful to also return
the ops that registered the function as well. This allows the same function
to act differently depending on what ftrace_ops registered it.
Link: http://lkml.kernel.org/r/20120612225424.267254552@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the function accepts just one bit, we can use the switch
construction instead of if/else if/...
Just a cosmetic change, there should be no functional changes.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch introduces 'func_ptrace' option, now available in
/sys/kernel/debug/tracing/options when function tracer
is selected.
The patch also adds some tiny code that calls back to pstore
to record the trace. The callback is no-op when PSTORE=n.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If tracer->init() fails, current code will leave current_tracer pointing
to an unusable tracer, which at best makes 'current_tracer' report
inaccurate value.
Fix the issue by pointing current_tracer to nop tracer, and only update
current_tracer with the new one after all the initialization succeeds.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull RCU, perf, and scheduler fixes from Ingo Molnar.
The RCU fix is a revert for an optimization that could cause deadlocks.
One of the scheduler commits (164c33c6ad "sched: Fix fork() error path
to not crash") is correct but not complete (some architectures like Tile
are not covered yet) - the resulting additional fixes are still WIP and
Ingo did not want to delay these pending fixes. See this thread on
lkml:
[PATCH] fork: fix error handling in dup_task()
The perf fixes are just trivial oneliners.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf kvm: Fix segfault with report and mixed guestmount use
perf kvm: Fix regression with guest machine creation
perf script: Fix format regression due to libtraceevent merge
ring-buffer: Fix accounting of entries when removing pages
ring-buffer: Fix crash due to uninitialized new_pages list head
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS/sched: Update scheduler file pattern
sched/nohz: Rewrite and fix load-avg computation -- again
sched: Fix fork() error path to not crash
Clean up and return -ENOMEM on if the kzalloc() fails.
This also prevents a potential crash, as the pointer that failed to
allocate would be later used.
Link: http://lkml.kernel.org/r/20120711063507.GF11812@elgon.mountain
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull block bits from Jens Axboe:
"As vacation is coming up, thought I'd better get rid of my pending
changes in my for-linus branch for this iteration. It contains:
- Two patches for mtip32xx. Killing a non-compliant sysfs interface
and moving it to debugfs, where it belongs.
- A few patches from Asias. Two legit bug fixes, and one killing an
interface that is no longer in use.
- A patch from Jan, making the annoying partition ioctl warning a bit
less annoying, by restricting it to !CAP_SYS_RAWIO only.
- Three bug fixes for drbd from Lars Ellenberg.
- A fix for an old regression for umem, it hasn't really worked since
the plugging scheme was changed in 3.0.
- A few fixes from Tejun.
- A splice fix from Eric Dumazet, fixing an issue with pipe
resizing."
* 'for-linus' of git://git.kernel.dk/linux-block:
scsi: Silence unnecessary warnings about ioctl to partition
block: Drop dead function blk_abort_queue()
block: Mitigate lock unbalance caused by lock switching
block: Avoid missed wakeup in request waitqueue
umem: fix up unplugging
splice: fix racy pipe->buffers uses
drbd: fix null pointer dereference with on-congestion policy when diskless
drbd: fix list corruption by failing but already aborted reads
drbd: fix access of unallocated pages and kernel panic
xen/blkfront: Add WARN to deal with misbehaving backends.
blkcg: drop local variable @q from blkg_destroy()
mtip32xx: Create debugfs entries for troubleshooting
mtip32xx: Remove 'registers' and 'flags' from sysfs
blkcg: fix blkg_alloc() failure path
block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
block: fix return value on cfq_init() failure
mtip32xx: Remove version.h header file inclusion
xen/blkback: Copy id field when doing BLKIF_DISCARD.
When removing pages from the ring buffer, its state is not reset. This
means that the counters need to be correctly updated to account for the
pages removed.
Update the overrun counter to reflect the removed events from the pages.
Link: http://lkml.kernel.org/r/1340998301-1715-1-git-send-email-vnagarnaik@google.com
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new_pages list head in the cpu_buffer is not initialized. When
adding pages to the ring buffer, if the memory allocation fails in
ring_buffer_resize, the clean up handler tries to free up the allocated
pages from all the cpu buffers. The panic is caused by referencing the
uninitialized new_pages list head.
Initializing the new_pages list head in rb_allocate_cpu_buffer fixes
this.
Link: http://lkml.kernel.org/r/1340391005-10880-1-git-send-email-vnagarnaik@google.com
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>