Follow-up commit to 1e33759c78 ("bpf, trace: add BPF_F_CURRENT_CPU
flag for bpf_perf_event_output") to add the same functionality into
bpf_perf_event_read() helper. The split of index into flags and index
component is also safe here, since such large maps are rejected during
map allocation time.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently have two invocations, which is unnecessary. Fetch it only
once and use the smp_processor_id() variant, so we also get preemption
checks along with it when DEBUG_PREEMPT is set.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some minor cleanups: i) Remove the unlikely() from fd array map lookups
and let the CPU branch predictor do its job, scenarios where there is not
always a map entry are very well valid. ii) Move the attribute type check
in the bpf_perf_event_read() helper a bit earlier so it's consistent wrt
checks with bpf_perf_event_output() helper as well. iii) remove some
comments that are self-documenting in kprobe_prog_is_valid_access() and
therefore make it consistent to tp_prog_is_valid_access() as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"I've been traveling so this accumulates more than week or so of bug
fixing. It perhaps looks a little worse than it really is.
1) Fix deadlock in ath10k driver, from Ben Greear.
2) Increase scan timeout in iwlwifi, from Luca Coelho.
3) Unbreak STP by properly reinjecting STP packets back into the
stack. Regression fix from Ido Schimmel.
4) Mediatek driver fixes (missing malloc failure checks, leaking of
scratch memory, wrong indexing when mapping TX buffers, etc.) from
John Crispin.
5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
Sowa.
6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su.
7) Fix netlink notifications in ovs for tunnels, delete link messages
are never emitted because of how the device registry state is
handled. From Nicolas Dichtel.
8) Conntrack module leaks kmemcache on unload, from Florian Westphal.
9) Prevent endless jump loops in nft rules, from Liping Zhang and
Pablo Neira Ayuso.
10) Not early enough spinlock initialization in mlx4, from Eric
Dumazet.
11) Bind refcount leak in act_ipt, from Cong WANG.
12) Missing RCU locking in HTB scheduler, from Florian Westphal.
13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
barrier, using heap for SG and IV, and erroneous use of async flag
when allocating AEAD conext.)
14) RCU handling fix in TIPC, from Ying Xue.
15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
SIT driver, from Simon Horman.
16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.
17) Fix potential deadlock in team enslave, from Ido Schimmel.
18) Memory leak in KCM procfs handling, from Jiri Slaby.
19) ESN generation fix in ipv4 ESP, from Herbert Xu.
20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
WANG.
21) Use after free in netem, from Eric Dumazet.
22) Uninitialized last assert time in multicast router code, from Tom
Goff.
23) Skip raw sockets in sock_diag destruction broadcast, from Willem
de Bruijn.
24) Fix link status reporting in thunderx, from Sunil Goutham.
25) Limit resegmentation of retransmit queue so that we do not
retransmit too large GSO frames. From Eric Dumazet.
26) Delay bpf program release after grace period, from Daniel
Borkmann"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
openvswitch: fix conntrack netlink event delivery
qed: Protect the doorbell BAR with the write barriers.
neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
e1000e: keep VLAN interfaces functional after rxvlan off
cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
bpf, perf: delay release of BPF prog after grace period
net: bridge: fix vlan stats continue counter
tcp: do not send too big packets at retransmit time
ibmvnic: fix to use list_for_each_safe() when delete items
net: thunderx: Fix TL4 configuration for secondary Qsets
net: thunderx: Fix link status reporting
net/mlx5e: Reorganize ethtool statistics
net/mlx5e: Fix number of PFC counters reported to ethtool
net/mlx5e: Prevent adding the same vxlan port
net/mlx5e: Check for BlueFlame capability before allocating SQ uar
net/mlx5e: Change enum to better reflect usage
net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
net/mlx5: Update command strings
net: marvell: Add separate config ANEG function for Marvell 88E1111
...
Function graph tracer currently ignores filters if tracing_thresh is set.
For example, even if set_ftrace_pid is set, then its ignored if tracing_thresh
set, resulting in all processes being traced.
To fix this, we reuse the same entry function as when tracing_thresh is not
set and do everything as in the regular case except for writing the function entry
to the ring buffer.
Link: http://lkml.kernel.org/r/1466228694-2677-1-git-send-email-agnel.joel@gmail.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Joel Fernandes <agnel.joel@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
# echo 1 > options/stacktrace
# echo 1 > events/sched/sched_switch/enable
# cat trace
<idle>-0 [002] d..2 1982.525169: <stack trace>
=> save_stack_trace
=> __ftrace_trace_stack
=> trace_buffer_unlock_commit_regs
=> event_trigger_unlock_commit
=> trace_event_buffer_commit
=> trace_event_raw_event_sched_switch
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> start_secondary
The above shows that we are seeing 6 functions before ever making it to the
caller of the sched_switch event.
# echo stacktrace > events/sched/sched_switch/trigger
# cat trace
<idle>-0 [002] d..3 2146.335208: <stack trace>
=> trace_event_buffer_commit
=> trace_event_raw_event_sched_switch
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> start_secondary
The stacktrace trigger isn't as bad, because it adds its own skip to the
stacktracing, but still has two events extra.
One issue is that if the stacktrace passes its own "regs" then there should
be no addition to the skip, as the regs will not include the functions being
called. This was an issue that was fixed by commit 7717c6be69 ("tracing:
Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()" as adding
the skip number for kprobes made the probes not have any stack at all.
But since this is only an issue when regs is being used, a skip should be
added if regs is NULL. Now we have:
# echo 1 > options/stacktrace
# echo 1 > events/sched/sched_switch/enable
# cat trace
<idle>-0 [000] d..2 1297.676333: <stack trace>
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> rest_init
=> start_kernel
=> x86_64_start_reservations
=> x86_64_start_kernel
# echo stacktrace > events/sched/sched_switch/trigger
# cat trace
<idle>-0 [002] d..3 1370.759745: <stack trace>
=> __schedule
=> schedule
=> schedule_preempt_disabled
=> cpu_startup_entry
=> start_secondary
And kprobes are not touched.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Previously, mmio_print_pcidev() put "user" addresses in the trace buffer.
On most architectures, these are the same as CPU physical addresses, but on
microblaze, mips, powerpc, and sparc, they may be something else, typically
a raw BAR value (a bus address as opposed to a CPU address).
Always expose the CPU physical address to avoid this arch-dependent
behavior.
This change should have no user-visible effect because this file currently
depends on CONFIG_HAVE_MMIOTRACE_SUPPORT, which is only defined for x86,
and pci_resource_to_user() is a no-op on x86.
Link: http://lkml.kernel.org/r/20160511190657.5898.4248.stgit@bhelgaas-glaptop2.roam.corp.google.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Because tracepoint callbacks are done with preemption enabled, the trace
events are always called with preempt disable due to the
rcu_read_lock_sched_notrace() in __DO_TRACE(). This causes the preempt count
shown in the recorded trace event to be inaccurate. It is always one more
that what the preempt_count was when the tracepoint was called.
If CONFIG_PREEMPT is enabled, subtract 1 from the preempt_count before
recording it in the trace buffer.
Link: http://lkml.kernel.org/r/20160525132537.GA10808@linutronix.de
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the trace_printk code chooses which static buffer to use based
on what type of atomic context (NMI, IRQ, etc) it's in. Simplify the
code and make it more robust: simply count the nesting depth and choose
a buffer based on the current nesting depth.
The new code will only drop an event if we nest more than 4 deep,
and the old code was guaranteed to malfunction if that happened.
Link: http://lkml.kernel.org/r/07ab03aecfba25fcce8f9a211b14c9c5e2865c58.1464289095.git.luto@kernel.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace is very quick to give up on saving the task command line (see
`trace_save_cmdline()`). The workaround for events which really care
about the command line is to explicitly assign it as part of the entry.
However, this doesn't work for kprobe events, as there's no
straightforward way to get access to current->comm. Add a kprobe/uprobe
event variable $comm which provides exactly that.
Link: http://lkml.kernel.org/r/f59b472033b943a370f5f48d0af37698f409108f.1465435894.git.osandov@fb.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Convert set_ftrace_pid to use the bitmap like set_event_pid does. This
allows for instances to use the pid filtering as well, and will allow for
function-fork option to set if the children of a traced function should be
traced or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The addition of PIDs into a pid_list via the write operation of
set_event_pid is a bit complex. The same operation will be needed for
function tracing pids. Move the code into its own generic function in
trace.c, so that we can avoid duplication of this code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To allow other aspects of ftrace to use the pid_list logic, we need to reuse
the seq_file functions. Making the generic part into functions that can be
called by other files will help in this regard.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the filtered_pid functions are going to be used by function tracer as
well as trace_events, move the code into the generic trace.c file.
The functions moved are:
trace_find_filtered_pid()
trace_ignore_this_task()
trace_filter_add_remove_task()
Kernel Doc text was also added.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make the functions used for pid filtering global for tracing, such that the
function tracer can use the pid code as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a task uses a non constant string for the format parameter in
trace_printk(), then the trace_printk_fmt variable is set to NULL. This
variable is then saved in the __trace_printk_fmt section.
The function hold_module_trace_bprintk_format() checks to see if duplicate
formats are used by modules, and reuses them if so (saves them to the list
if it is new). But this function calls lookup_format() that does a strcmp()
to the value (which is now NULL) and can cause a kernel oops.
This wasn't an issue till 3debb0a9dd ("tracing: Fix trace_printk() to print
when not using bprintk()") which added "__used" to the trace_printk_fmt
variable, and before that, the kernel simply optimized it out (no NULL value
was saved).
The fix is simply to handle the NULL pointer in lookup_format() and have the
caller ignore the value if it was NULL.
Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com
Reported-by: xingzhen <zhengjun.xing@intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Fixes: 3debb0a9dd ("tracing: Fix trace_printk() to print when not using bprintk()")
Cc: stable@vger.kernel.org # v3.5+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The blktrace code stores the current time in a 32-bit word in its
user interface. This is a bad idea because 32-bit seconds overflow
at some point.
We probably have until 2106 before this one overflows, as it seems
to use an 'unsigned' variable, but we should confirm that user
space treats it the same way.
Aside from this, we want to stop using 'struct timespec' here,
so I'm adding a comment about the overflow and change the code
to use timespec64 instead to make the loss of range more obvious.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34f ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.
Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c65 ("bpf: fix clearing on persistent program
array maps").
To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
similar to bpf_perf_event_output() the bpf_perf_event_read() helper
needs to check the type of the perf_event before reading the counter.
Fixes: a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ctx structure passed into bpf programs is different depending on bpf
program type. The verifier incorrectly marked ctx->data and ctx->data_end
access based on ctx offset only. That caused loads in tracing programs
int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
to be incorrectly marked as PTR_TO_PACKET which later caused verifier
to reject the program that was actually valid in tracing context.
Fix this by doing program type specific matching of ctx offsets.
Fixes: 969bf05eb3 ("bpf: direct packet access")
Reported-by: Sasha Goldshtein <goldshtn@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
In bpf_perf_event_read() and bpf_perf_event_output(), we must use
READ_ONCE() for fetching the struct file pointer, which could get
updated concurrently, so we must prevent the compiler from potential
refetching.
We already do this with tail calls for fetching the related bpf_prog,
but not so on stored perf events. Semantics for both are the same
with regards to updates.
Fixes: a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
Fixes: 35578d7984 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Have blktrace use the req/bio op accessor to get the REQ_OP.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
1) I forgot that I had another selftest to stress test the ftrace
instance creation. It was actually suppose to go into the 4.6
merge window, but I never committed it. I almost forgot about it
again, but noticed it was missing from your tree.
2) Soumya PN sent me a clean up patch to not disable interrupts when
taking the tasklist_lock for read, as it's unnecessary because
that lock is never taken for write in irq context.
3) Newer gcc's can cause the jump in the function_graph code to the
global ftrace_stub label to be a short jump instead of a long one.
As that jump is dynamically converted to jump to the trace code to
do function graph tracing, and that conversion expects a long jump
it can corrupt the ftrace_stub itself (it's directly after that call).
One way to prevent gcc from using a short jump is to declare the
ftrace_stub as a weak function, which we do here to keep gcc from
optimizing too much.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXQhYQAAoJEKKk/i67LK/82pAH/3XzRCP366HqWnKdvluPB8vX
UnVoXGAX1Eh2ZpvlPIJBXNYOZlnGRMMMAoeI+su31FoJHrzTzfGXvRynTkZPFZtd
XakvHfACjtGtvi2MuCN1t9/d1ty/ob2o05KB9qc+JRlzHM09qTL/HX8hwZeEsMQ4
NYgEY4Y727LOSCrJieLktchpwtie77q8Wq25oiWIVWOyDjpCsPnZyaOqaQSANot9
Gd00cixbMam7Ba1BjoRsRQZaT2pYZ8vt7HDXDBfAOW1oOjalWARLhRg/zww1V3WD
DEptuEeyAgMJS3v76Z6Sbk/QM7hyGUWCcmC2qaN1yc2n1Sh+zBOiN1eyiiUh/2U=
=ERxv
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull motr tracing updates from Steven Rostedt:
"Three more changes.
- I forgot that I had another selftest to stress test the ftrace
instance creation. It was actually suppose to go into the 4.6
merge window, but I never committed it. I almost forgot about it
again, but noticed it was missing from your tree.
- Soumya PN sent me a clean up patch to not disable interrupts when
taking the tasklist_lock for read, as it's unnecessary because that
lock is never taken for write in irq context.
- Newer gcc's can cause the jump in the function_graph code to the
global ftrace_stub label to be a short jump instead of a long one.
As that jump is dynamically converted to jump to the trace code to
do function graph tracing, and that conversion expects a long jump
it can corrupt the ftrace_stub itself (it's directly after that
call). One way to prevent gcc from using a short jump is to
declare the ftrace_stub as a weak function, which we do here to
keep gcc from optimizing too much"
* tag 'trace-v4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/x86: Set ftrace_stub to weak to prevent gcc from using short jumps to it
ftrace: Don't disable irqs when taking the tasklist_lock read_lock
ftracetest: Add instance created, delete, read and enable event test
In ftrace.c inside the function alloc_retstack_tasklist() (which will be
invoked when function_graph tracing is on) the tasklist_lock is being
held as reader while iterating through a list of threads. Here the lock
is being held as reader with irqs disabled. The tasklist_lock is never
write_locked in interrupt context so it is safe to not disable interrupts
for the duration of read_lock in this block which, can be significant,
given the block of code iterates through all threads. Hence changing the
code to call read_lock() and read_unlock() instead of read_lock_irqsave()
and read_unlock_irqrestore().
A similar change was made in commits: 8063e41d2f ("tracing: Change
syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread()")'
and 3472eaa1f1 ("sched: normalize_rt_tasks(): Don't use _irqsave for
tasklist_lock, use task_rq_lock()")'
Link: http://lkml.kernel.org/r/1463500874-77480-1-git-send-email-soumya.p.n@hpe.com
Signed-off-by: Soumya PN <soumya.p.n@hpe.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
- Check periodically the coherent platform function's state from Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
workaround, and a kconfig dependency fix."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
/AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
/ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
udQES+t3F/9gvtxgxtDe
=YvyQ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
Bauermann, Valentin Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan
Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
from Guilherme G Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme
G Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica
Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe
Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic
Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian
Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs
from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled
from Ian Munsie
- Check periodically the coherent platform function's state from
Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
an erratum workaround, and a kconfig dependency fix."
* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
powerpc/86xx: Fix PCI interrupt map definition
powerpc/86xx: Move pci1 definition to the include file
powerpc/fsl: Fix build of the dtb embedded kernel images
powerpc/fsl: Fix rcpm compatible string
powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
powerpc/fsl-pci: Add a workaround for PCI 5 errata
powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
powerpc/powernv/npu: Add PE to PHB's list
powerpc/powernv: Fix insufficient memory allocation
powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
powerpc/powernv/npu: Enable NVLink pass through
powerpc/powernv/npu: Rework TCE Kill handling
powerpc/powernv/npu: Add set/unset window helpers
powerpc/powernv/ioda2: Export debug helper pe_level_printk()
...
1) With the changing of the code for filtering events by pid, from
a list of pids to a bitmask, we can now easily implement following
forks. With a new tracing option "event-fork" which, when set, will
have tasks with pids in set_event_pid, when they fork, to have their
child pids added to set_event_pid and the child will be traced as well.
Note, if "event-fork" is set and a task with its pid in set_event_pid
exits, its pid will be removed from set_event_pid
2) The addition of Tom Zanussi's hist triggers. This includes a very
thorough documentatino on how to use the hist triggers with events.
This introduces a quick and easy way to get histogram data from
events and their fields.
Some other cleanups and updates were added as well. Like Masami Hiramatsu
added test cases for the event trigger and hist triggers. Also I added
a speed up of filtering by using a temp buffer when filters are set.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXPIv1AAoJEKKk/i67LK/8WZcIAIaaHJMctDCfXPg8OoT1LLI/
yUxgWvQRM7iwGV8YjuaXlyxTDJU0XVoNpPF5ZGiePlRDSCUboNvgcNVHRusJJKqM
oV1BTsq2x5eY12agA8kSOHcqGP7saqa2H+RJ4+3jNB/DTtOwJ8RzodlqWQ7PZbRG
0IDvD7buh9NeDS2am835RB+Xhy/jNBrkoJjpvMNaG5nZypsMq8D524RzyBm6RYjp
p+KLo3/yDc0+khv1hIs1c/w+LXNs7XtpPjpAKBa8B4xOiXndh3IosjX3JnL+0f+6
EvXt6qRfBKCE5o2BM397qjE3V/L0/SfzTijuL1WMd88ZvPGqwcsslQekmxKAb1E=
=WBTB
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This includes two new updates for the ftrace infrastructure.
- With the changing of the code for filtering events by pid, from a
list of pids to a bitmask, we can now easily implement following
forks. With a new tracing option "event-fork" which, when set,
will have tasks with pids in set_event_pid, when they fork, to have
their child pids added to set_event_pid and the child will be
traced as well.
Note, if "event-fork" is set and a task with its pid in
set_event_pid exits, its pid will be removed from set_event_pid
- The addition of Tom Zanussi's hist triggers. This includes a very
thorough documentatino on how to use the hist triggers with events.
This introduces a quick and easy way to get histogram data from
events and their fields.
Some other cleanups and updates were added as well. Like Masami
Hiramatsu added test cases for the event trigger and hist triggers.
Also I added a speed up of filtering by using a temp buffer when
filters are set"
* tag 'trace-v4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (45 commits)
tracing: Use temp buffer when filtering events
tracing: Remove TRACE_EVENT_FL_USE_CALL_FILTER logic
tracing: Remove unused function trace_current_buffer_lock_reserve()
tracing: Remove one use of trace_current_buffer_lock_reserve()
tracing: Have trace_buffer_unlock_commit() call the _regs version with NULL
tracing: Remove unused function trace_current_buffer_discard_commit()
tracing: Move trace_buffer_unlock_commit{_regs}() to local header
tracing: Fold filter_check_discard() into its only user
tracing: Make filter_check_discard() local
tracing: Move event_trigger_unlock_commit{_regs}() to local header
tracing: Don't use the address of the buffer array name in copy_from_user
tracing: Handle tracing_map_alloc_elts() error path correctly
tracing: Add check for NULL event field when creating hist field
tracing: checking for NULL instead of IS_ERR()
tracing: Do not inherit event-fork option for instances
tracing: Fix unsigned comparison to zero in hist trigger code
kselftests/ftrace: Add a test for log2 modifier of hist trigger
tracing: Add hist trigger 'log2' modifier
kselftests/ftrace: Add hist trigger testcases
kselftests/ftrace : Add event trigger testcases
...
Pull livepatching updates from Jiri Kosina:
- remove of our own implementation of architecture-specific relocation
code and leveraging existing code in the module loader to perform
arch-dependent work, from Jessica Yu.
The relevant patches have been acked by Rusty (for module.c) and
Heiko (for s390).
- live patching support for ppc64le, which is a joint work of Michael
Ellerman and Torsten Duwe. This is coming from topic branch that is
share between livepatching.git and ppc tree.
- addition of livepatching documentation from Petr Mladek
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: make object/func-walking helpers more robust
livepatch: Add some basic livepatch documentation
powerpc/livepatch: Add live patching support on ppc64le
powerpc/livepatch: Add livepatch stack to struct thread_info
powerpc/livepatch: Add livepatch header
livepatch: Allow architectures to specify an alternate ftrace location
ftrace: Make ftrace_location_range() global
livepatch: robustify klp_register_patch() API error checking
Documentation: livepatch: outline Elf format and requirements for patch modules
livepatch: reuse module loader code to write relocations
module: s390: keep mod_arch_specific for livepatch modules
module: preserve Elf information for livepatch modules
Elf: add livepatch-specific Elf constants
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
Pull core block layer updates from Jens Axboe:
"This is the core block IO changes for this merge window. Nothing
earth shattering in here, it's mostly just fixes. In detail:
- Fix for a long standing issue where wrong ordering in blk-mq caused
order_to_size() to spew a warning. From Bart.
- Async discard support from Christoph. Basically just splitting our
sync interface into a submit + wait part.
- Add a cleaner interface for flagging whether a device has a write
back cache or not. We've previously overloaded blk_queue_flush()
with this, but let's make it more explicit. Drivers cleaned up and
updated in the drivers pull request. From me.
- Fix for a double check for whether IO accounting is enabled or not.
From Michael Callahan.
- Fix for the async discard from Mike Snitzer, reinstating the early
EOPNOTSUPP return if the device doesn't support discards.
- Also from Mike, export bio_inc_remaining() so dm can drop it's
private copy of it.
- From Ming Lin, add support for passing in an offset for request
payloads.
- Tag function export from Sagi, which will be used in NVMe in the
drivers pull.
- Two blktrace related fixes from Shaohua.
- Propagate NOMERGE flag when making a request from a bio, also from
Shaohua.
- An optimization to not parse cgroup paths in blk-throttle, if we
don't need to. From Shaohua"
* 'for-4.7/core' of git://git.kernel.dk/linux-block:
blk-mq: fix undefined behaviour in order_to_size()
blk-throttle: don't parse cgroup path if trace isn't enabled
blktrace: add missed mask name
blktrace: delete garbage for message trace
block: make bio_inc_remaining() interface accessible again
block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard
block: Minor blk_account_io_start usage cleanup
block: add __blkdev_issue_discard
block: remove struct bio_batch
block: copy NOMERGE flag from bio to request
block: add ability to flag write back caching on a device
blk-mq: Export tagset iter function
block: add offset in blk_add_request_payload()
writeback: Fix performance regression in wb_over_bg_thresh()
unsigned numbers in the ring-buffer code.
https://bugzilla.kernel.org/show_bug.cgi?id=118001
At first I did not think this was too much of an issue, because the
overflow would be caught later when either too much data was allocated
or it would trigger RB_WARN_ON() which shuts down the ring buffer.
But looking closer into it, I found that the right settings could bypass
the checks and crash the kernel. Luckily, this is only accessible
by root.
The first fix is to convert all the variables into long, such that
we don't get into issues between 32 bit variables being assigned 64 bit
ones. This fixes the RB_WARN_ON() triggering.
The next fix is to get rid of a duplicate DIV_ROUND_UP() that when called
twice with the right value, can cause a kernel crash.
The first DIV_ROUND_UP() is to normalize the input and it is checked
against the minimum allowable value. But then DIV_ROUND_UP() is called
again, which can overflow due to the (a + b - 1)/b, logic. The first
called upped the value, the second can overflow (with the +b part).
The second call to DIV_ROUND_UP() came in via a second change a while ago
and the code is cleaned up to remove it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEbBAABAgAGBQJXOdaqAAoJEKKk/i67LK/8FSAH93vLHClJJFaD5kn8dRhTS7rl
xVHAC5jHCHiKkQqIGI/N7qhzZ7DqiXpIQjs8KcE86Ser65AGNA48aeBKAA6xSQ+k
nghDGhiwLixaMIUFA7SNry4VBEcbACxtLENIhBMWo9fmw85jVTH98B958J6CXdlL
g6OC/PCNmt7eZwPrSB/aqpZ1Jp0Fik3GMXjMtY7axo9D+ONm7LF9qiHT9BcyKxN4
WHC83yDwUsWqLWxuvuhpGAeMu+nCQurRsPebyXwFh4hj56fhWJjv21ZLKtn2MjKL
8VO9sKCVEQTvLRGSzPMNP9lxkeuVp/wPrj2JRvX2JtGOqurnRNt2gqIZn2qPqA==
=Zjyz
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing ring-buffer fixes from Steven Rostedt:
"Hao Qin reported an integer overflow possibility with signed and
unsigned numbers in the ring-buffer code.
https://bugzilla.kernel.org/show_bug.cgi?id=118001
At first I did not think this was too much of an issue, because the
overflow would be caught later when either too much data was allocated
or it would trigger RB_WARN_ON() which shuts down the ring buffer.
But looking closer into it, I found that the right settings could
bypass the checks and crash the kernel. Luckily, this is only
accessible by root.
The first fix is to convert all the variables into long, such that we
don't get into issues between 32 bit variables being assigned 64 bit
ones. This fixes the RB_WARN_ON() triggering.
The next fix is to get rid of a duplicate DIV_ROUND_UP() that when
called twice with the right value, can cause a kernel crash.
The first DIV_ROUND_UP() is to normalize the input and it is checked
against the minimum allowable value. But then DIV_ROUND_UP() is
called again, which can overflow due to the (a + b - 1)/b, logic. The
first called upped the value, the second can overflow (with the +b
part).
The second call to DIV_ROUND_UP() came in via a second change a while
ago and the code is cleaned up to remove it"
* tag 'trace-fixes-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Prevent overflow of size in ring_buffer_resize()
ring-buffer: Use long for nr_pages to avoid overflow failures
- New cpufreq "schedutil" governor (making decisions based on CPU
utilization information provided by the scheduler and capable of
switching CPU frequencies right away if the underlying driver
supports that) and support for fast frequency switching in the
acpi-cpufreq driver (Rafael Wysocki).
- Consolidation of CPU frequency management on ARM platforms allowing
them to get rid of some platform-specific boilerplate code if they
are going to use the cpufreq-dt driver (Viresh Kumar, Finley Xiao,
Marc Gonzalez).
- Support for ACPI _PPC and CPU frequency limits in the intel_pstate
driver (Srinivas Pandruvada).
- Fixes and cleanups in the cpufreq core and generic governor code
(Rafael Wysocki, Sai Gurrappadi).
- intel_pstate driver optimizations and cleanups (Rafael Wysocki,
Philippe Longepe, Chen Yu, Joe Perches).
- cpufreq powernv driver fixes and cleanups (Akshay Adiga, Shilpasri
Bhat).
- cpufreq qoriq driver fixes and cleanups (Jia Hongtao).
- ACPI cpufreq driver cleanups (Viresh Kumar).
- Assorted cpufreq driver updates (Ashwin Chaugule, Geliang Tang,
Javier Martinez Canillas, Paul Gortmaker, Sudeep Holla).
- Assorted cpufreq fixes and cleanups (Joe Perches, Arnd Bergmann).
- Fixes and cleanups in the OPP (Operating Performance Points)
framework, mostly related to OPP sharing, and reorganization of
OF-dependent code in it (Viresh Kumar, Arnd Bergmann, Sudeep Holla).
- New "passive" governor for devfreq (for SoC subsystems that will
rely on someone else for the management of their power resources)
and consolidation of devfreq support for Exynos platforms, coding
style and typo fixes for devfreq (Chanwoo Choi, MyungJoo Ham).
- PM core fixes and cleanups, mostly to make it work better with the
generic power domains (genpd) framework, and updates for that
framework (Ulf Hansson, Thierry Reding, Colin Ian King).
- Intel Broxton support for the intel_idle driver (Len Brown).
- cpuidle core optimization and fix (Daniel Lezcano, Dave Gerlach).
- ARM cpuidle cleanups (Jisheng Zhang).
- Intel Kabylake support for the RAPL power capping driver (Jacob Pan).
- AVS (Adaptive Voltage Switching) rockchip-io driver update (Heiko
Stuebner).
- Updates for the cpupower tool (Arjun Sreedharan, Colin Ian King,
Mattia Dongili, Thomas Renninger).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJXOjLgAAoJEILEb/54YlRxfn0P/RbSPpNlUNBIE8DFrdD9jRdJ
TIpZ7uiHi9tU1ZF17UBbb/SwuWfYVnVmiorZGRfFOtGaoqh0HFZ/nplDz99rK0ku
vW2OnbojMQEUMU3IcUT1y4BsSl0H23f7ZOKrdprALeWxDQmbgnYjrE6vkX6hRtld
A8eeZvIEJ5CzV8S+9aOOOpojW2yXk5dYGdZ7gpQdoM0n7zVLyPnNucJoha3BYmOG
FwKEIe05RpIhfLfGT0CXIRcOzwAZ6ZWKgOrXUrx/AadPbvu/TP9zkI0djYI8ukyv
z2oiO/GExoeGVuUzvy8vY5SiH4NQvViftFzMZepcsmjxmVglohMPRL8VLjZIBckk
DDcqH9e0OQI20jjYT1vIf5+JWBvLxuQfGtyzI0S+sE/elB1zI/3O8p+8N2CuF5n+
my2dawIewnHI/0AdSpJ+K7DVrfwPHAX19axtPX3dJSLh2OuHCPNlAtbxRGAriBfH
Zv9NETxlrch69o2AD4K54DErWV1FsYLznzK5Zms6MC2Ispbb+oiYpacTlZblznvb
H5U2SSNlA5Niir3vVJ01nKRtzxlWoi67CQxbYrGhlaR0nTTxf9HqWgcSiTZrn7Pv
hs+LA2aUfMf3JGjStdORS7S8biQSid5vypfkglpWLZBKHNC9BqqZd9gSM+jF3FVh
ps4mMM4UXY4hnoFDkMBI
=WM89
-----END PGP SIGNATURE-----
Merge tag 'pm-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The majority of changes go into the cpufreq subsystem this time.
To me, quite obviously, the biggest ticket item is the new "schedutil"
governor. Interestingly enough, it's the first new cpufreq governor
since the beginning of the git era (except for some out-of-the-tree
ones).
There are two main differences between it and the existing governors.
First, it uses the information provided by the scheduler directly for
making its decisions, so it doesn't have to track anything by itself.
Second, it can invoke drivers (supporting that feature) to adjust CPU
performance right away without having to spawn work items to be
executed in process context or similar. Currently, the acpi-cpufreq
driver is the only one supporting that mode of operation, but then it
is used on a large number of systems.
The "schedutil" governor as included here is very simple and mostly
regarded as a foundation for future work on the integration of the
scheduler with CPU power management (in fact, there is work in
progress on top of it already). Nevertheless it works and the
preliminary results obtained with it are encouraging.
There also is some consolidation of CPU frequency management for ARM
platforms that can add their machine IDs the the new stub dt-platdev
driver now and that will take care of creating the requisite platform
device for cpufreq-dt, so it is not necessary to do that in platform
code any more. Several ARM platforms are switched over to using this
generic mechanism.
In addition to that, the intel_pstate driver is now going to respect
CPU frequency limits set by the platform firmware (or a BMC) and
provided via the ACPI _PPC object.
The devfreq subsystem is getting a new "passive" governor for SoCs
subsystems that will depend on somebody else to manage their voltage
rails and its support for Samsung Exynos SoCs is consolidated.
The rest is support for new hardware (Intel Broxton support in
intel_idle for one example), bug fixes, optimizations and cleanups in
a number of places.
Specifics:
- New cpufreq "schedutil" governor (making decisions based on CPU
utilization information provided by the scheduler and capable of
switching CPU frequencies right away if the underlying driver
supports that) and support for fast frequency switching in the
acpi-cpufreq driver (Rafael Wysocki)
- Consolidation of CPU frequency management on ARM platforms allowing
them to get rid of some platform-specific boilerplate code if they
are going to use the cpufreq-dt driver (Viresh Kumar, Finley Xiao,
Marc Gonzalez)
- Support for ACPI _PPC and CPU frequency limits in the intel_pstate
driver (Srinivas Pandruvada)
- Fixes and cleanups in the cpufreq core and generic governor code
(Rafael Wysocki, Sai Gurrappadi)
- intel_pstate driver optimizations and cleanups (Rafael Wysocki,
Philippe Longepe, Chen Yu, Joe Perches)
- cpufreq powernv driver fixes and cleanups (Akshay Adiga, Shilpasri
Bhat)
- cpufreq qoriq driver fixes and cleanups (Jia Hongtao)
- ACPI cpufreq driver cleanups (Viresh Kumar)
- Assorted cpufreq driver updates (Ashwin Chaugule, Geliang Tang,
Javier Martinez Canillas, Paul Gortmaker, Sudeep Holla)
- Assorted cpufreq fixes and cleanups (Joe Perches, Arnd Bergmann)
- Fixes and cleanups in the OPP (Operating Performance Points)
framework, mostly related to OPP sharing, and reorganization of
OF-dependent code in it (Viresh Kumar, Arnd Bergmann, Sudeep Holla)
- New "passive" governor for devfreq (for SoC subsystems that will
rely on someone else for the management of their power resources)
and consolidation of devfreq support for Exynos platforms, coding
style and typo fixes for devfreq (Chanwoo Choi, MyungJoo Ham)
- PM core fixes and cleanups, mostly to make it work better with the
generic power domains (genpd) framework, and updates for that
framework (Ulf Hansson, Thierry Reding, Colin Ian King)
- Intel Broxton support for the intel_idle driver (Len Brown)
- cpuidle core optimization and fix (Daniel Lezcano, Dave Gerlach)
- ARM cpuidle cleanups (Jisheng Zhang)
- Intel Kabylake support for the RAPL power capping driver (Jacob
Pan)
- AVS (Adaptive Voltage Switching) rockchip-io driver update (Heiko
Stuebner)
- Updates for the cpupower tool (Arjun Sreedharan, Colin Ian King,
Mattia Dongili, Thomas Renninger)"
* tag 'pm-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (112 commits)
intel_pstate: Clean up get_target_pstate_use_performance()
intel_pstate: Use sample.core_avg_perf in get_avg_pstate()
intel_pstate: Clarify average performance computation
intel_pstate: Avoid unnecessary synchronize_sched() during initialization
cpufreq: schedutil: Make default depend on CONFIG_SMP
cpufreq: powernv: del_timer_sync when global and local pstate are equal
cpufreq: powernv: Move smp_call_function_any() out of irq safe block
intel_pstate: Clean up intel_pstate_get()
cpufreq: schedutil: Make it depend on CONFIG_SMP
cpufreq: governor: Fix handling of special cases in dbs_update()
PM / OPP: Move CONFIG_OF dependent code in a separate file
cpufreq: intel_pstate: Ignore _PPC processing under HWP
cpufreq: arm_big_little: use generic OPP functions for {init, free}_opp_table
PM / OPP: add non-OF versions of dev_pm_opp_{cpumask_, }remove_table
cpufreq: tango: Use generic platdev driver
PM / OPP: pass cpumask by reference
cpufreq: Fix GOV_LIMITS handling for the userspace governor
cpupower: fix potential memory leak
PM / devfreq: style/typo fixes
PM / devfreq: exynos: Add the detailed correlation for Exynos5422 bus
..
* pm-cpufreq: (63 commits)
intel_pstate: Clean up get_target_pstate_use_performance()
intel_pstate: Use sample.core_avg_perf in get_avg_pstate()
intel_pstate: Clarify average performance computation
intel_pstate: Avoid unnecessary synchronize_sched() during initialization
cpufreq: schedutil: Make default depend on CONFIG_SMP
cpufreq: powernv: del_timer_sync when global and local pstate are equal
cpufreq: powernv: Move smp_call_function_any() out of irq safe block
intel_pstate: Clean up intel_pstate_get()
cpufreq: schedutil: Make it depend on CONFIG_SMP
cpufreq: governor: Fix handling of special cases in dbs_update()
cpufreq: intel_pstate: Ignore _PPC processing under HWP
cpufreq: arm_big_little: use generic OPP functions for {init, free}_opp_table
cpufreq: tango: Use generic platdev driver
cpufreq: Fix GOV_LIMITS handling for the userspace governor
cpufreq: mvebu: Move cpufreq code into drivers/cpufreq/
cpufreq: dt: Kill platform-data
mvebu: Use dev_pm_opp_set_sharing_cpus() to mark OPP tables as shared
cpufreq: dt: Identify cpu-sharing for platforms without operating-points-v2
cpufreq: governor: Change confusing struct field and variable names
cpufreq: intel_pstate: Enable PPC enforcement for servers
...
If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE
then the DIV_ROUND_UP() will return zero.
Here's the details:
# echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb
tracing_entries_write() processes this and converts kb to bytes.
18014398509481980 << 10 = 18446744073709547520
and this is passed to ring_buffer_resize() as unsigned long size.
size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
Where DIV_ROUND_UP(a, b) is (a + b - 1)/b
BUF_PAGE_SIZE is 4080 and here
18446744073709547520 + 4080 - 1 = 18446744073709551599
where 18446744073709551599 is still smaller than 2^64
2^64 - 18446744073709551599 = 17
But now 18446744073709551599 / 4080 = 4521260802379792
and size = size * 4080 = 18446744073709551360
This is checked to make sure its still greater than 2 * 4080,
which it is.
Then we convert to the number of buffer pages needed.
nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE)
but this time size is 18446744073709551360 and
2^64 - (18446744073709551360 + 4080 - 1) = -3823
Thus it overflows and the resulting number is less than 4080, which makes
3823 / 4080 = 0
an nr_pages is set to this. As we already checked against the minimum that
nr_pages may be, this causes the logic to fail as well, and we crash the
kernel.
There's no reason to have the two DIV_ROUND_UP() (that's just result of
historical code changes), clean up the code and fix this bug.
Cc: stable@vger.kernel.org # 3.5+
Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The size variable to change the ring buffer in ftrace is a long. The
nr_pages used to update the ring buffer based on the size is int. On 64 bit
machines this can cause an overflow problem.
For example, the following will cause the ring buffer to crash:
# cd /sys/kernel/debug/tracing
# echo 10 > buffer_size_kb
# echo 8556384240 > buffer_size_kb
Then you get the warning of:
WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260
Which is:
RB_WARN_ON(cpu_buffer, nr_removed);
Note each ring buffer page holds 4080 bytes.
This is because:
1) 10 causes the ring buffer to have 3 pages.
(10kb requires 3 * 4080 pages to hold)
2) (2^31 / 2^10 + 1) * 4080 = 8556384240
The value written into buffer_size_kb is shifted by 10 and then passed
to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760
3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
which is 4080. 8761737461760 / 4080 = 2147484672
4) nr_pages is subtracted from the current nr_pages (3) and we get:
2147484669. This value is saved in a signed integer nr_pages_to_update
5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int
turns into the value of -2147482627
6) As the value is a negative number, in update_pages_handler() it is
negated and passed to rb_remove_pages() and 2147482627 pages will
be removed, which is much larger than 3 and it causes the warning
because not all the pages asked to be removed were removed.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001
Cc: stable@vger.kernel.org # 2.6.28+
Fixes: 7a8e76a382 ("tracing: unified trace buffer")
Reported-by: Hao Qin <QEver.cn@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
BLK_TC_NOTIFY is missed in mask_maps, so we can't print out notify or
set mask with 'notify' name.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
commit f4a1d08ce6 introduces a regression. Originally for
BLK_TN_MESSAGE, we add message in trace and return. The commit ignores
the early return and add garbage info.
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In netdevice.h we removed the structure in net-next that is being
changes in 'net'. In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.
The mlx5 conflicts have to do with vxlan support dependencies.
Signed-off-by: David S. Miller <davem@davemloft.net>
Filtering of events requires the data to be written to the ring buffer
before it can be decided to filter or not. This is because the parameters of
the filter are based on the result that is written to the ring buffer and
not on the parameters that are passed into the trace functions.
The ftrace ring buffer is optimized for writing into the ring buffer and
committing. The discard procedure used when filtering decides the event
should be discarded is much more heavy weight. Thus, using a temporary
filter when filtering events can speed things up drastically.
Without a temp buffer we have:
# trace-cmd start -p nop
# perf stat -r 10 hackbench 50
0.790706626 seconds time elapsed ( +- 0.71% )
# trace-cmd start -e all
# perf stat -r 10 hackbench 50
1.566904059 seconds time elapsed ( +- 0.27% )
# trace-cmd start -e all -f 'common_preempt_count==20'
# perf stat -r 10 hackbench 50
1.690598511 seconds time elapsed ( +- 0.19% )
# trace-cmd start -e all -f 'common_preempt_count!=20'
# perf stat -r 10 hackbench 50
1.707486364 seconds time elapsed ( +- 0.30% )
The first run above is without any tracing, just to get a based figure.
hackbench takes ~0.79 seconds to run on the system.
The second run enables tracing all events where nothing is filtered. This
increases the time by 100% and hackbench takes 1.57 seconds to run.
The third run filters all events where the preempt count will equal "20"
(this should never happen) thus all events are discarded. This takes 1.69
seconds to run. This is 10% slower than just committing the events!
The last run enables all events and filters where the filter will commit all
events, and this takes 1.70 seconds to run. The filtering overhead is
approximately 10%. Thus, the discard and commit of an event from the ring
buffer may be about the same time.
With this patch, the numbers change:
# trace-cmd start -p nop
# perf stat -r 10 hackbench 50
0.778233033 seconds time elapsed ( +- 0.38% )
# trace-cmd start -e all
# perf stat -r 10 hackbench 50
1.582102692 seconds time elapsed ( +- 0.28% )
# trace-cmd start -e all -f 'common_preempt_count==20'
# perf stat -r 10 hackbench 50
1.309230710 seconds time elapsed ( +- 0.22% )
# trace-cmd start -e all -f 'common_preempt_count!=20'
# perf stat -r 10 hackbench 50
1.786001924 seconds time elapsed ( +- 0.20% )
The first run is again the base with no tracing.
The second run is all tracing with no filtering. It is a little slower, but
that may be well within the noise.
The third run shows that discarding all events only took 1.3 seconds. This
is a speed up of 23%! The discard is much faster than even the commit.
The one downside is shown in the last run. Events that are not discarded by
the filter will take longer to add, this is due to the extra copy of the
event.
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently register functions for events will be called
through the 'reg' field of event class directly without
any check when seting up triggers.
Triggers for events that don't support register through
debug fs (events under events/ftrace are for trace-cmd to
read event format, and most of them don't have a register
function except events/ftrace/functionx) can't be enabled
at all, and an oops will be hit when setting up trigger
for those events, so just not creating them is an easy way
to avoid the oops.
Link: http://lkml.kernel.org/r/1462275274-3911-1-git-send-email-chuhu@redhat.com
Cc: stable@vger.kernel.org # 3.14+
Fixes: 85f2b08268 ("tracing: Add basic event trigger framework")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only user of trace_current_buffer_lock_reserve() is in the boot up self
tests. Restructure the code a little to have that code use what everything
else uses: trace_event_buffer_lock_reserve().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's no real difference between trace_buffer_unlock_commit() and
trace_buffer_unlock_commit_regs() except that the former passes NULL to
ftrace_stack_trace() instead of regs. Have the former be a static inline of
the latter which passes NULL for regs.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_buffer_unlock_commit() and the _regs() version are only
used within the kernel/trace directory. Move them to the local header and
remove the export as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function filter_check_discard() is small and only called by one user,
its code can be folded into that one caller and make the code a bit less
comlplex.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Nothing outside of the tracing directory calls filter_check_discard() or
check_filter_check_discard(). They should not be called by modules. Move
their prototypes into the local tracing header and remove their
EXPORT_SYMBOL() macros.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions event_trigger_unlock_commit() and
event_trigger_unlock_commit_regs() are no longer used outside the tracing
system. Move them out of the generic headers and into the local one.
Along with __event_trigger_test_discard() that is only used by them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the ppc64 big endian ABI, function symbols point to function
descriptors. The symbols which point to the function entry points
have a dot in front of the function name. Consequently, when the
ftrace filter mechanism searches for the symbol corresponding to
an entry point address, it gets the dot symbol.
As a result, ftrace filter users have to be aware of this ABI detail on
ppc64 and prepend a dot to the function name when setting the filter.
The perf probe command insulates the user from this by ignoring the dot
in front of the symbol name when matching function names to symbols,
but the sysfs interface does not. This patch makes the ftrace filter
mechanism do the same when searching symbols.
Fixes the following failure in ftracetest's kprobe_ftrace.tc:
.../kprobe_ftrace.tc: line 9: echo: write error: Invalid argument
That failure is on this line of kprobe_ftrace.tc:
echo _do_fork > set_ftrace_filter
This is because there's no _do_fork entry in the functions list:
# cat available_filter_functions | grep _do_fork
._do_fork
This change introduces no regressions on the perf and ftracetest
testsuite results.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With the following code snippet:
...
char buf[64];
...
if (copy_from_user(&buf, ubuf, cnt))
...
Even though the value of "&buf" equals "buf", but there is no need
to get the address of the "buf" again. Use "buf" instead of "&buf".
Link: http://lkml.kernel.org/r/20160418152329.18b72bea@debian
Signed-off-by: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If tracing_map_elt_alloc() fails, it will return ERR_PTR() instead of
NULL, so change the check to IS_ERROR(). We also need to set the
failed entry in the map->elts array to NULL instead of ERR_PTR() so
tracing_map_free_elts() doesn't try freeing an ERR_PTR().
tracing_map_free_elts() should also zero out what it frees so a
reentrant call won't find previously freed elements.
Link: http://lkml.kernel.org/r/f29d03b00bce3aac8cf151a8a30e6c83e5fee66d.1461610073.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Smatch flagged create_hist_field() as possibly being able to
dereference a NULL pointer, although the current code exits in all
cases where the event field could be NULL, so it's not actually a
problem.
Still, to prevent future changes to the code from overlooking new
cases, make the NULL pointer check explicit and warn once in that
case.
Link: http://lkml.kernel.org/r/cfbc003f534a3e441b4313272fd412310aba6336.1461610073.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the event-fork option requires doing work when enabled and disabled, it
can not be passed down to created instances. The instance must clear this
flag when it is created, and must clear it when its removed.
As more options may be created with this need, a macro ZEROED_TRACE_FLAGS is
created that holds the flags that must not be inherited by the top level
instance, and must be cleared on removal of instances.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec3042 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1 ("samples: bpf: add bpf_perf_event_output example").
The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.
User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.
While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.
Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a BPF_F_CURRENT_CPU flag to optimize the use-case where user space has
per-CPU ring buffers and the eBPF program pushes the data into the current
CPU's ring buffer which saves us an extra helper function call in eBPF.
Also, make sure to properly reserve the remaining flags which are not used.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fengguang Wu's bot found two comparisons of unsigned integers to zero. These
were real bugs, as it would miss error conditions returned to zero.
trace_events_hist.c:426:6-9: WARNING: Unsigned expression compared with zero: idx < 0
trace_events_hist.c:568:5-14: WARNING: Unsigned expression compared with zero: n_entries < 0
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to define 'named' hist triggers. All triggers created
with the same 'name=xxx' option will update the same shared histogram
data.
This expands the hist trigger syntax from this:
# echo hist:keys=xxx ... [ if filter] > event/trigger
to this:
# echo hist:name=xxx:keys=xxx ... [ if filter] > event/trigger
Named histograms must use a 'compatible' set of keys and values, which
means each event added to a set of named triggers must have the same
names and types.
Reading the 'hist' file of any of the participating events will
produce the same output as any other participating event, which is to
be expected since they share the same data.
Link: http://lkml.kernel.org/r/1dbc84ee3322a75daaf5b3ef1d0cc0a2fb682fc7.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Named triggers are sets of triggers that share a common set of trigger
data. An example of functionality that could benefit from this type
of capability would be a set of inlined probes that would each
contribute event counts, for example, to a shared counter data
structure.
The first named trigger registered with a given name owns the common
trigger data that the others subsequently registered with the same
name will reference. The functions defined here allow users to add,
delete, and find named triggers.
It also adds functions to pause and unpause named triggers; since
named triggers act upon common data, they should also be paused and
unpaused as a group.
Link: http://lkml.kernel.org/r/c09ff648360f65b10a3e321eddafe18060b4a04f.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to define any number of hist triggers per trace event.
Any number of hist triggers may be added for a given event, which may
differ by key, value, or filter.
Reading the event's 'hist' file will display the output of all the
hist triggers defined on an event concatenated in the order they were
defined.
Link: http://lkml.kernel.org/r/48a0c8dd34c344571de880fb35e211c6d9a28961.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Similar to enable_event/disable_event triggers, these triggers enable
and disable the aggregation of events into maps rather than enabling
and disabling their writing into the trace buffer.
They can be used to automatically start and stop hist triggers based
on a matching filter condition.
If there's a paused hist trigger on system:event, the following would
start it when the filter condition was hit:
# echo enable_hist:system:event [ if filter] > event/trigger
And the following would disable a running system:event hist trigger:
# echo disable_hist:system:event [ if filter] > event/trigger
See Documentation/trace/events.txt for real examples.
Link: http://lkml.kernel.org/r/f812f086e52c8b7c8ad5443487375e03c96a601f.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we assume the maximum size for a string field, we don't have to
worry about its position. Since we only allow two keys in a compound
key and having more than one string key in a given compound key
doesn't make much sense anyway, trading a bit of extra space instead
of introducing an arbitrary restriction makes more sense.
We also need to use the event field size for static strings when
copying the contents, otherwise we get random garbage in the key.
Also, cast string return values to avoid warnings on 32-bit compiles.
Finally, rearrange the code without changing any functionality by
moving the compound key updating code into a separate function.
Link: http://lkml.kernel.org/r/8976e1ab04b66bc2700ad1ed0768a2de85ac1983.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The string in a trace event is usually recorded as dynamic array which
is variable length. But current hist code only support fixed length
array so it cannot support most strings.
This patch fixes it by checking filter_type of the field and get
proper pointer with it. With this, it can get a histogram of exec()
based on filenames like below:
# cd /sys/kernel/tracing/events/sched/sched_process_exec
# cat 'hist:key=filename' > trigger
# ps
PID TTY TIME CMD
1 ? 00:00:00 init
29 ? 00:00:00 sh
38 ? 00:00:00 ps
# ls
enable filter format hist id trigger
# cat hist
# trigger info: hist:keys=filename:vals=hitcount:sort=hitcount:size=2048 [active]
{ filename: /usr/bin/ps } hitcount: 1
{ filename: /usr/bin/ls } hitcount: 1
{ filename: /usr/bin/cat } hitcount: 1
Totals:
Hits: 3
Entries: 3
Dropped: 0
Link: http://lkml.kernel.org/r/610180d6df0cfdf11ee205452f3b241dea657233.1457029949.git.tom.zanussi@linux.intel.com
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Added (unsigned long) typecast to fix compile warning ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It's often useful to be able to use a stacktrace as a hash key, for
keeping a count of the number of times a particular call path resulted
in a trace event, for instance. Add a special key named 'stacktrace'
which can be used as key in a 'keys=' param for this purpose:
# echo hist:keys=stacktrace ... \
[ if filter] > event/trigger
Link: http://lkml.kernel.org/r/87515e90b3785232a874a12156174635a348edb1.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to have syscall id fields displayed as syscall names in
the output by appending '.syscall' to field names:
# echo hist:keys=aaa.syscall ... \
[ if filter] > event/trigger
Link: http://lkml.kernel.org/r/2bab1e59933d76a14b545bd2e02f80b8b08ac4d3.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to have common_pid field values displayed as program names
in the output by appending '.execname' to a common_pid field name:
# echo hist:keys=common_pid.execname ... \
[ if filter] > event/trigger
Link: http://lkml.kernel.org/r/e172e81f10f5b8d1f08450e3763c850f39fbf698.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to have address fields displayed as symbols in the output
by appending '.sym' or 'sym-offset' to field names:
# echo hist:keys=aaa.sym,bbb.sym-offset ... \
[ if filter] > event/trigger
Link: http://lkml.kernel.org/r/87d4935821491c0275513f0fbfb9bab8d3d3f079.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to have numeric fields displayed as hex values in the
output by appending '.hex' to field names:
# echo hist:keys=aaa,bbb.hex:vals=ccc.hex ... \
[ if filter] > event/trigger
Link: http://lkml.kernel.org/r/67bd431edda2af5798d7694818f7e8d71b6b3463.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to append 'clear' to an existing trigger in order to have
the hash table cleared.
This expands the hist trigger syntax from this:
# echo hist:keys=xxx:vals=yyy:sort=zzz.descending:pause/cont \
[ if filter] >> event/trigger
to this:
# echo hist:keys=xxx:vals=yyy:sort=zzz.descending:pause/cont/clear \
[ if filter] >> event/trigger
Link: http://lkml.kernel.org/r/ae15dd0d9b2f7af07a37c1ff682063e2dbcdf160.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to append 'pause' or 'continue' to an existing trigger in
order to have it paused or to have a paused trace continue.
This expands the hist trigger syntax from this:
# echo hist:keys=xxx:vals=yyy:sort=zzz.descending \
[ if filter] >> event/trigger
to this:
# echo hist:keys=xxx:vals=yyy:sort=zzz.descending:pause or cont \
[ if filter] >> event/trigger
Link: http://lkml.kernel.org/r/b672a92c14702cb924cdf6fc27ea1809bed04907.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to specify keys and/or values to sort on. With this
addition, keys and values specified using the 'keys=' and 'vals='
keywords can be used to sort the hist trigger output via a new 'sort='
keyword. If multiple sort keys are specified, the output will be
sorted using the second key as a secondary sort key, etc. The default
sort order is ascending; if the user wants a different sort order,
'.descending' can be appended to the specific sort key. Before this
addition, output was always sorted by 'hitcount' in ascending order.
This expands the hist trigger syntax from this:
# echo hist:keys=xxx:vals=yyy \
[ if filter] > event/trigger
to this:
# echo hist:keys=xxx:vals=yyy:sort=zzz.descending \
[ if filter] > event/trigger
Link: http://lkml.kernel.org/r/b30a41db66ba486979c4f987aff5fab500ea53b3.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to specify multiple trace event fields to use in keys by
allowing multiple fields in the 'keys=' keyword. With this addition,
any unique combination of any of the fields named in the 'keys'
keyword will result in a new entry being added to the hash table.
Link: http://lkml.kernel.org/r/0cfa24e6ac3b0dcece7737d94aa1f322ae3afc4b.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow users to specify trace event fields to use in aggregated sums
via a new 'vals=' keyword. Before this addition, the only aggregated
sum supported was the implied value 'hitcount'. With this addition,
'hitcount' is also supported as an explicit value field, as is any
numeric trace event field.
This expands the hist trigger syntax from this:
# echo hist:keys=xxx [ if filter] > event/trigger
to this:
# echo hist:keys=xxx:vals=yyy [ if filter] > event/trigger
Link: http://lkml.kernel.org/r/2a5d1adb5ba6c65d7bb2148e379f2fed47f29a68.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
'hist' triggers allow users to continually aggregate trace events,
which can then be viewed afterwards by simply reading a 'hist' file
containing the aggregation in a human-readable format.
The basic idea is very simple and boils down to a mechanism whereby
trace events, rather than being exhaustively dumped in raw form and
viewed directly, are automatically 'compressed' into meaningful tables
completely defined by the user.
This is done strictly via single-line command-line commands and
without the aid of any kind of programming language or interpreter.
A surprising number of typical use cases can be accomplished by users
via this simple mechanism. In fact, a large number of the tasks that
users typically do using the more complicated script-based tracing
tools, at least during the initial stages of an investigation, can be
accomplished by simply specifying a set of keys and values to be used
in the creation of a hash table.
The Linux kernel trace event subsystem happens to provide an extensive
list of keys and values ready-made for such a purpose in the form of
the event format files associated with each trace event. By simply
consulting the format file for field names of interest and by plugging
them into the hist trigger command, users can create an endless number
of useful aggregations to help with investigating various properties
of the system. See Documentation/trace/events.txt for examples.
hist triggers are implemented on top of the existing event trigger
infrastructure, and as such are consistent with the existing triggers
from a user's perspective as well.
The basic syntax follows the existing trigger syntax. Users start an
aggregation by writing a 'hist' trigger to the event of interest's
trigger file:
# echo hist:keys=xxx [ if filter] > event/trigger
Once a hist trigger has been set up, by default it continually
aggregates every matching event into a hash table using the event key
and a value field named 'hitcount'.
To view the aggregation at any point in time, simply read the 'hist'
file in the same directory as the 'trigger' file:
# cat event/hist
The detailed syntax provides additional options for user control, and
is described exhaustively in Documentation/trace/events.txt and in the
virtual tracing/README file in the tracing subsystem.
Link: http://lkml.kernel.org/r/72d263b5e1853fe9c314953b65833c3aa75479f2.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make it clear exactly how many keys and values are supported through
better defines, and add 1 to the vals count, since normally clients
want support for at least a hitcount and two other values.
Also, note the error return value for tracing_map_add_key/val_field()
in the comments.
Link: http://lkml.kernel.org/r/6696fa02ebc716aa344c27a571a2afaa25e5b4d4.1457029949.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The config option for TRACING_MAP has "default n", which is not needed
because the default of configs is 'n'.
Also, since the TRACING_MAP has no config prompt, there's no reason to
include "If in doubt, say N" in the help text.
Fixed a typo in the comments of tracing_map.h.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add tracing_map, a special-purpose lock-free map for tracing.
tracing_map is designed to aggregate or 'sum' one or more values
associated with a specific object of type tracing_map_elt, which
is associated by the map to a given key.
It provides various hooks allowing per-tracer customization and is
separated out into a separate file in order to allow it to be shared
between multiple tracers, but isn't meant to be generally used outside
of that context.
The tracing_map implementation was inspired by lock-free map
algorithms originated by Dr. Cliff Click:
http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtablehttp://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
Link: http://lkml.kernel.org/r/b43d68d1add33582a396f553c8ef705a33a6a748.1449767187.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the infrastructure needed to have the PIDs in set_event_pid to
automatically add PIDs of the children of the tasks that have their PIDs in
set_event_pid. This will also remove PIDs from set_event_pid when a task
exits
This is implemented by adding hooks into the fork and exit tracepoints. On
fork, the PIDs are added to the list, and on exit, they are removed.
Add a new option called event_fork that when set, PIDs in set_event_pid will
automatically get their children PIDs added when they fork, as well as any
task that exits will have its PID removed from set_event_pid.
This works for instances as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to add the ability to let tasks that are filtered by the events
have their children also be traced on fork (and then not traced on exit),
convert the array into a pid bitmask. Most of the time the number of pids is
only 32768 pids or a 4k bitmask, which is the same size as the default list
currently is, and that list could grow if more pids are listed.
This also greatly simplifies the code.
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "check_ignore_pid" is confusing in trying to figure out if the pid
should be ignored or not. Rename it to "ignore_this_task" which is pretty
straight forward, as a task (not a pid) is passed in, and should if true
should be ignored.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Two new functions in bpf contain a cast from a 'u64' to a
pointer. This works on 64-bit architectures but causes a warning
on all 32-bit architectures:
kernel/trace/bpf_trace.c: In function 'bpf_perf_event_output_tp':
kernel/trace/bpf_trace.c:350:13: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
u64 ctx = *(long *)r1;
This changes the cast to first convert the u64 argument into a uintptr_t,
which is guaranteed to be the same size as a pointer.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 9940d67c93 ("bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
also be removed since the verifier already makes sure we stay within bounds
on stack buffers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support live patching on powerpc we would like to call
ftrace_location_range(), so make it global.
Signed-off-by: Torsten Duwe <duwe@suse.de>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
during bpf program loading remember the last byte of ctx access
and at the time of attaching the program to tracepoint check that
the program doesn't access bytes beyond defined in tracepoint fields
This also disallows access to __dynamic_array fields, but can be
relaxed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
needs two wrapper functions to fetch 'struct pt_regs *' to convert
tracepoint bpf context into kprobe bpf context to reuse existing
helper functions
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
register tracepoint bpf program type and let it call the same set
of helper functions as BPF_PROG_TYPE_KPROBE
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
split allows to move expensive update of 'struct trace_entry' to later phase.
Repurpose unused 1st argument of perf_tp_event() to indicate event type.
While splitting use temp variable 'rctx' instead of '*rctx' to avoid
unnecessary loads done by the compiler due to -fno-strict-aliasing
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
avoid memset in perf_fetch_caller_regs, since it's the critical path of all tracepoints.
It's called from perf_sw_event_sched, perf_event_task_sched_in and all of perf_trace_##call
with this_cpu_ptr(&__perf_regs[..]) which are zero initialized by perpcu init logic and
subsequent call to perf_arch_fetch_caller_regs initializes the same fields on all archs,
so we can safely drop memset from all of the above cases and move it into
perf_ftrace_function_call that calls it with stack allocated pt_regs.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.
Doing that is possible after commit 34e2c555f3 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the the utilization data passed to cpufreq_update_util() by CFS.
The new governor is relatively simple.
The frequency selection formula used by it depends on whether or not
the utilization is frequency-invariant. In the frequency-invariant
case the new CPU frequency is given by
next_freq = 1.25 * max_freq * util / max
where util and max are the last two arguments of cpufreq_update_util().
In turn, if util is not frequency-invariant, the maximum frequency in
the above formula is replaced with the current frequency of the CPU:
next_freq = 1.25 * curr_freq * util / max
The coefficient 1.25 corresponds to the frequency tipping point at
(util / max) = 0.8.
All of the computations are carried out in the utilization update
handlers provided by the new governor. One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).
The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers. If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes. That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.
The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Currently we check sample type for ftrace:function events
even if it's not created as a sampling event. That prevents
creating ftrace_function event in counting mode.
Make sure we check sample types only for sampling events.
Before:
$ sudo perf stat -e ftrace:function ls
...
Performance counter stats for 'ls':
<not supported> ftrace:function
0.001983662 seconds time elapsed
After:
$ sudo perf stat -e ftrace:function ls
...
Performance counter stats for 'ls':
44,498 ftrace:function
0.037534722 seconds time elapsed
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1458138873-1553-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the
number of unique stack traces needed to be stored.
Move the definition of __irq_entry to <linux/interrupt.h> so that the
users don't need to pull in <linux/ftrace.h>. Also introduce the
__softirq_entry macro which is similar to __irq_entry, but puts the
corresponding functions to the .softirqentry.text section.
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some visible changes:
A new flag was added to distinguish traces done in NMI context.
Preempt tracer now shows functions where preemption is disabled but
interrupts are still enabled.
Other notes:
Updates were done to function tracing to allow better performance
with perf.
Infrastructure code has been added to allow for a new histogram
feature for recording live trace event histograms that can be
configured by simple user commands. The feature itself was just
finished, but needs a round in linux-next before being pulled.
This only includes some infrastructure changes that will be needed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJW8/WPAAoJEKKk/i67LK/8wrAH/j2gU9ZfjVxTu8068TBGWRJP
yvvzq0cK5evB3dsVuUmKKRfU52nSv4J1WcFF569X0RulSLylR0dHlcxFJMn4kkgR
bm0AHRrqOf87ub3VimcpG146iVQij37l5A0SRoFbvSPLQx1KUW18v99x41Ji8dv6
oWXRc6/YhdzEE7l0nUsVjmScQ4b2emsems3cxZzXOY+nRJsiim6i+VaDeatdyey1
csLVqtRCs+x62TVtxG3+GhcLdRoPRbnHAGzrKDFIn1SrQaRXCc54wN5d2hWxjgNI
1laOwaj070lnJiWfBLIP/K+lx+VKRx5/O0rKZX35foLUTqJJKSyjAbKXuMCcSAM=
=2h2K
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Nothing major this round. Mostly small clean ups and fixes.
Some visible changes:
- A new flag was added to distinguish traces done in NMI context.
- Preempt tracer now shows functions where preemption is disabled but
interrupts are still enabled.
Other notes:
- Updates were done to function tracing to allow better performance
with perf.
- Infrastructure code has been added to allow for a new histogram
feature for recording live trace event histograms that can be
configured by simple user commands. The feature itself was just
finished, but needs a round in linux-next before being pulled.
This only includes some infrastructure changes that will be needed"
* tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits)
tracing: Record and show NMI state
tracing: Fix trace_printk() to print when not using bprintk()
tracing: Remove redundant reset per-CPU buff in irqsoff tracer
x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c
tracing: Fix crash from reading trace_pipe with sendfile
tracing: Have preempt(irqs)off trace preempt disabled functions
tracing: Fix return while holding a lock in register_tracer()
ftrace: Use kasprintf() in ftrace_profile_tracefs()
ftrace: Update dynamic ftrace calls only if necessary
ftrace: Make ftrace_hash_rec_enable return update bool
tracing: Fix typoes in code comment and printk in trace_nop.c
tracing, writeback: Replace cgroup path to cgroup ino
tracing: Use flags instead of bool in trigger structure
tracing: Add an unreg_all() callback to trigger commands
tracing: Add needs_rec flag to event triggers
tracing: Add a per-event-trigger 'paused' field
tracing: Add get_syscall_name()
tracing: Add event record param to trigger_ops.func()
tracing: Make event trigger functions available
tracing: Make ftrace_event_field checking functions available
...
Use the more common logging method with the eventual goal of removing
pr_warning altogether.
Miscellanea:
- Realign arguments
- Coalesce formats
- Add missing space between a few coalesced formats
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [kernel/power/suspend.c]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The latency tracer format has a nice column to indicate IRQ state, but
this is not able to tell us about NMI state.
When tracing perf interrupt handlers (which often run in NMI context)
it is very useful to see how the events nest.
Link: http://lkml.kernel.org/r/20160318153022.105068893@infradead.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_printk() code will allocate extra buffers if the compile detects
that a trace_printk() is used. To do this, the format of the trace_printk()
is saved to the __trace_printk_fmt section, and if that section is bigger
than zero, the buffers are allocated (along with a message that this has
happened).
If trace_printk() uses a format that is not a constant, and thus something
not guaranteed to be around when the print happens, the compiler optimizes
the fmt out, as it is not used, and the __trace_printk_fmt section is not
filled. This means the kernel will not allocate the special buffers needed
for the trace_printk() and the trace_printk() will not write anything to the
tracing buffer.
Adding a "__used" to the variable in the __trace_printk_fmt section will
keep it around, even though it is set to NULL. This will keep the string
from being printed in the debugfs/tracing/printk_formats section as it is
not needed.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 07d777fe8c "tracing: Add percpu buffers for trace_printk()"
Cc: stable@vger.kernel.org # v3.5+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull networking updates from David Miller:
"Highlights:
1) Support more Realtek wireless chips, from Jes Sorenson.
2) New BPF types for per-cpu hash and arrap maps, from Alexei
Starovoitov.
3) Make several TCP sysctls per-namespace, from Nikolay Borisov.
4) Allow the use of SO_REUSEPORT in order to do per-thread processing
of incoming TCP/UDP connections. The muxing can be done using a
BPF program which hashes the incoming packet. From Craig Gallek.
5) Add a multiplexer for TCP streams, to provide a messaged based
interface. BPF programs can be used to determine the message
boundaries. From Tom Herbert.
6) Add 802.1AE MACSEC support, from Sabrina Dubroca.
7) Avoid factorial complexity when taking down an inetdev interface
with lots of configured addresses. We were doing things like
traversing the entire address less for each address removed, and
flushing the entire netfilter conntrack table for every address as
well.
8) Add and use SKB bulk free infrastructure, from Jesper Brouer.
9) Allow offloading u32 classifiers to hardware, and implement for
ixgbe, from John Fastabend.
10) Allow configuring IRQ coalescing parameters on a per-queue basis,
from Kan Liang.
11) Extend ethtool so that larger link mode masks can be supported.
From David Decotigny.
12) Introduce devlink, which can be used to configure port link types
(ethernet vs Infiniband, etc.), port splitting, and switch device
level attributes as a whole. From Jiri Pirko.
13) Hardware offload support for flower classifiers, from Amir Vadai.
14) Add "Local Checksum Offload". Basically, for a tunneled packet
the checksum of the outer header is 'constant' (because with the
checksum field filled into the inner protocol header, the payload
of the outer frame checksums to 'zero'), and we can take advantage
of that in various ways. From Edward Cree"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
bonding: fix bond_get_stats()
net: bcmgenet: fix dma api length mismatch
net/mlx4_core: Fix backward compatibility on VFs
phy: mdio-thunder: Fix some Kconfig typos
lan78xx: add ndo_get_stats64
lan78xx: handle statistics counter rollover
RDS: TCP: Remove unused constant
RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
net: smc911x: convert pxa dma to dmaengine
team: remove duplicate set of flag IFF_MULTICAST
bonding: remove duplicate set of flag IFF_MULTICAST
net: fix a comment typo
ethernet: micrel: fix some error codes
ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
bpf, dst: add and use dst_tclassid helper
bpf: make skb->tc_classid also readable
net: mvneta: bm: clarify dependencies
cls_bpf: reset class and reuse major in da
ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
ldmvsw: Add ldmvsw.c driver code
...
There is no reason to do it twice: from commit b6f11df26f
("trace: Call tracing_reset_online_cpus before tracer->init()")
resetting of per-CPU buffers done before tracer->init() call.
tracer->init() calls {irqs,preempt,preemptirqs}off_tracer_init() and it
calls __irqsoff_tracer_init(), which resets per-CPU ringbuffer second
time.
It's slowpath, but anyway.
Link: http://lkml.kernel.org/r/1445278226-16187-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If tracing contains data and the trace_pipe file is read with sendfile(),
then it can trigger a NULL pointer dereference and various BUG_ON within the
VM code.
There's a patch to fix this in the splice_to_pipe() code, but it's also a
good idea to not let that happen from trace_pipe either.
Link: http://lkml.kernel.org/r/1457641146-9068-1-git-send-email-rabin@rab.in
Cc: stable@vger.kernel.org # 2.6.30+
Reported-by: Rabin Vincent <rabin.vincent@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Joel Fernandes reported that the function tracing of preempt disabled
sections was not being reported when running either the preemptirqsoff or
preemptoff tracers. This was due to the fact that the function tracer
callback for those tracers checked if irqs were disabled before tracing. But
this fails when we want to trace preempt off locations as well.
Joel explained that he wanted to see funcitons where interrupts are enabled
but preemption was disabled. The expected output he wanted:
<...>-2265 1d.h1 3419us : preempt_count_sub <-irq_exit
<...>-2265 1d..1 3419us : __do_softirq <-irq_exit
<...>-2265 1d..1 3419us : msecs_to_jiffies <-__do_softirq
<...>-2265 1d..1 3420us : irqtime_account_irq <-__do_softirq
<...>-2265 1d..1 3420us : __local_bh_disable_ip <-__do_softirq
<...>-2265 1..s1 3421us : run_timer_softirq <-__do_softirq
<...>-2265 1..s1 3421us : hrtimer_run_pending <-run_timer_softirq
<...>-2265 1..s1 3421us : _raw_spin_lock_irq <-run_timer_softirq
<...>-2265 1d.s1 3422us : preempt_count_add <-_raw_spin_lock_irq
<...>-2265 1d.s2 3422us : _raw_spin_unlock_irq <-run_timer_softirq
<...>-2265 1..s2 3422us : preempt_count_sub <-_raw_spin_unlock_irq
<...>-2265 1..s1 3423us : rcu_bh_qs <-__do_softirq
<...>-2265 1d.s1 3423us : irqtime_account_irq <-__do_softirq
<...>-2265 1d.s1 3423us : __local_bh_enable <-__do_softirq
There's a comment saying that the irq disabled check is because there's a
possible race that tracing_cpu may be set when the function is executed. But
I don't remember that race. For now, I added a check for preemption being
enabled too to not record the function, as there would be no race if that
was the case. I need to re-investigate this, as I'm now thinking that the
tracing_cpu will always be correct. But no harm in keeping the check for
now, except for the slight performance hit.
Link: http://lkml.kernel.org/r/1457770386-88717-1-git-send-email-agnel.joel@gmail.com
Fixes: 5e6d2b9cfa "tracing: Use one prologue for the preempt irqs off tracer function tracers"
Cc: stable@vget.kernel.org # 2.6.37+
Reported-by: Joel Fernandes <agnel.joel@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
commit d39cdd2036 ("tracing: Make tracer_flags use the right set_flag
callback") introduces a potential mutex deadlock issue, as it forgets to
free the mutex when allocaing the tracer_flags gets fail.
The issue was found by Dan Carpenter through Smatch static code check tool.
Link: http://lkml.kernel.org/r/1457958941-30265-1-git-send-email-chuhu@redhat.com
Fixes: d39cdd2036 ("tracing: Make tracer_flags use the right set_flag callback")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently dynamic ftrace calls are updated any time
the ftrace_ops is un/registered. If we do this update
only when it's needed, we save lot of time for perf
system wide ftrace function sampling/counting.
The reason is that for system wide sampling/counting,
perf creates event for each cpu in the system.
Each event then registers separate copy of ftrace_ops,
which ends up in FTRACE_UPDATE_CALLS updates. On servers
with many cpus that means serious stall (240 cpus server):
Counting:
# time ./perf stat -e ftrace:function -a sleep 1
Performance counter stats for 'system wide':
370,663 ftrace:function
1.401427505 seconds time elapsed
real 3m51.743s
user 0m0.023s
sys 3m48.569s
Sampling:
# time ./perf record -e ftrace:function -a sleep 1
[ perf record: Woken up 0 times to write data ]
Warning:
Processed 141200 events and lost 5 chunks!
[ perf record: Captured and wrote 10.703 MB perf.data (135950 samples) ]
real 2m31.429s
user 0m0.213s
sys 2m29.494s
There's no reason to do the FTRACE_UPDATE_CALLS update
for each event in perf case, because all the ftrace_ops
always share the same filter, so the updated calls are
always the same.
It's required that only first ftrace_ops registration
does the FTRACE_UPDATE_CALLS update (also sometimes
the second if the first one used the trampoline), but
the rest can be only cheaply linked into the ftrace_ops
list.
Counting:
# time ./perf stat -e ftrace:function -a sleep 1
Performance counter stats for 'system wide':
398,571 ftrace:function
1.377503733 seconds time elapsed
real 0m2.787s
user 0m0.005s
sys 0m1.883s
Sampling:
# time ./perf record -e ftrace:function -a sleep 1
[ perf record: Woken up 0 times to write data ]
Warning:
Processed 261730 events and lost 9 chunks!
[ perf record: Captured and wrote 19.907 MB perf.data (256293 samples) ]
real 1m31.948s
user 0m0.309s
sys 1m32.051s
Link: http://lkml.kernel.org/r/1458138873-1553-6-git-send-email-jolsa@kernel.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change __ftrace_hash_rec_update to return true in case
we need to update dynamic ftrace call records. It return
false in case no update is needed.
Link: http://lkml.kernel.org/r/1458138873-1553-5-git-send-email-jolsa@kernel.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Redesign of cpufreq governors and the intel_pstate driver to
make them use callbacks invoked by the scheduler to trigger CPU
frequency evaluation instead of using per-CPU deferrable timers
for that purpose (Rafael Wysocki).
- Reorganization and cleanup of cpufreq governor code to make it
more straightforward and fix some concurrency problems in it
(Rafael Wysocki, Viresh Kumar).
- Cleanup and improvements of locking in the cpufreq core (Viresh
Kumar).
- Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
Kumar, Eric Biggers).
- intel_pstate driver updates including fixes, optimizations and a
modification to make it enable enable hardware-coordinated P-state
selection (HWP) by default if supported by the processor (Philippe
Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
Franciosi).
- Operating Performance Points (OPP) framework updates to improve
its handling of voltage regulators and device clocks and updates
of the cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).
- Updates of the powernv cpufreq driver to fix initialization
and cleanup problems in it and correct its worker thread handling
with respect to CPU offline, new powernv_throttle tracepoint
(Shilpasri Bhat).
- ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).
- ACPICA updates including one fix for a regression introduced
by previos changes in the ACPICA code (Bob Moore, Lv Zheng,
David Box, Colin Ian King).
- Support for installing ACPI tables from initrd (Lv Zheng).
- Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
Chaugule).
- Support for _HID(ACPI0010) devices (ACPI processor containers)
and ACPI processor driver cleanups (Sudeep Holla).
- Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
Aleksey Makarov).
- Modification of the ACPI PCI IRQ management code to make it treat
255 in the Interrupt Line register as "not connected" on x86 (as
per the specification) and avoid attempts to use that value as
a valid interrupt vector (Chen Fan).
- ACPI APEI fixes related to resource leaks (Josh Hunt).
- Removal of modularity from a few ACPI drivers (BGRT, GHES,
intel_pmic_crc) that cannot be built as modules in practice (Paul
Gortmaker).
- PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
as a valid resource type (Harb Abdulhamid).
- New device ID (future AMD I2C controller) in the ACPI driver for
AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).
- Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).
- cpuidle menu governor optimization to avoid a square root
computation in it (Rasmus Villemoes).
- Fix for potential use-after-free in the generic device properties
framework (Heikki Krogerus).
- Updates of the generic power domains (genpd) framework including
support for multiple power states of a domain, fixes and debugfs
output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
Geert Uytterhoeven).
- Intel RAPL power capping driver updates to reduce IPI overhead in
it (Jacob Pan).
- System suspend/hibernation code cleanups (Eric Biggers, Saurabh
Sengar).
- Year 2038 fix for the process freezer (Abhilash Jindal).
- turbostat utility updates including new features (decoding of more
registers and CPUID fields, sub-second intervals support, GFX MHz
and RC6 printout, --out command line option), fixes (syscall jitter
detection and workaround, reductioin of the number of syscalls made,
fixes related to Xeon x200 processors, compiler warning fixes) and
cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJW50NXAAoJEILEb/54YlRxvr8QAIktC9+ft0y5AmU46hDcBWcK
QutyWJL9X9BS6DWBJZA2qclDYFmhMfi5Fza1se0gQ9TnLB/KrBwHWLsiYoTsb1k+
nPKf214aPk+qAhkVuyB4leNWML9Qz9n9jwku/EYxWWpgtbSRf3+0ioIKZeWWc/8V
JvuaOu4O+g/tkmL7QTrnGWBwhIIssAAV85QPsHkx+g68MrCj4UMMzm7z9G21SPXX
bmP8yIHsczX/XnRsY0W2NSno7Vdk6ImHpDJ26IAZg28WRNPWICHgGYHvB0TTWMvb
tts+yqfF7/7QLRjT/M8k9CzDBDE/DnVqoZ0fNJ+aYr7hNKF32mtAN+jH9ZB9dl/P
fEFapJkPxnWyzAoVoB9Dz0rkcZkYMlbxlLWzUGpaPq0JflUUTzLk0ApSjmMn4HRO
UddwCDdyHTaYThp3gn6GbOb0pIP0SdOVbI1M2QV2x/4PLcT2Ft8Np1+1RFWOeinZ
Bdl9AE890big0808mqbBzw/buETwr9FjHtCdDPXpP0vJpkBLu3nIYRNb0LCt39es
mWMp6dFhGgvGj3D3ahTuV3GI8hdpDkh9SObexa11RCjkTKrXcwEmFxHxLeFXwKYq
alG278bo6cSChRMziS1lis+W/3tsJRN4TXUSv1PPzJHrFgptQVFRStU9ngBKP+pN
WB+itPc4Fw0YHOrAFsrx
=cfty
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael Wysocki:
"This time the majority of changes go into cpufreq and they are
significant.
First off, the way CPU frequency updates are triggered is different
now. Instead of having to set up and manage a deferrable timer for
each CPU in the system to evaluate and possibly change its frequency
periodically, cpufreq governors set up callbacks to be invoked by the
scheduler on a regular basis (basically on utilization updates). The
"old" governors, "ondemand" and "conservative", still do all of their
work in process context (although that is triggered by the scheduler
now), but intel_pstate does it all in the callback invoked by the
scheduler with no need for any additional asynchronous processing.
Of course, this eliminates the overhead related to the management of
all those timers, but also it allows the cpufreq governor code to be
simplified quite a bit. On top of that, the common code and data
structures used by the "ondemand" and "conservative" governors are
cleaned up and made more straightforward and some long-standing and
quite annoying problems are addressed. In particular, the handling of
governor sysfs attributes is modified and the related locking becomes
more fine grained which allows some concurrency problems to be avoided
(particularly deadlocks with the core cpufreq code).
In principle, the new mechanism for triggering frequency updates
allows utilization information to be passed from the scheduler to
cpufreq. Although the current code doesn't make use of it, in the
works is a new cpufreq governor that will make decisions based on the
scheduler's utilization data. That should allow the scheduler and
cpufreq to work more closely together in the long run.
In addition to the core and governor changes, cpufreq drivers are
updated too. Fixes and optimizations go into intel_pstate, the
cpufreq-dt driver is updated on top of some modification in the
Operating Performance Points (OPP) framework and there are fixes and
other updates in the powernv cpufreq driver.
Apart from the cpufreq updates there is some new ACPICA material,
including a fix for a problem introduced by previous ACPICA updates,
and some less significant changes in the ACPI code, like CPPC code
optimizations, ACPI processor driver cleanups and support for loading
ACPI tables from initrd.
Also updated are the generic power domains framework, the Intel RAPL
power capping driver and the turbostat utility and we have a bunch of
traditional assorted fixes and cleanups.
Specifics:
- Redesign of cpufreq governors and the intel_pstate driver to make
them use callbacks invoked by the scheduler to trigger CPU
frequency evaluation instead of using per-CPU deferrable timers for
that purpose (Rafael Wysocki).
- Reorganization and cleanup of cpufreq governor code to make it more
straightforward and fix some concurrency problems in it (Rafael
Wysocki, Viresh Kumar).
- Cleanup and improvements of locking in the cpufreq core (Viresh
Kumar).
- Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
Kumar, Eric Biggers).
- intel_pstate driver updates including fixes, optimizations and a
modification to make it enable enable hardware-coordinated P-state
selection (HWP) by default if supported by the processor (Philippe
Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
Franciosi).
- Operating Performance Points (OPP) framework updates to improve its
handling of voltage regulators and device clocks and updates of the
cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).
- Updates of the powernv cpufreq driver to fix initialization and
cleanup problems in it and correct its worker thread handling with
respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
Bhat).
- ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).
- ACPICA updates including one fix for a regression introduced by
previos changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
Colin Ian King).
- Support for installing ACPI tables from initrd (Lv Zheng).
- Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
Chaugule).
- Support for _HID(ACPI0010) devices (ACPI processor containers) and
ACPI processor driver cleanups (Sudeep Holla).
- Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
Aleksey Makarov).
- Modification of the ACPI PCI IRQ management code to make it treat
255 in the Interrupt Line register as "not connected" on x86 (as
per the specification) and avoid attempts to use that value as a
valid interrupt vector (Chen Fan).
- ACPI APEI fixes related to resource leaks (Josh Hunt).
- Removal of modularity from a few ACPI drivers (BGRT, GHES,
intel_pmic_crc) that cannot be built as modules in practice (Paul
Gortmaker).
- PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
as a valid resource type (Harb Abdulhamid).
- New device ID (future AMD I2C controller) in the ACPI driver for
AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).
- Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).
- cpuidle menu governor optimization to avoid a square root
computation in it (Rasmus Villemoes).
- Fix for potential use-after-free in the generic device properties
framework (Heikki Krogerus).
- Updates of the generic power domains (genpd) framework including
support for multiple power states of a domain, fixes and debugfs
output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
Geert Uytterhoeven).
- Intel RAPL power capping driver updates to reduce IPI overhead in
it (Jacob Pan).
- System suspend/hibernation code cleanups (Eric Biggers, Saurabh
Sengar).
- Year 2038 fix for the process freezer (Abhilash Jindal).
- turbostat utility updates including new features (decoding of more
registers and CPUID fields, sub-second intervals support, GFX MHz
and RC6 printout, --out command line option), fixes (syscall jitter
detection and workaround, reductioin of the number of syscalls
made, fixes related to Xeon x200 processors, compiler warning
fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"
* tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
tools/power turbostat: bugfix: TDP MSRs print bits fixing
tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
tools/power turbostat: call __cpuid() instead of __get_cpuid()
tools/power turbostat: indicate SMX and SGX support
tools/power turbostat: detect and work around syscall jitter
tools/power turbostat: show GFX%rc6
tools/power turbostat: show GFXMHz
tools/power turbostat: show IRQs per CPU
tools/power turbostat: make fewer systems calls
tools/power turbostat: fix compiler warnings
tools/power turbostat: add --out option for saving output in a file
tools/power turbostat: re-name "%Busy" field to "Busy%"
tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
tools/power turbostat: allow sub-sec intervals
ACPI / APEI: ERST: Fixed leaked resources in erst_init
ACPI / APEI: Fix leaked resources
intel_pstate: Do not skip samples partially
intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
...
Pull perf updates from Ingo Molnar:
"Main kernel side changes:
- Big reorganization of the x86 perf support code. The old code grew
organically deep inside arch/x86/kernel/cpu/perf* and its naming
became somewhat messy.
The new location is under arch/x86/events/, using the following
cleaner hierarchy of source code files:
perf/x86: Move perf_event.c .................. => x86/events/core.c
perf/x86: Move perf_event_amd.c .............. => x86/events/amd/core.c
perf/x86: Move perf_event_amd_ibs.c .......... => x86/events/amd/ibs.c
perf/x86: Move perf_event_amd_iommu.[ch] ..... => x86/events/amd/iommu.[ch]
perf/x86: Move perf_event_amd_uncore.c ....... => x86/events/amd/uncore.c
perf/x86: Move perf_event_intel_bts.c ........ => x86/events/intel/bts.c
perf/x86: Move perf_event_intel.c ............ => x86/events/intel/core.c
perf/x86: Move perf_event_intel_cqm.c ........ => x86/events/intel/cqm.c
perf/x86: Move perf_event_intel_cstate.c ..... => x86/events/intel/cstate.c
perf/x86: Move perf_event_intel_ds.c ......... => x86/events/intel/ds.c
perf/x86: Move perf_event_intel_lbr.c ........ => x86/events/intel/lbr.c
perf/x86: Move perf_event_intel_pt.[ch] ...... => x86/events/intel/pt.[ch]
perf/x86: Move perf_event_intel_rapl.c ....... => x86/events/intel/rapl.c
perf/x86: Move perf_event_intel_uncore.[ch] .. => x86/events/intel/uncore.[ch]
perf/x86: Move perf_event_intel_uncore_nhmex.c => x86/events/intel/uncore_nmhex.c
perf/x86: Move perf_event_intel_uncore_snb.c => x86/events/intel/uncore_snb.c
perf/x86: Move perf_event_intel_uncore_snbep.c => x86/events/intel/uncore_snbep.c
perf/x86: Move perf_event_knc.c .............. => x86/events/intel/knc.c
perf/x86: Move perf_event_p4.c ............... => x86/events/intel/p4.c
perf/x86: Move perf_event_p6.c ............... => x86/events/intel/p6.c
perf/x86: Move perf_event_msr.c .............. => x86/events/msr.c
(Borislav Petkov)
- Update various x86 PMU constraint and hw support details (Stephane
Eranian)
- Optimize kprobes for BPF execution (Martin KaFai Lau)
- Rewrite, refactor and fix the Intel uncore PMU driver code (Thomas
Gleixner)
- Rewrite, refactor and fix the Intel RAPL PMU code (Thomas Gleixner)
- Various fixes and smaller cleanups.
There are lots of perf tooling updates as well. A few highlights:
perf report/top:
- Hierarchy histogram mode for 'perf top' and 'perf report',
showing multiple levels, one per --sort entry: (Namhyung Kim)
On a mostly idle system:
# perf top --hierarchy -s comm,dso
Then expand some levels and use 'P' to take a snapshot:
# cat perf.hist.0
- 92.32% perf
58.20% perf
22.29% libc-2.22.so
5.97% [kernel]
4.18% libelf-0.165.so
1.69% [unknown]
- 4.71% qemu-system-x86
3.10% [kernel]
1.60% qemu-system-x86_64 (deleted)
+ 2.97% swapper
#
- Add 'L' hotkey to dynamicly set the percent threshold for
histogram entries and callchains, i.e. dynamicly do what the
--percent-limit command line option to 'top' and 'report' does.
(Namhyung Kim)
perf mem:
- Allow specifying events via -e in 'perf mem record', also listing
what events can be specified via 'perf mem record -e list' (Jiri
Olsa)
perf record:
- Add 'perf record' --all-user/--all-kernel options, so that one
can tell that all the events in the command line should be
restricted to the user or kernel levels (Jiri Olsa), i.e.:
perf record -e cycles:u,instructions:u
is equivalent to:
perf record --all-user -e cycles,instructions
- Make 'perf record' collect CPU cache info in the perf.data file header:
$ perf record usleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data (7 samples) ]
$ perf report --header-only -I | tail -10 | head -8
# CPU cache info:
# L1 Data 32K [0-1]
# L1 Instruction 32K [0-1]
# L1 Data 32K [2-3]
# L1 Instruction 32K [2-3]
# L2 Unified 256K [0-1]
# L2 Unified 256K [2-3]
# L3 Unified 4096K [0-3]
Will be used in 'perf c2c' and eventually in 'perf diff' to
allow, for instance running the same workload in multiple
machines and then when using 'diff' show the hardware difference.
(Jiri Olsa)
- Improved support for Java, using the JVMTI agent library to do
jitdumps that then will be inserted in synthesized
PERF_RECORD_MMAP2 events via 'perf inject' pointed to synthesized
ELF files stored in ~/.debug and keyed with build-ids, to allow
symbol resolution and even annotation with source line info, see
the changeset comments to see how to use it (Stephane Eranian)
perf script/trace:
- Decode data_src values (e.g. perf.data files generated by 'perf
mem record') in 'perf script': (Jiri Olsa)
# perf script
perf 693 [1] 4.088652: 1 cpu/mem-loads,ldlat=30/P: ffff88007d0b0f40 68100142 L1 hit|SNP None|TLB L1 or L2 hit|LCK No <SNIP>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Improve support to 'data_src', 'weight' and 'addr' fields in
'perf script' (Jiri Olsa)
- Handle empty print fmts in 'perf script -s' i.e. when running
python or perl scripts (Taeung Song)
perf stat:
- 'perf stat' now shows shadow metrics (insn per cycle, etc) in
interval mode too. E.g:
# perf stat -I 1000 -e instructions,cycles sleep 1
# time counts unit events
1.000215928 519,620 instructions # 0.69 insn per cycle
1.000215928 752,003 cycles
<SNIP>
- Port 'perf kvm stat' to PowerPC (Hemant Kumar)
- Implement CSV metrics output in 'perf stat' (Andi Kleen)
perf BPF support:
- Support converting data from bpf events in 'perf data' (Wang Nan)
- Print bpf-output events in 'perf script': (Wang Nan).
# perf record -e bpf-output/no-inherit,name=evt/ -e ./test_bpf_output_3.c/map:channel.event=evt/ usleep 1000
# perf script
usleep 4882 21384.532523: evt: ffffffff810e97d1 sys_nanosleep ([kernel.kallsyms])
BPF output: 0000: 52 61 69 73 65 20 61 20 Raise a
0008: 42 50 46 20 65 76 65 6e BPF even
0010: 74 21 00 00 t!..
BPF string: "Raise a BPF event!"
#
- Add API to set values of map entries in a BPF object, be it
individual map slots or ranges (Wang Nan)
- Introduce support for the 'bpf-output' event (Wang Nan)
- Add glue to read perf events in a BPF program (Wang Nan)
- Improve support for bpf-output events in 'perf trace' (Wang Nan)
... and tons of other changes as well - see the shortlog and git log
for details!"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (342 commits)
perf stat: Add --metric-only support for -A
perf stat: Implement --metric-only mode
perf stat: Document CSV format in manpage
perf hists browser: Check sort keys before hot key actions
perf hists browser: Allow thread filtering for comm sort key
perf tools: Add sort__has_comm variable
perf tools: Recalc total periods using top-level entries in hierarchy
perf tools: Remove nr_sort_keys field
perf hists browser: Cleanup hist_browser__fprintf_hierarchy_entry()
perf tools: Remove hist_entry->fmt field
perf tools: Fix command line filters in hierarchy mode
perf tools: Add more sort entry check functions
perf tools: Fix hist_entry__filter() for hierarchy
perf jitdump: Build only on supported archs
tools lib traceevent: Add '~' operation within arg_num_eval()
perf tools: Omit unnecessary cast in perf_pmu__parse_scale
perf tools: Pass perf_hpp_list all the way through setup_sort_list
perf tools: Fix perf script python database export crash
perf jitdump: DWARF is also needed
perf bench mem: Prepare the x86-64 build for upstream memcpy_mcsafe() changes
...
* pm-cpufreq: (94 commits)
intel_pstate: Do not skip samples partially
intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
intel_pstate: Optimize calculation for max/min_perf_adj
intel_pstate: Remove extra conversions in pid calculation
cpufreq: Move scheduler-related code to the sched directory
Revert "cpufreq: postfix policy directory with the first CPU in related_cpus"
cpufreq: Reduce cpufreq_update_util() overhead a bit
cpufreq: Select IRQ_WORK if CPU_FREQ_GOV_COMMON is set
cpufreq: Remove 'policy->governor_enabled'
cpufreq: Rename __cpufreq_governor() to cpufreq_governor()
cpufreq: Relocate handle_update() to kill its declaration
cpufreq: governor: Drop unnecessary checks from show() and store()
cpufreq: governor: Fix race in dbs_update_util_handler()
cpufreq: governor: Make gov_set_update_util() static
cpufreq: governor: Narrow down the dbs_data_mutex coverage
cpufreq: governor: Make dbs_data_mutex static
cpufreq: governor: Relocate definitions of tuners structures
cpufreq: governor: Move per-CPU data to the common code
cpufreq: governor: Make governor private data per-policy
...
if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.
Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
echo nop > /sys/kernel/debug/tracing/options/current_tracer
echo 1 > /sys/kernel/debug/tracing/options/test_nop_accept
echo 0 > /sys/kernel/debug/tracing/options/test_nop_accept
echo 1 > /sys/kernel/debug/tracing/options/test_nop_refuse
Before the fix, the dmesg is a bit ugly since a align issue.
[ 191.973081] nop_test_accept flag set to 1: we accept. Now cat trace_options to see the result
[ 195.156942] nop_test_refuse flag set to 1: we refuse.Now cat trace_options to see the result
After the fix, the dmesg will show aligned log for nop_test_refuse and nop_test_accept.
[ 2718.032413] nop_test_refuse flag set to 1: we refuse. Now cat trace_options to see the result
[ 2734.253360] nop_test_accept flag set to 1: we accept. Now cat trace_options to see the result
Link: http://lkml.kernel.org/r/1457444222-8654-2-git-send-email-chuhu@redhat.com
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
gcc isn't known for handling bool in structures. Instead of using bool, use
an integer mask and use bit flags instead.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a new unreg_all() callback that can be used to remove all
command-specific triggers from an event and arrange to have it called
whenever a trigger file is opened with O_TRUNC set.
Commands that don't want truncate semantics, or existing commands that
don't implement this function simply do nothing and their triggers
remain intact.
Link: http://lkml.kernel.org/r/2b7d62854d01f28c19185e1bbb8f826f385edfba.1449767187.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a new needs_rec flag for triggers that require unconditional
access to trace records in order to function.
Normally a trigger requires access to the contents of a trace record
only if it has a filter associated with it (since filters need the
contents of a record in order to make a filtering decision). Some
types of triggers, such as 'hist' triggers, require access to trace
record contents independent of the presence of filters, so add a new
flag for those triggers.
Link: http://lkml.kernel.org/r/7be8fa38f9b90fdb6c47ca0f98d20a07b9fd512b.1449767187.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a simple per-trigger 'paused' flag, allowing individual triggers
to pause. We could leave it to individual triggers that need this
functionality to do it themselves, but we also want to allow other
events to control pausing, so add it to the trigger data.
Link: http://lkml.kernel.org/r/fed37e4879684d7dcc57fe00ce0cbf170032b06d.1449767187.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some triggers may need access to the trace event, so pass it in. Also
fix up the existing trigger funcs and their callers.
Link: http://lkml.kernel.org/r/543e31e9fc445ef61077421ab219033401c39846.1449767187.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make various event trigger utility functions available outside of
trace_events_trigger.c so that new triggers can be defined outside of
that file.
Link: http://lkml.kernel.org/r/4a40c1695dd43cac6cd475d72e13ffe30ba84bff.1449767187.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make is_string_field() and is_function_field() accessible outside of
trace_event_filters.c for other users of ftrace_event_fields.
Link: http://lkml.kernel.org/r/2d3f00d3311702e556e82eed7754bae6f017939f.1449767187.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When I was updating the ftrace_stress test of ltp. I encountered
a strange phenomemon, excute following steps:
echo nop > /sys/kernel/debug/tracing/current_tracer
echo 0 > /sys/kernel/debug/tracing/options/funcgraph-cpu
bash: echo: write error: Invalid argument
check dmesg:
[ 1024.903855] nop_test_refuse flag set to 0: we refuse.Now cat trace_options to see the result
The reason is that the trace option test will randomly setup trace
option under tracing/options no matter what the current_tracer is.
but the set_tracer_option is always using the set_flag callback
from the current_tracer. This patch adds a pointer to tracer_flags
and make it point to the tracer it belongs to. When the option is
setup, the set_flag of the right tracer will be used no matter
what the the current_tracer is.
And the old dummy_tracer_flags is used for all the tracers which
doesn't have a tracer_flags, having issue to use it to save the
pointer of a tracer. So remove it and use dynamic dummy tracer_flags
for tracers needing a dummy tracer_flags, as a result, there are no
tracers sharing tracer_flags, so remove the check code.
And save the current tracer to trace_option_dentry seems not good as
it may waste mem space when mount the debug/trace fs more than one time.
Link: http://lkml.kernel.org/r/1457444222-8654-1-git-send-email-chuhu@redhat.com
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
[ Fixed up function tracer options to work with the change ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
a tasks "comm" field. But this prevented filtering on a comm field that
is within a trace event (like sched_migrate_task).
When trying to filter on when a program migrated, this change prevented
the filtering of the sched_migrate_task.
To fix this, the event fields are examined first, and then the extra fields
like "comm" and "cpu" are examined. Also, instead of testing to assign
the comm filter function based on the field's name, the generic comm field
is given a new filter type (FILTER_COMM). When this field is used to filter
the type is checked. The same is done for the cpu filter field.
Two new special filter types are added: "COMM" and "CPU". This allows users
to still filter the tasks comm for events that have "comm" as one of their
fields, in cases that users would like to filter sched_migrate_task on the
comm of the task that called the event, and not the comm of the task that
is being migrated.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJW2argAAoJEKKk/i67LK/8b78H/32nYPizDIsK/p2bL1mgbtMl
vrkcfb+maPOC7cjB+CdQmyV4EIVpSn06XFouYghGprdoVocVyBuIflxn0j3Gbymy
zLCg8lR70KTATTqst1wsWMbnh+UvAKNEiXj8jf2qcK2xhgalXMDwsTC4+LDlLugu
YAx89lmsjK1YpP/wIzMww2jQG+07Nhm9gHWXF2MC3egZ+sgYxARnfds0yTcGgS8o
dc/yJGZDCI44JMDNThcCFxNvsmoTa9tpm+JNe2YTht6KCympa+Ht9Jj9MMlD06cq
M5CqMQlok+mrVsW5LbJPCk1u83ynr6d/PcPQuT2nykRx8bGvKjA7AKMPaxw1Jz4=
=ixBz
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"A feature was added in 4.3 that allowed users to filter trace points
on a tasks "comm" field. But this prevented filtering on a comm field
that is within a trace event (like sched_migrate_task).
When trying to filter on when a program migrated, this change
prevented the filtering of the sched_migrate_task.
To fix this, the event fields are examined first, and then the extra
fields like "comm" and "cpu" are examined. Also, instead of testing
to assign the comm filter function based on the field's name, the
generic comm field is given a new filter type (FILTER_COMM). When
this field is used to filter the type is checked. The same is done
for the cpu filter field.
Two new special filter types are added: "COMM" and "CPU". This allows
users to still filter the tasks comm for events that have "comm" as
one of their fields, in cases that users would like to filter
sched_migrate_task on the comm of the task that called the event, and
not the comm of the task that is being migrated"
* tag 'trace-fixes-v4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Do not have 'comm' filter override event 'comm' field
Commit 9f61668073 "tracing: Allow triggers to filter for CPU ids and
process names" added a 'comm' filter that will filter events based on the
current tasks struct 'comm'. But this now hides the ability to filter events
that have a 'comm' field too. For example, sched_migrate_task trace event.
That has a 'comm' field of the task to be migrated.
echo 'comm == "bash"' > events/sched_migrate_task/filter
will now filter all sched_migrate_task events for tasks named "bash" that
migrates other tasks (in interrupt context), instead of seeing when "bash"
itself gets migrated.
This fix requires a couple of changes.
1) Change the look up order for filter predicates to look at the events
fields before looking at the generic filters.
2) Instead of basing the filter function off of the "comm" name, have the
generic "comm" filter have its own filter_type (FILTER_COMM). Test
against the type instead of the name to assign the filter function.
3) Add a new "COMM" filter that works just like "comm" but will filter based
on the current task, even if the trace event contains a "comm" field.
Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".
Cc: stable@vger.kernel.org # v4.3+
Fixes: 9f61668073 "tracing: Allow triggers to filter for CPU ids and process names"
Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some tracepoint have multiple fields with the same name, "nr", the first
one is a unique syscall ID, the other is a syscall argument:
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_io_getevents/format
name: sys_enter_io_getevents
ID: 747
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int nr; offset:8; size:4; signed:1;
field:aio_context_t ctx_id; offset:16; size:8; signed:0;
field:long min_nr; offset:24; size:8; signed:0;
field:long nr; offset:32; size:8; signed:0;
field:struct io_event * events; offset:40; size:8; signed:0;
field:struct timespec * timeout; offset:48; size:8; signed:0;
print fmt: "ctx_id: 0x%08lx, min_nr: 0x%08lx, nr: 0x%08lx, events: 0x%08lx, timeout: 0x%08lx", ((unsigned long)(REC->ctx_id)), ((unsigned long)(REC->min_nr)), ((unsigned long)(REC->nr)), ((unsigned long)(REC->events)), ((unsigned long)(REC->timeout))
#
Fix it by renaming the "/format" common tracepoint field "nr" to "__syscall_nr".
Signed-off-by: Taeung Song <treeze.taeung@gmail.com>
[ Do not rename the struct member, just the '/format' field name ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160226132301.3ae065a4@gandalf.local.home
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When perf added a "reg" function to the function tracing event (not a
tracepoint), it caused that event to be displayed as a tracepoint and
could cause errors in tracepoint handling. That was solved by adding a
flag to ignore ftrace non-tracepoint events. But that flag was missed
when displaying events in available_events, which should only contain
tracepoint events.
This broke a documented way to enable all events with:
cat available_events > set_event
As the function non-tracepoint event would cause that to error out.
The commit here fixes that by having the available_events file not list
events that have the ignore flag set.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWzxLPAAoJEKKk/i67LK/8uAMH+gPedE1YbBmhFOgOEEyjADsV
AJ/XRkqHhNyjB5h05jvO9KeoaEL27E0unvRNIKFjeKoco1fqQ5USMXr1hdOpQKt6
X16478XGWiZI2w11Yp86UmhrzVFmRcCv1zSIhTFhidOIQVwu6aUSRA3DchaBXCez
nzQBgRUzsHLWhS1tRNtdnf6gRe6PSImjwmpyYETPzzatedAW4D1Xp+u+7z2uKXka
jJIaE90hYDCT08+6Q+QeD1RR2Uzh+E72ZhM5wCcnuwKd5BfxLcZs23PCIlZjX24k
Iu2hPeW1VVIN/mbqY04R4oAKRXJG8Wio8l1S5ldJTPReBCZLDK5N2O+LWPwIEIU=
=Ypbd
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v4.5-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Another small bug reported to me by Chunyu Hu.
When perf added a "reg" function to the function tracing event (not a
tracepoint), it caused that event to be displayed as a tracepoint and
could cause errors in tracepoint handling. That was solved by adding
a flag to ignore ftrace non-tracepoint events. But that flag was
missed when displaying events in available_events, which should only
contain tracepoint events.
This broke a documented way to enable all events with:
cat available_events > set_event
As the function non-tracepoint event would cause that to error out.
The commit here fixes that by having the available_events file not
list events that have the ignore flag set"
* tag 'trace-fixes-v4.5-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix showing function event in available_events
The ftrace:function event is only displayed for parsing the function tracer
data. It is not used to enable function tracing, and does not include an
"enable" file in its event directory.
Originally, this event was kept separate from other events because it did
not have a ->reg parameter. But perf added a "reg" parameter for its use
which caused issues, because it made the event available to functions where
it was not compatible for.
Commit 9b63776fa3 "tracing: Do not enable function event with enable"
added a TRACE_EVENT_FL_IGNORE_ENABLE flag that prevented the function event
from being enabled by normal trace events. But this commit missed keeping
the function event from being displayed by the "available_events" directory,
which is used to show what events can be enabled by set_event.
One documented way to enable all events is to:
cat available_events > set_event
But because the function event is displayed in the available_events, this
now causes an INVALID error:
cat: write error: Invalid argument
Reported-by: Chunyu Hu <chuhu@redhat.com>
Fixes: 9b63776fa3 "tracing: Do not enable function event with enable"
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Conflicts:
drivers/net/phy/bcm7xxx.c
drivers/net/phy/marvell.c
drivers/net/vxlan.c
All three conflicts were cases of simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
One is by Yang Shi who added a READ_ONCE_NOCHECK() to the scan of the
stack made by the stack tracer. As the stack tracer scans the entire
kernel stack, KASAN triggers seeing it as a "stack out of bounds" error.
As the scan is looking at the contents of the stack from parent functions.
The NOCHECK() tells KASAN that this is done on purpose, and is not some
kind of stack overflow.
The second fix is to the ftrace selftests, to retrieve the PID of executed
commands from the shell with "$!" and not by parsing "jobs".
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWyycyAAoJEKKk/i67LK/8ALoH/RkMZ7Cih7vXb30wB13xSNrB
6o4ApuC4YOS9Un/4ruCXb+cGbW2LJLHkEU2ageoHLOZMvwuAM7iQ6fTUW1KxCRP2
ECvqyqi0ZRyoi/CibxVVH9hHEAJzUTwok67nkLeZBqIN9Fglcfd7toAwgcrH3y59
Pybyv5CV2eaff5IKoLXKZJNRLdrVLeM7v4BvdI0dxEikhWZ0XsA0RoIaNfTPqyQJ
F6sJ/njdoMMJK4N8CCPxlvnvEOzn0DnJnfUNUQEj5J3YU9DbAHAACaBSg5oSh9oK
BcFYKV2GIzPku1cafutRRlErcGyB2yqv7bB8Eo86zXRHbeonaj4XGJmH276ldVg=
=srlj
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v4.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two more small fixes.
One is by Yang Shi who added a READ_ONCE_NOCHECK() to the scan of the
stack made by the stack tracer. As the stack tracer scans the entire
kernel stack, KASAN triggers seeing it as a "stack out of bounds"
error. As the scan is looking at the contents of the stack from
parent functions. The NOCHECK() tells KASAN that this is done on
purpose, and is not some kind of stack overflow.
The second fix is to the ftrace selftests, to retrieve the PID of
executed commands from the shell with '$!' and not by parsing 'jobs'"
* tag 'trace-fixes-v4.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing, kasan: Silence Kasan warning in check_stack of stack_tracer
ftracetest: Fix instance test to use proper shell command for pids
add new map type to store stack traces and corresponding helper
bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
@ctx: struct pt_regs*
@map: pointer to stack_trace map
@flags: bits 0-7 - numer of stack frames to skip
bit 8 - collect user stack instead of kernel
bit 9 - compare stacks by hash only
bit 10 - if two different stacks hash into the same stackid
discard old
other bits - reserved
Return: >= 0 stackid on success or negative error
stackid is a 32-bit integer handle that can be further combined with
other data (including other stackid) and used as a key into maps.
Userspace will access stackmap using standard lookup/delete syscall commands to
retrieve full stack trace for given stackid.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull livepatching fixes from Jiri Kosina:
- regression (from 4.4) fix for ordering issue, introduced by an
earlier ftrace change, that broke live patching of modules.
The fix replaces the ftrace module notifier by direct call in order
to make the ordering guaranteed and well-defined. The patch, from
Jessica Yu, has been acked both by Steven and Rusty
- error message fix from Miroslav Benes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
ftrace/module: remove ftrace module notifier
livepatch: change the error message in asm/livepatch.h header files
Remove the ftrace module notifier in favor of directly calling
ftrace_module_enable() and ftrace_release_mod() in the module loader.
Hard-coding the function calls directly in the module loader removes
dependence on the module notifier call chain and provides better
visibility and control over what gets called when, which is important
to kernel utilities such as livepatch.
This fixes a notifier ordering issue in which the ftrace module notifier
(and hence ftrace_module_enable()) for coming modules was being called
after klp_module_notify(), which caused livepatch modules to initialize
incorrectly. This patch removes dependence on the module notifier call
chain in favor of hard coding the corresponding function calls in the
module loader. This ensures that ftrace and livepatch code get called in
the correct order on patch module load and unload.
Fixes: 5156dca34a ("ftrace: Fix the race between ftrace and insmod")
Signed-off-by: Jessica Yu <jeyu@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch adds the powernv_throttle tracepoint to trace the CPU
frequency throttling event, which is used by the powernv-cpufreq
driver in POWER8.
Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Here's a simple fix to correct that issue.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWr9ZIAAoJEKKk/i67LK/8Zw4H/jTSM58YqENMrXLkfL1DbzXR
UsJM+tnJX1BjYDy57yAj3HXYYWKB9h+T9Fku4CMxzRqkFHA3Vu95YIJN8hpQ0fqT
R4/nvetq214bH27DNFuDHzBwVJL368De0Kcmqy83FB5G89G8JXoxiY6nvDkmQUIq
mzYU9duCbCRvXrOSDCVSVf/hVg71Ek/erZMVfYwSf56yy8ICOoiW8Fyv6kludBAu
/71ztEWPlIXJWijIQsH2fWsdOln7N/Ej5+9wtotSlbHtTuhJJi2xr817WwOLNUBN
HC5OM5K6mWqnLveZZLTp6o77Ap6BYw2vCyElvARt23Eywz3iUE1ZzeRGSPtQwI8=
=s+F8
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"A cleanup to the stack tracer broke stack tracing on s390. Here's a
simple fix to correct that issue"
* tag 'trace-v4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/stacktrace: Show entire trace if passed in function not found
Pull perf fixes from Thomas Gleixner:
"This is much bigger than typical fixes, but Peter found a category of
races that spurred more fixes and more debugging enhancements. Work
started before the merge window, but got finished only now.
Aside of that this contains the usual small fixes to perf and tools.
Nothing particular exciting"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
perf: Remove/simplify lockdep annotation
perf: Synchronously clean up child events
perf: Untangle 'owner' confusion
perf: Add flags argument to perf_remove_from_context()
perf: Clean up sync_child_event()
perf: Robustify event->owner usage and SMP ordering
perf: Fix STATE_EXIT usage
perf: Update locking order
perf: Remove __free_event()
perf/bpf: Convert perf_event_array to use struct file
perf: Fix NULL deref
perf/x86: De-obfuscate code
perf/x86: Fix uninitialized value usage
perf: Fix race in perf_event_exit_task_context()
perf: Fix orphan hole
perf stat: Do not clean event's private stats
perf hists: Fix HISTC_MEM_DCACHELINE width setting
perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
perf tests: Remove wrong semicolon in while loop in CQM test
perf: Synchronously free aux pages in case of allocation failure
...
When a max stack trace is discovered, the stack dump is saved. In order to
not record the overhead of the stack tracer, the ip of the traced function
is looked for within the dump. The trace is started from the location of
that function. But if for some reason the ip is not found, the entire stack
trace is then truncated. That's not very useful. Instead, print everything
if the ip of the traced function is not found within the trace.
This issue showed up on s390.
Link: http://lkml.kernel.org/r/20160129102241.1b3c9c04@gandalf.local.home
Fixes: 72ac426a5b ("tracing: Clean up stack tracing and fix fentry updates")
Cc: stable@vger.kernel.org # v4.3+
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The first is a cut and paste issue that changed the amount of stack
to skip when tracing a stack dump from 0 to 6, which basically made
the stack disappear for small stack traces.
The second fix is just removing an unused field in a struct that is no
longer used, and currently just wastes space.
The third is another cut-and-paste fix that had a tracepoint recording
the wrong field (it was recording the previous field a second time).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWqibPAAoJEKKk/i67LK/8/NkH/3M6WB7RIiMMd4O403imbKcs
yIH0j9vH6Z5hwoAUUr0bEw+gHVgzsiRky5z+fP0f1J3QdVAdgEig6RgQtIbWRynu
i7fohNAiSMBob0wOIHTohQDKkQjvgoO9gO5S8nY6Axgpf4iqOTy3RF2a/gcltULY
qdgy9A0vLk6yMbP6c0P+kEzg4y+Q90DsUh8YzQKW7F1EJPneDmNdug3VM16gefTR
4yrodSBHxr8NV3kAhN8G7FjWmK5cBDFwD66vsti64mKVCW00hjYRCQ+5BrgQ7h0V
EDC7kHisckLb415SQxe8XdF4fKbfE1PuQYZhjTo02hx9XCMeyxDWbjTF2PrZCHw=
=gab6
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull minor tracing fixes from Steven Rostedt:
"This includes three minor fixes, mostly due to cut-and-paste issues.
The first is a cut and paste issue that changed the amount of stack to
skip when tracing a stack dump from 0 to 6, which basically made the
stack disappear for small stack traces.
The second fix is just removing an unused field in a struct that is no
longer used, and currently just wastes space.
The third is another cut-and-paste fix that had a tracepoint recording
the wrong field (it was recording the previous field a second time)"
* tag 'trace-v4.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/dma-buf/fence: Fix timeline str value on fence_annotate_wait_on
ftrace: Remove unused nr_trampolines var
tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()
minor fixes.
Here's what else is new:
o A new TRACE_EVENT_FN_COND macro, combining both _FN and _COND for
those that want both.
o New selftest to test the instance create and delete
o Better debug output when ftrace fails
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWlU8tAAoJEKKk/i67LK/8JckH/2XIhjwMunm35uCg1308sDqy
d44G3+p0pm8ztjBf8iD8wH2nP3m7z+nC8JBmSPIUgAHsKOYHWsBy2A/36OVWv5lK
1hVXvBwOuZXnyWXr7bC2RO9S9f9acSFaabZXWDi1BCJRJSgEcknz32V7ZAL4jOCO
SfBWBNrWJfUsURbfbElfVxPLArvyUg9Bb5dW5B+QFf6PuoJaORYzNLYXHlbsq++T
WlrlnD+mFZ/DKFZ/gl3FMSGMPaGimw09/3eqMzv/tLQobp6PbCWlJTwjUoxJ/9dO
XOY4sWUrUUZilU8qCk0i0ZSEumWmE+SWS3eq+Ef18B/5haIj/LkoM4UQD3h2Rc4=
=FDR+
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Not much new with tracing for this release. Mostly just clean ups and
minor fixes.
Here's what else is new:
- A new TRACE_EVENT_FN_COND macro, combining both _FN and _COND for
those that want both.
- New selftest to test the instance create and delete
- Better debug output when ftrace fails"
* tag 'trace-v4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (24 commits)
ftrace: Fix the race between ftrace and insmod
ftrace: Add infrastructure for delayed enabling of module functions
x86: ftrace: Fix the comments for ftrace_modify_code_direct()
tracing: Fix comment to use tracing_on over tracing_enable
metag: ftrace: Fix the comments for ftrace_modify_code
sh: ftrace: Fix the comments for ftrace_modify_code()
ia64: ftrace: Fix the comments for ftrace_modify_code()
ftrace: Clean up ftrace_module_init() code
ftrace: Join functions ftrace_module_init() and ftrace_init_module()
tracing: Introduce TRACE_EVENT_FN_COND macro
tracing: Use seq_buf_used() in seq_buf_to_user() instead of len
bpf: Constify bpf_verifier_ops structure
ftrace: Have ftrace_ops_get_func() handle RCU and PER_CPU flags too
ftrace: Remove use of control list and ops
ftrace: Fix output of enabled_functions for showing tramp
ftrace: Fix a typo in comment
ftrace: Show all tramps registered to a record on ftrace_bug()
ftrace: Add variable ftrace_expected for archs to show expected code
ftrace: Add new type to distinguish what kind of ftrace_bug()
tracing: Update cond flag when enabling or disabling a trigger
...
Pull misc vfs updates from Al Viro:
"All kinds of stuff. That probably should've been 5 or 6 separate
branches, but by the time I'd realized how large and mixed that bag
had become it had been too close to -final to play with rebasing.
Some fs/namei.c cleanups there, memdup_user_nul() introduction and
switching open-coded instances, burying long-dead code, whack-a-mole
of various kinds, several new helpers for ->llseek(), assorted
cleanups and fixes from various people, etc.
One piece probably deserves special mention - Neil's
lookup_one_len_unlocked(). Similar to lookup_one_len(), but gets
called without ->i_mutex and tries to avoid ever taking it. That, of
course, means that it's not useful for any directory modifications,
but things like getting inode attributes in nfds readdirplus are fine
with that. I really should've asked for moratorium on lookup-related
changes this cycle, but since I hadn't done that early enough... I
*am* asking for that for the coming cycle, though - I'm going to try
and get conversion of i_mutex to rwsem with ->lookup() done under lock
taken shared.
There will be a patch closer to the end of the window, along the lines
of the one Linus had posted last May - mechanical conversion of
->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
inode_is_locked()/inode_lock_nested(). To quote Linus back then:
-----
| This is an automated patch using
|
| sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
| sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
| sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[ ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
| sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
| sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
|
| with a very few manual fixups
-----
I'm going to send that once the ->i_mutex-affecting stuff in -next
gets mostly merged (or when Linus says he's about to stop taking
merges)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
nfsd: don't hold i_mutex over userspace upcalls
fs:affs:Replace time_t with time64_t
fs/9p: use fscache mutex rather than spinlock
proc: add a reschedule point in proc_readfd_common()
logfs: constify logfs_block_ops structures
fcntl: allow to set O_DIRECT flag on pipe
fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
fs: xattr: Use kvfree()
[s390] page_to_phys() always returns a multiple of PAGE_SIZE
nbd: use ->compat_ioctl()
fs: use block_device name vsprintf helper
lib/vsprintf: add %*pg format specifier
fs: use gendisk->disk_name where possible
poll: plug an unused argument to do_poll
amdkfd: don't open-code memdup_user()
cdrom: don't open-code memdup_user()
rsxx: don't open-code memdup_user()
mtip32xx: don't open-code memdup_user()
[um] mconsole: don't open-code memdup_user_nul()
[um] hostaudio: don't open-code memdup_user()
...
We hit ftrace_bug report when booting Android on a 64bit ATOM SOC chip.
Basically, there is a race between insmod and ftrace_run_update_code.
After load_module=>ftrace_module_init, another thread jumps in to call
ftrace_run_update_code=>ftrace_arch_code_modify_prepare
=>set_all_modules_text_rw, to change all modules
as RW. Since the new module is at MODULE_STATE_UNFORMED, the text attribute
is not changed. Then, the 2nd thread goes ahead to change codes.
However, load_module continues to call complete_formation=>set_section_ro_nx,
then 2nd thread would fail when probing the module's TEXT.
The patch fixes it by using notifier to delay the enabling of ftrace
records to the time when module is at state MODULE_STATE_COMING.
Link: http://lkml.kernel.org/r/567CE628.3000609@intel.com
Signed-off-by: Qiu Peiyang <peiyangx.qiu@intel.com>
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Qiu Peiyang pointed out that there's a race when enabling function tracing
and loading a module. In order to make the modifications of converting nops
in the prologue of functions into callbacks, the text needs to be converted
from read-only to read-write. When enabling function tracing, the text
permission is updated, the functions are modified, and then they are put
back.
When loading a module, the updates to convert function calls to mcount is
done before the module text is set to read-only. But after it is done, the
module text is visible by the function tracer. Thus we have the following
race:
CPU 0 CPU 1
----- -----
start function tracing
set text to read-write
load_module
add functions to ftrace
set module text read-only
update all functions to callbacks
modify module functions too
< Can't it's read-only >
When this happens, ftrace detects the issue and disables itself till the
next reboot.
To fix this, a new DISABLED flag is added for ftrace records, which all
module functions get when they are added. Then later, after the module code
is all set, the records will have the DISABLED flag cleared, and they will
be enabled if any callback wants all functions to be traced.
Note, this doesn't add the delay to later. It simply changes the
ftrace_module_init() to do both the setting of DISABLED records, and then
immediately calls the enable code. This helps with testing this new code as
it has the same behavior as previously. Another change will come after this
to have the ftrace_module_enable() called after the text is set to
read-only.
Cc: Qiu Peiyang <peiyangx.qiu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When we do cat /sys/kernel/debug/tracing/printk_formats, we hit kernel
panic at t_show.
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 2957 Comm: sh Tainted: G W O 3.14.55-x86_64-01062-gd4acdc7 #2
RIP: 0010:[<ffffffff811375b2>]
[<ffffffff811375b2>] t_show+0x22/0xe0
RSP: 0000:ffff88002b4ebe80 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000004 RSI: ffffffff81fd26a6 RDI: ffff880032f9f7b1
RBP: ffff88002b4ebe98 R08: 0000000000001000 R09: 000000000000ffec
R10: 0000000000000000 R11: 000000000000000f R12: ffff880004d9b6c0
R13: 7365725f6d706400 R14: ffff880004d9b6c0 R15: ffffffff82020570
FS: 0000000000000000(0000) GS:ffff88003aa00000(0063) knlGS:00000000f776bc40
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000f6c02ff0 CR3: 000000002c2b3000 CR4: 00000000001007f0
Call Trace:
[<ffffffff811dc076>] seq_read+0x2f6/0x3e0
[<ffffffff811b749b>] vfs_read+0x9b/0x160
[<ffffffff811b7f69>] SyS_read+0x49/0xb0
[<ffffffff81a3a4b9>] ia32_do_call+0x13/0x13
---[ end trace 5bd9eb630614861e ]---
Kernel panic - not syncing: Fatal exception
When the first time find_next calls find_next_mod_format, it should
iterate the trace_bprintk_fmt_list to find the first print format of
the module. However in current code, start_index is smaller than *pos
at first, and code will not iterate the list. Latter container_of will
get the wrong address with former v, which will cause mod_fmt be a
meaningless object and so is the returned mod_fmt->fmt.
This patch will fix it by correcting the start_index. After fixed,
when the first time calls find_next_mod_format, start_index will be
equal to *pos, and code will iterate the trace_bprintk_fmt_list to
get the right module printk format, so is the returned mod_fmt->fmt.
Link: http://lkml.kernel.org/r/5684B900.9000309@intel.com
Cc: stable@vger.kernel.org # 3.12+
Fixes: 102c9323c3 "tracing: Add __tracepoint_string() to export string pointers"
Signed-off-by: Qiu Peiyang <peiyangx.qiu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A _lot_ of ->write() instances were open-coding it; some are
converted to memdup_user_nul(), a lot more remain...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The file tracing_enable is obsolete and does not exist anymore. Replace
the comment that references it with the proper tracing_on file.
Link: http://lkml.kernel.org/r/1450787141-45544-1-git-send-email-chuhu@redhat.com
Signed-off-by: Chuyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The start and end variables were only used when ftrace_module_init() was
split up into multiple functions. No need to keep them around after the
merger.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This bpf_verifier_ops structure is never modified, like the other
bpf_verifier_ops structures, so declare it as const.
Done with the help of Coccinelle.
Link: http://lkml.kernel.org/r/1449855359-13724-1-git-send-email-Julia.Lawall@lip6.fr
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Jiri Olsa noted that the change to replace the control_ops did not update
the trampoline for when running perf on a single CPU and with CONFIG_PREEMPT
disabled (where dynamic ops, like perf, can use trampolines directly). The
result was that perf function could be called when RCU is not watching as
well as not handle the ftrace_local_disable().
Modify the ftrace_ops_get_func() to also check the RCU and PER_CPU ops flags
and use the recursive function if they are set. The recursive function is
modified to check those flags and execute the appropriate checks if they are
set.
Link: http://lkml.kernel.org/r/20151201134213.GA14155@krava.brq.redhat.com
Reported-by: Jiri Olsa <jolsa@redhat.com>
Patch-fixed-up-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently perf has its own list function within the ftrace infrastructure
that seems to be used only to allow for it to have per-cpu disabling as well
as a check to make sure that it's not called while RCU is not watching. It
uses something called the "control_ops" which is used to iterate over ops
under it with the control_list_func().
The problem is that this control_ops and control_list_func unnecessarily
complicates the code. By replacing FTRACE_OPS_FL_CONTROL with two new flags
(FTRACE_OPS_FL_RCU and FTRACE_OPS_FL_PER_CPU) we can remove all the code
that is special with the control ops and add the needed checks within the
generic ftrace_list_func().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When showing all tramps registered to a ftrace record in the file
enabled_functions, it exits the loop with ops == NULL. But then it is
suppose to show the function on the ops->trampoline and
add_trampoline_func() is called with the given ops. But because ops is now
NULL (to exit the loop), it always shows the static trampoline instead of
the one that is really registered to the record.
The call to add_trampoline_func() that shows the trampoline for the given
ops needs to be called at every iteration.
Fixes: 39daa7b9e8 "ftrace: Show all tramps registered to a record on ftrace_bug()"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"This tree includes four core perf fixes for misc bugs, three fixes to
x86 PMU drivers, and two updates to old email addresses"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do not send exit event twice
perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro
perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell
perf: Fix PERF_EVENT_IOC_PERIOD deadlock
treewide: Remove old email address
perf/x86: Fix LBR call stack save/restore
perf: Update email address in MAINTAINERS
perf/core: Robustify the perf_cgroup_from_task() RCU checks
perf/core: Fix RCU problem with cgroup context switching code
The set_event_pid filter relies on attaching to the sched_switch and
sched_wakeup tracepoints to see if it should filter the tracing on schedule
tracepoints. By adding the callbacks to sched_wakeup, pids in the
set_event_pid file will trace the wakeups of those tasks with those pids.
But sched_wakeup_new and sched_waking were missed. These two should also be
traced. Luckily, these tracepoints share the same class as sched_wakeup
which means they can use the same pre and post callbacks as sched_wakeup
does.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When an anomaly is detected in the function call modification code,
ftrace_bug() is called to disable function tracing as well as give any
information that may help debug the problem. Currently, only the first found
trampoline that is attached to the failed record is reported. Instead, show
all trampolines that are hooked to it.
Also, not only show the ops pointer but also report the function it calls.
While at it, add this info to the enabled_functions debug file too.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When an anomaly is found while modifying function code, ftrace_bug() is
called which disables the function tracing infrastructure and reports
information about what failed. If the code that is to be replaced does not
match what is expected, then actual code is shown. Currently there is no
arch generic way to show what was expected.
Add a new variable pointer calld ftrace_expected that the arch code can set
to point to what it expected so that ftrace_bug() can report the actual text
as well as the text that was expected to be there.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace function hook utility has several internal checks to make sure
that whatever it modifies is exactly what it expects to be modifying. This
is essential as modifying running code can be extremely dangerous to the
system.
When an anomaly is detected, ftrace_bug() is called which sends a splat to
the console and disables function tracing. There's some extra information
that is printed to help diagnose the issue.
One thing that is missing though is output of what ftrace was doing at the
time of the crash. Was it updating a call site or perhaps converting a call
site to a nop? A new global enum variable is created to state what ftrace
was doing at the time of the anomaly, and this is reported in ftrace_bug().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a trigger is enabled, the cond flag should be set beforehand,
otherwise a trigger that's expecting to process a trace record
(e.g. one with post_trigger set) could be invoked without one.
Likewise a trigger's cond flag should be reset after it's disabled,
not before.
Link: http://lkml.kernel.org/r/a420b52a67b1c2d3cab017914362d153255acb99.1448303214.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When crossing over to a new page, commit the current work. This will allow
readers to get data with less latency, and also simplifies the work to get
timestamps working for interrupted events.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The first commit of a buffer page updates the timestamp of that page. No
need to have the update to the next page add the timestamp too. It will only
be replaced by the first commit on that page anyway.
Only update to a page if it contains an event.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As cpu_buffer->tail_page may be modified by interrupts at almost any time,
the flow of logic is very important. Do not let gcc get smart with
re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its
accesses.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit fcc742eaad "ring-buffer: Add event descriptor to simplify passing
data" added a descriptor that holds various data instead of passing around
several variables through parameters. The problem was that one of the
parameters was modified in a function and the code was designed not to have
an effect on that modified parameter. Now that the parameter is a
descriptor and any modifications to it are non-volatile, the size of the
data could be unnecessarily expanded.
Remove the extra space added if a timestamp was added and the event went
across the page.
Cc: stable@vger.kernel.org # 4.3+
Fixes: fcc742eaad "ring-buffer: Add event descriptor to simplify passing data"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Do not update the read stamp after swapping out the reader page from the
write buffer. If the reader page is swapped out of the buffer before an
event is written to it, then the read_stamp may get an out of date
timestamp, as the page timestamp is updated on the first commit to that
page.
rb_get_reader_page() only returns a page if it has an event on it, otherwise
it will return NULL. At that point, check if the page being returned has
events and has not been read yet. Then at that point update the read_stamp
to match the time stamp of the reader page.
Cc: stable@vger.kernel.org # 2.6.30+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
One patch is needed to make tracing work without debugfs now that tracing
uses its own tracefs.
The second is removing an unused variable.
The third is fixing a warning about unused variables when MAX_TRACER is
not configured. Note, this warning shows up in gcc 6.0, but does not show
up in gcc 4.9, as it seems that gcc does not complain about constants
not being used.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWRKA+AAoJEKKk/i67LK/8BZ4H/0UgA8+OpbUmBHQk/RdMRZW2
BSwO/GuErq3nSKJUjP4f3b4BGdMoFgThIOxR9dvHKwJvGfwpbTivpgR0OqUdFKAZ
1FyUT/oNM0ySNGpZFGuMQ9hvR+deeUvvyPhosjPWlxZLSnRSsEzTlePUqikX1VY6
hEQj+ulK8J0iLlHMDYLJ5FkQ10+owrI2vmT8BU1O8KZN8EBvhoUx4EfJewD63/pa
Gzjm6C9RpVb6ks555NmwFzLeN1Yd5X4n5m2CQFrIpzBgRlibWIGCPMVe7OgNawMu
wONNvhq9zLfo3pftS092lek4iPgEw7ypl6iEgBxDzqTIAjSxUqf+N+1xIbOCcOE=
=GikR
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull trace cleanups from Steven Rostedt:
"This contains three more clean up patches.
One patch is needed to make tracing work without debugfs now that
tracing uses its own tracefs.
The second is removing an unused variable.
The third is fixing a warning about unused variables when MAX_TRACER
is not configured. Note, this warning shows up in gcc 6.0, but does
not show up in gcc 4.9, as it seems that gcc does not complain about
constants not being used"
* tag 'trace-v4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: #ifdef out uses of max trace when CONFIG_TRACER_MAX_TRACE is not set
tracing: Remove unused ftrace_cpu_disabled per cpu variable
tracing: Make tracing work when debugfs is not configured in
Pull networking fixes from David Miller:
1) Fix null deref in xt_TEE netfilter module, from Eric Dumazet.
2) Several spots need to get to the original listner for SYN-ACK
packets, most spots got this ok but some were not. Whilst covering
the remaining cases, create a helper to do this. From Eric Dumazet.
3) Missiing check of return value from alloc_netdev() in CAIF SPI code,
from Rasmus Villemoes.
4) Don't sleep while != TASK_RUNNING in macvtap, from Vlad Yasevich.
5) Use after free in mvneta driver, from Justin Maggard.
6) Fix race on dst->flags access in dst_release(), from Eric Dumazet.
7) Add missing ZLIB_INFLATE dependency for new qed driver. From Arnd
Bergmann.
8) Fix multicast getsockopt deadlock, from WANG Cong.
9) Fix deadlock in btusb, from Kuba Pawlak.
10) Some ipv6_add_dev() failure paths were not cleaning up the SNMP6
counter state. From Sabrina Dubroca.
11) Fix packet_bind() race, which can cause lost notifications, from
Francesco Ruggeri.
12) Fix MAC restoration in qlcnic driver during bonding mode changes,
from Jarod Wilson.
13) Revert bridging forward delay change which broke libvirt and other
userspace things, from Vlad Yasevich.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
Revert "bridge: Allow forward delay to be cfgd when STP enabled"
bpf_trace: Make dependent on PERF_EVENTS
qed: select ZLIB_INFLATE
net: fix a race in dst_release()
net: mvneta: Fix memory use after free.
net: Documentation: Fix default value tcp_limit_output_bytes
macvtap: Resolve possible __might_sleep warning in macvtap_do_read()
mvneta: add FIXED_PHY dependency
net: caif: check return value of alloc_netdev
net: hisilicon: NET_VENDOR_HISILICON should depend on HAS_DMA
drivers: net: xgene: fix RGMII 10/100Mb mode
netfilter: nft_meta: use skb_to_full_sk() helper
net_sched: em_meta: use skb_to_full_sk() helper
sched: cls_flow: use skb_to_full_sk() helper
netfilter: xt_owner: use skb_to_full_sk() helper
smack: use skb_to_full_sk() helper
net: add skb_to_full_sk() helper and use it in selinux_netlbl_skbuff_setsid()
bpf: doc: correct arch list for supported eBPF JIT
dwc_eth_qos: Delete an unnecessary check before the function call "of_node_put"
bonding: fix panic on non-ARPHRD_ETHER enslave failure
...
Arnd Bergmann reported:
In my ARM randconfig tests, I'm getting a build error for
newly added code in bpf_perf_event_read and bpf_perf_event_output
whenever CONFIG_PERF_EVENTS is disabled:
kernel/trace/bpf_trace.c: In function 'bpf_perf_event_read':
kernel/trace/bpf_trace.c:203:11: error: 'struct perf_event' has no member named 'oncpu'
if (event->oncpu != smp_processor_id() ||
^
kernel/trace/bpf_trace.c:204:11: error: 'struct perf_event' has no member named 'pmu'
event->pmu->count)
This can happen when UPROBE_EVENT is enabled but KPROBE_EVENT
is disabled. I'm not sure if that is a configuration we care
about, otherwise we could prevent this case from occuring by
adding Kconfig dependencies.
Looking at this further, it's really that UPROBE_EVENT enables PERF_EVENTS.
By just having BPF_EVENTS depend on PERF_EVENTS, then all is fine.
Link: http://lkml.kernel.org/r/4525348.Aq9YoXkChv@wuerfel
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
tracing_max_lat_fops is used only when TRACER_MAX_TRACE enabled, so also
swith the related code. The related warning with defconfig under x86_64:
CC kernel/trace/trace.o
kernel/trace/trace.c:5466:37: warning: ‘tracing_max_lat_fops’ defined but not used [-Wunused-const-variable]
static const struct file_operations tracing_max_lat_fops = {
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the ring buffer is lockless, there is no need to disable ftrace on
CPU. And no one doing so: after commit 68179686ac ("tracing: Remove
ftrace_disable/enable_cpu()") ftrace_cpu_disabled stays the same after
initialization, nothing changes it.
ftrace_cpu_disabled shouldn't be used by any external module since it
disables only function and graph_function tracers but not any other
tracer.
Link: http://lkml.kernel.org/r/1446836846-22239-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
stable tags to them. I searched through my INBOX just as the merge window
opened and found lots of patches to pull. I ran them through all my tests
and they were in linux-next for a few days.
Features added this release:
----------------------------
o Module globbing. You can now filter function tracing to several
modules. # echo '*:mod:*snd*' > set_ftrace_filter (Dmitry Safonov)
o Tracer specific options are now visible even when the tracer is not
active. It was rather annoying that you can only see and modify tracer
options after enabling the tracer. Now they are in the options/ directory
even when the tracer is not active. Although they are still only visible
when the tracer is active in the trace_options file.
o Trace options are now per instance (although some of the tracer specific
options are global)
o New tracefs file: set_event_pid. If any pid is added to this file, then
all events in the instance will filter out events that are not part of
this pid. sched_switch and sched_wakeup events handle next and the wakee
pids.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWPLQ5AAoJEKKk/i67LK/8CTYIAI1u8DE5QCzv3J0p54jVpNVR
J5FqEU3eXIzd6FS4JXD4nxCeMpUZAy21YnhlZpsnrbJJM5bc9bUsBCwiKKM+MuSZ
ztmy2sgYKkO0h/KUdhNgYJrzis3/Ojquyx9iAqK5ST/Fr+nKYx81akFKjNK53iur
RJRut45sSa8rv11LaL8sgJ6hAWQTc+YkybUdZ5xaMdJmZ6A61T7Y6VzTjbUexuvL
hntCfTjYLtVd8dbfknAnf3B7n/VOO3IFF85wr7ciYR5oEVfPrF8tHmJBlhHExPpX
kaXAiDDRY/UTg/5DQqnp4zmxJoR5BQ2l4pT5PwiLcnwhcphIDNYS8EYUmOYAWjU=
=TjOE
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracking updates from Steven Rostedt:
"Most of the changes are clean ups and small fixes. Some of them have
stable tags to them. I searched through my INBOX just as the merge
window opened and found lots of patches to pull. I ran them through
all my tests and they were in linux-next for a few days.
Features added this release:
----------------------------
- Module globbing. You can now filter function tracing to several
modules. # echo '*:mod:*snd*' > set_ftrace_filter (Dmitry Safonov)
- Tracer specific options are now visible even when the tracer is not
active. It was rather annoying that you can only see and modify
tracer options after enabling the tracer. Now they are in the
options/ directory even when the tracer is not active. Although
they are still only visible when the tracer is active in the
trace_options file.
- Trace options are now per instance (although some of the tracer
specific options are global)
- New tracefs file: set_event_pid. If any pid is added to this file,
then all events in the instance will filter out events that are not
part of this pid. sched_switch and sched_wakeup events handle next
and the wakee pids"
* tag 'trace-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
tracefs: Fix refcount imbalance in start_creating()
tracing: Put back comma for empty fields in boot string parsing
tracing: Apply tracer specific options from kernel command line.
tracing: Add some documentation about set_event_pid
ring_buffer: Remove unneeded smp_wmb() before wakeup of reader benchmark
tracing: Allow dumping traces without tracking trace started cpus
ring_buffer: Fix more races when terminating the producer in the benchmark
ring_buffer: Do no not complete benchmark reader too early
tracing: Remove redundant TP_ARGS redefining
tracing: Rename max_stack_lock to stack_trace_max_lock
tracing: Allow arch-specific stack tracer
recordmcount: arm64: Replace the ignored mcount call into nop
recordmcount: Fix endianness handling bug for nop_mcount
tracepoints: Fix documentation of RCU lockdep checks
tracing: ftrace_event_is_function() can return boolean
tracing: is_legal_op() can return boolean
ring-buffer: rb_event_is_commit() can return boolean
ring-buffer: rb_per_cpu_empty() can return boolean
ring_buffer: ring_buffer_empty{cpu}() can return boolean
ring-buffer: rb_is_reader_page() can return boolean
...
Currently tracing_init_dentry() returns -ENODEV when debugfs is not
configured in, which causes tracefs not populated with tracing files and
directories, so we will get an empty directory even after we manually
mount tracefs.
We can make tracing_init_dentry() return NULL if debugfs is not
configured in and can manually mount tracefs. But return -ENODEV
if debugfs is configured in but not initialized or failed to create
automount point as that would break backward compatibility with older
tools.
Link: http://lkml.kernel.org/r/1446797056-11683-1-git-send-email-hello.wjx@gmail.com
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull core block updates from Jens Axboe:
"This is the core block pull request for 4.4. I've got a few more
topic branches this time around, some of them will layer on top of the
core+drivers changes and will come in a separate round. So not a huge
chunk of changes in this round.
This pull request contains:
- Enable blk-mq page allocation tracking with kmemleak, from Catalin.
- Unused prototype removal in blk-mq from Christoph.
- Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
xchg()'s, from Davidlohr.
- A plug flush fix from Jeff.
- Also from Jeff, a fix that means we don't have to update shared tag
sets at init time unless we do a state change. This cuts down boot
times on thousands of devices a lot with scsi/blk-mq.
- blk-mq waitqueue barrier fix from Kosuke.
- Various fixes from Ming:
- Fixes for segment merging and splitting, and checks, for
the old core and blk-mq.
- Potential blk-mq speedup by marking ctx pending at the end
of a plug insertion batch in blk-mq.
- direct-io no page dirty on kernel direct reads.
- A WRITE_SYNC fix for mpage from Roman"
* 'for-4.4/core' of git://git.kernel.dk/linux-block:
blk-mq: avoid excessive boot delays with large lun counts
blktrace: re-write setting q->blk_trace
blk-mq: mark ctx as pending at batch in flush plug path
blk-mq: fix for trace_block_plug()
block: check bio_mergeable() early before merging
blk-mq: check bio_mergeable() early before merging
block: avoid to merge splitted bio
block: setup bi_phys_segments after splitting
block: fix plug list flushing for nomerge queues
blk-mq: remove unused blk_mq_clone_flush_request prototype
blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
block: kmemleak: Track the page allocations for struct request
Pull networking updates from David Miller:
Changes of note:
1) Allow to schedule ICMP packets in IPVS, from Alex Gartrell.
2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
David Ahern.
3) Allow the user to ask for the statistics to be filtered out of
ipv4/ipv6 address netlink dumps. From Sowmini Varadhan.
4) More work to pass the network namespace context around deep into
various packet path APIs, starting with the netfilter hooks. From
Eric W Biederman.
5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
Richter.
6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.
7) Support Very High Throughput in wireless MESH code, from Bob
Copeland.
8) Allow setting the ageing_time in switchdev/rocker. From Scott
Feldman.
9) Properly autoload L2TP type modules, from Stephen Hemminger.
10) Fix and enable offload features by default in 8139cp driver, from
David Woodhouse.
11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
Jiri Benc.
12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
Opstad.
13) Fix IPSEC flowcache overflows on large systems, from Steffen
Klassert.
14) Convert bridging to track VLANs using rhashtable entries rather than
a bitmap. From Nikolay Aleksandrov.
15) Make TCP listener handling completely lockless, this is a major
accomplishment. Incoming request sockets now live in the
established hash table just like any other socket too.
From Eric Dumazet.
15) Provide more bridging attributes to netlink, from Nikolay
Aleksandrov.
16) Use hash based algorithm for ipv4 multipath routing, this was very
long overdue. From Peter Nørlund.
17) Several y2038 cures, mostly avoiding timespec. From Arnd Bergmann.
18) Allow non-root execution of EBPF programs, from Alexei Starovoitov.
19) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet. This
influences the port binding selection logic used by SO_REUSEPORT.
20) Add ipv6 support to VRF, from David Ahern.
21) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.
22) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.
23) Implement RACK loss recovery in TCP, from Yuchung Cheng.
24) Support multipath routes in MPLS, from Roopa Prabhu.
25) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
Dumazet.
26) Add new QED Qlogic river, from Yuval Mintz, Manish Chopra, and
Sudarsana Kalluru.
27) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
Sowa.
28) Support ipv6 geneve tunnels, from John W Linville.
29) Add flood control support to switchdev layer, from Ido Schimmel.
30) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
Hannes Frederic Sowa.
31) Support persistent maps and progs in bpf, from Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
sh_eth: use DMA barriers
switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
net: sched: kill dead code in sch_choke.c
irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
net: dsa: mv88e6xxx: include DSA ports in VLANs
net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
net/core: fix for_each_netdev_feature
vlan: Invoke driver vlan hooks only if device is present
arcnet/com20020: add LEDS_CLASS dependency
bpf, verifier: annotate verbose printer with __printf
dp83640: Only wait for timestamps for packets with timestamping enabled.
ptp: Change ptp_class to a proper bitmask
dp83640: Prune rx timestamp list before reading from it
dp83640: Delay scheduled work.
dp83640: Include hash in timestamp/packet matching
ipv6: fix tunnel error handling
net/mlx5e: Fix LSO vlan insertion
net/mlx5e: Re-eanble client vlan TX acceleration
net/mlx5e: Return error in case mlx5e_set_features() fails
net/mlx5e: Don't allow more than max supported channels
...
Both early_enable_events() and apply_trace_boot_options() parse a boot
string that may get parsed later on. They both use strsep() which converts a
comma into a nul character. To still allow the boot string to be parsed
again the same way, the nul character gets converted back to a comma after
the token is processed.
The problem is that these two functions check for an empty parameter (two
commas in a row ",,"), and continue the loop if the parameter is empty, but
fails to place the comma back. In this case, the second parsing will end at
this blank field, and not process fields afterward.
In most cases, users should not have an empty field, but if its going to be
checked, the code might as well be correct.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the trace_options parameter is only applied in
tracer_alloc_buffers() when global_trace.current_trace is nop_trace,
so a tracer specific option will not be applied even when the specific
tracer is also enabled from kernel command line. For example, the
'func_stack_trace' option can't be enabled with the following kernel
parameter:
ftrace=function ftrace_filter=kfree trace_options=func_stack_trace
We can enable tracer specific options by simply apply the options again
if the specific tracer is also supplied from command line and started
in register_tracer().
To make trace_boot_options_buf can be parsed again, a comma and a space
is put back if they were replaced by strsep and strstrip respectively.
Also make register_tracer() be __init to access the __init data, and
in fact register_tracer is only called from __init code.
Link: http://lkml.kernel.org/r/1446599669-9294-1-git-send-email-hello.wjx@gmail.com
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle were:
- sched/fair load tracking fixes and cleanups (Byungchul Park)
- Make load tracking frequency scale invariant (Dietmar Eggemann)
- sched/deadline updates (Juri Lelli)
- stop machine fixes, cleanups and enhancements for bugs triggered by
CPU hotplug stress testing (Oleg Nesterov)
- scheduler preemption code rework: remove PREEMPT_ACTIVE and related
cleanups (Peter Zijlstra)
- Rework the sched_info::run_delay code to fix races (Peter Zijlstra)
- Optimize per entity utilization tracking (Peter Zijlstra)
- ... misc other fixes, cleanups and smaller updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
sched: Start stopper early
stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
sched/x86: Fix typo in __switch_to() comments
sched/core: Remove a parameter in the migrate_task_rq() function
sched/core: Drop unlikely behind BUG_ON()
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
sched/numa: Fix task_tick_fair() from disabling numa_balancing
sched/core: Add preempt_count invariant check
sched/core: More notrace annotations
sched/core: Kill PREEMPT_ACTIVE
sched/core, sched/x86: Kill thread_info::saved_preempt_count
sched/core: Simplify preempt_count tests
sched/core: Robustify preemption leak checks
sched/core: Stop setting PREEMPT_ACTIVE
...
wake_up_process() has a memory barrier before doing anything, thus adding a
memory barrier before calling it is redundant.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We don't init iter->started when dumping the ftrace buffer, and there's no
real need to do so - so allow skipping that check if the iter doesn't have
an initialized ->started cpumask.
Link: http://lkml.kernel.org/r/1441385156-27279-1-git-send-email-sasha.levin@oracle.com
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The commit b44754d826 ("ring_buffer: Allow to exit the ring
buffer benchmark immediately") added a hack into ring_buffer_producer()
that set @kill_test when kthread_should_stop() returned true. It improved
the situation a lot. It stopped the kthread in most cases because
the producer spent most of the time in the patched while cycle.
But there are still few possible races when kthread_should_stop()
is set outside of the cycle. Then we do not set @kill_test and
some other checks pass.
This patch adds a better fix. It renames @test_kill/TEST_KILL() into
a better descriptive @test_error/TEST_ERROR(). Also it introduces
break_test() function that checks for both @test_error and
kthread_should_stop().
The new function is used in the producer when the check for @test_error
is not enough. It is not used in the consumer because its state
is manipulated by the producer via the "reader_finish" variable.
Also we add a missing check into ring_buffer_producer_thread()
between setting TASK_INTERRUPTIBLE and calling schedule_timeout().
Otherwise, we might miss a wakeup from kthread_stop().
Link: http://lkml.kernel.org/r/1441629518-32712-3-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems that complete(&read_done) might be called too early
in some situations.
1st scenario:
-------------
CPU0 CPU1
ring_buffer_producer_thread()
wake_up_process(consumer);
wait_for_completion(&read_start);
ring_buffer_consumer_thread()
complete(&read_start);
ring_buffer_producer()
# producing data in
# the do-while cycle
ring_buffer_consumer();
# reading data
# got error
# set kill_test = 1;
set_current_state(
TASK_INTERRUPTIBLE);
if (reader_finish) # false
schedule();
# producer still in the middle of
# do-while cycle
if (consumer && !(cnt % wakeup_interval))
wake_up_process(consumer);
# spurious wakeup
while (!reader_finish &&
!kill_test)
# leaving because
# kill_test == 1
reader_finish = 0;
complete(&read_done);
1st BANG: We might access uninitialized "read_done" if this is the
the first round.
# producer finally leaving
# the do-while cycle because kill_test == 1;
if (consumer) {
reader_finish = 1;
wake_up_process(consumer);
wait_for_completion(&read_done);
2nd BANG: This will never complete because consumer already did
the completion.
2nd scenario:
-------------
CPU0 CPU1
ring_buffer_producer_thread()
wake_up_process(consumer);
wait_for_completion(&read_start);
ring_buffer_consumer_thread()
complete(&read_start);
ring_buffer_producer()
# CPU3 removes the module <--- difference from
# and stops producer <--- the 1st scenario
if (kthread_should_stop())
kill_test = 1;
ring_buffer_consumer();
while (!reader_finish &&
!kill_test)
# kill_test == 1 => we never go
# into the top level while()
reader_finish = 0;
complete(&read_done);
# producer still in the middle of
# do-while cycle
if (consumer && !(cnt % wakeup_interval))
wake_up_process(consumer);
# spurious wakeup
while (!reader_finish &&
!kill_test)
# leaving because kill_test == 1
reader_finish = 0;
complete(&read_done);
BANG: We are in the same "bang" situations as in the 1st scenario.
Root of the problem:
--------------------
ring_buffer_consumer() must complete "read_done" only when "reader_finish"
variable is set. It must not be skipped due to other conditions.
Note that we still must keep the check for "reader_finish" in a loop
because there might be spurious wakeups as described in the
above scenarios.
Solution:
----------
The top level cycle in ring_buffer_consumer() will finish only when
"reader_finish" is set. The data will be read in "while-do" cycle
so that they are not read after an error (kill_test == 1)
or a spurious wake up.
In addition, "reader_finish" is manipulated by the producer thread.
Therefore we add READ_ONCE() to make sure that the fresh value is
read in each cycle. Also we add the corresponding barrier
to synchronize the sleep check.
Next we set the state back to TASK_RUNNING for the situation where we
did not sleep.
Just from paranoid reasons, we initialize both completions statically.
This is safer, in case there are other races that we are unaware of.
As a side effect we could remove the memory barrier from
ring_buffer_producer_thread(). IMHO, this was the reason for
the barrier. ring_buffer_reset() uses spin locks that should
provide the needed memory barrier for using the buffer.
Link: http://lkml.kernel.org/r/1441629518-32712-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
TP_ARGS is not used anywhere in trace.h nor trace_entries.h
Firstly, I left just #undef TP_ARGS and had no errors - remove it.
Link: http://lkml.kernel.org/r/1446576560-14085-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that max_stack_lock is a global variable, it requires a naming
convention that is unlikely to collide. Rename it to the same naming
convention that the other stack_trace variables have.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A stack frame may be used in a different way depending on cpu architecture.
Thus it is not always appropriate to slurp the stack contents, as current
check_stack() does, in order to calcurate a stack index (height) at a given
function call. At least not on arm64.
In addition, there is a possibility that we will mistakenly detect a stale
stack frame which has not been overwritten.
This patch makes check_stack() a weak function so as to later implement
arch-specific version.
Link: http://lkml.kernel.org/r/1446182741-31019-5-git-send-email-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make ftrace_event_is_function() return bool to improve readability
due to this particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-9-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make is_legal_op() return bool to improve readability due to this particular
function only using either one or zero as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-8-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make rb_event_is_commit() return bool to improve readability
due to this particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-7-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Make rb_is_reader_page() return bool to improve readability due to this
particular function only using either true or false as its return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-4-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch makes report_latency return bool due to this
particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-3-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch makes report_latency return bool to improve readability,
indicating whether this new latency should be reported/recorded.
No functional change.
Link: http://lkml.kernel.org/r/1443537816-5788-2-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Update instancd_rmdir to use tracefs_remove_recursive instead of
debugfs_remove_recursive.This was left in the transition from debugfs
to tracefs.
Link: http://lkml.kernel.org/r/1445169490-18315-2-git-send-email-hello.wjx@gmail.com
Cc: stable@vger.kernel.org # 4.1+
Fixes: 8434dc9340 ("tracing: Convert the tracing facility over to use tracefs")
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's no need to record the time tracepoints take when tracing is off.
This is because:
1) We cannot see these records since ring_buffer record is off at that
moment.
2) If tracing is off and benchmark tracepoint is enabled, the time
tracepoint takes is fewer than the same situation when tracing is on,
since the tracepoints need to be wrote into ring_buffer, it would
take more time. If turn on tracing at this moment, the average and
standard deviation cannot exactly present the time that tracepoints
take to write data into ring_buffer.
Link: http://lkml.kernel.org/r/1445947933-27955-1-git-send-email-zhang.chunyan@linaro.org
Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For the case where pids are already in set_event_pid, and one is added or
removed then each CPU should be checked to make sure that the new or old pid
is on or not on a CPU.
For example:
# echo 123 >> set_event_pid
or
# echo '!123' >> set_event_pid
Link: http://lkml.kernel.org/r/20151030061643.GA19480@cac
Suggested-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is really about simplifying the double xchg patterns into
a single cmpxchg, with the same logic. Other than the immediate
cleanup, there are some subtleties this change deals with:
(i) While the load of the old bt is fully ordered wrt everything,
ie:
old_bt = xchg(&q->blk_trace, bt); [barrier]
if (old_bt)
(void) xchg(&q->blk_trace, old_bt); [barrier]
blk_trace could still be changed between the xchg and the old_bt
load. Note that this description is merely theoretical and afaict
very small, but doing everything in a single context with cmpxchg
closes this potential race.
(ii) Ordering guarantees are obviously kept with cmpxchg.
(iii) Gets rid of the hacky-by-nature (void)xchg pattern.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
eviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
exported perf symbols are GPL only, mark eBPF helper functions
used in tracing as GPL only as well.
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map
(do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise buggy bpf program can cause kernel splat.
Also fix error path after perf_event_attrs()
and remove redundant 'extern'.
Fixes: 35578d7984 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
p_start() and p_stop() are seq_file functions that match. Teach sparse to
know that rcu_read_lock_sched() that is taken by p_start() is released by
p_stop.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
My tests found that if a task is running but not filtered when set_event_pid
is modified, then it can still be traced.
Call on_each_cpu() to check if the current running task should be filtered
and update the per cpu flags of tr->data appropriately.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the necessary hooks to use the pids loaded in set_event_pid to filter
all the events enabled in the tracing instance that match the pids listed.
Two probes are added to both sched_switch and sched_wakeup tracepoints to be
called before other probes are called and after the other probes are called.
The first is used to set the necessary flags to let the probes know to test
if they should be traced or not.
The sched_switch pre probe will set the "ignore_pid" flag if neither the
previous or next task has a matching pid.
The sched_switch probe will set the "ignore_pid" flag if the next task
does not match the matching pid.
The pre probe allows for probes tracing sched_switch to be traced if
necessary.
The sched_wakeup pre probe will set the "ignore_pid" flag if neither the
current task nor the wakee task has a matching pid.
The sched_wakeup post probe will set the "ignore_pid" flag if the current
task does not have a matching pid.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Create a tracing directory called set_event_pid, which currently has no
function, but will be used to filter all events for the tracing instance or
the pids that are added to the file.
The reason no functionality is added with this commit is that this commit
focuses on the creation and removal of the pids in a safe manner. And tests
can be made against this change to make sure things are correct before
hooking features to the list of pids.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This helper is used to send raw data from eBPF program into
special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
User space needs to perf_event_open() it (either for one or all cpus) and
store FD into perf_event_array (similar to bpf_perf_event_read() helper)
before eBPF program can send data into it.
Today the programs triggered by kprobe collect the data and either store
it into the maps or print it via bpf_trace_printk() where latter is the debug
facility and not suitable to stream the data. This new helper replaces
such bpf_trace_printk() usage and allows programs to have dedicated
channel into user space for post-processing of the raw data collected.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both start_branch_trace() and stop_branch_trace() are used in only one
location, and are both static. As they are small functions there is no
need to keep them separated out.
Link: http://lkml.kernel.org/r/1445000689-32596-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a new options to trace Kconfig, CONFIG_TRACING_EVENTS_GPIO, that is
used for enabling/disabling compilation of gpio function trace events.
Link: http://lkml.kernel.org/r/1438432079-11704-4-git-send-email-tal.shorer@gmail.com
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Tal Shorer <tal.shorer@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code in stack tracer should not be executed within an NMI as it grabs
spinlocks and stack tracing an NMI gives the possibility of causing a
deadlock. Although this is safe on x86_64, because it does not perform stack
traces when the task struct stack is not in use (interrupts and NMIs), it
may be an issue for NMIs on i386 and other archs that use the same stack as
the NMI.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Extend module command for function filter selection with globbing.
It uses the same globbing as function filter.
sh# echo '*alloc*:mod:*' > set_ftrace_filter
Will trace any function with the letters 'alloc' in the name in any
module but not in kernel.
sh# echo '!*alloc*:mod:ipv6' >> set_ftrace_filter
Will prevent from tracing functions with 'alloc' in the name from module
ipv6 (do not forget to append to set_ftrace_filter file).
sh# echo '*alloc*:mod:!ipv6' > set_ftrace_filter
Will trace functions with 'alloc' in the name from kernel and any
module except ipv6.
sh# echo '*alloc*:mod:!*' > set_ftrace_filter
Will trace any function with the letters 'alloc' in the name only from
kernel, but not from any module.
sh# echo '*:mod:!*' > set_ftrace_filter
or
sh# echo ':mod:!' > set_ftrace_filter
Will trace every function in the kernel, but will not trace functions
from any module.
sh# echo '*:mod:*' > set_ftrace_filter
or
sh# echo ':mod:' > set_ftrace_filter
As the opposite will trace all functions from all modules, but not from
kernel.
sh# echo '*:mod:*snd*' > set_ftrace_filter
Will trace your sound drivers only (if any).
Link: http://lkml.kernel.org/r/1443545176-3215-4-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
[ Made format changes ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_match parameters are very related and I reduce the number of local
variables & parameters with it.
This is also preparation for module globbing as it would introduce more
realated variables & parameters.
Link: http://lkml.kernel.org/r/1443545176-3215-3-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
[ Made some formatting changes ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The stack tracer was triggering the WARN_ON() in module.c:
static void module_assert_mutex_or_preempt(void)
{
#ifdef CONFIG_LOCKDEP
if (unlikely(!debug_locks))
return;
WARN_ON(!rcu_read_lock_sched_held() &&
!lockdep_is_held(&module_mutex));
#endif
}
The reason is that the stack tracer traces all function calls, and some of
those calls happen while exiting or entering user space and idle. Some of
these functions are called after RCU had already stopped watching, as RCU
does not watch userspace or idle CPUs.
If a max stack is hit, then the save_stack_trace() is called, which will
check module addresses and call module_assert_mutex_or_preempt(), and then
trigger the warning. Sad part is, the warning itself will also do a stack
trace and tigger the same warning. That probably should be fixed.
The warning was added by 0be964be0d "module: Sanitize RCU usage and
locking" but this bug has probably been around longer. But it's unlikely to
cause much harm, but the new warning causes the system to lock up.
Cc: stable@vger.kernel.org # 4.2+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc:"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
"Not" is too abstract variable name - changed to clear_filter.
Removed ftrace_match_module_records function: comparison with !* or *
not does the general code in filter_parse_regex() as it works without
mod command for
sh# echo '!*' > /sys/kernel/debug/tracing/set_ftrace_filter
Link: http://lkml.kernel.org/r/1443545176-3215-2-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
By now there isn't any subcommand for mod.
Before:
sh$ echo '*:mod:ipv6:a' > set_ftrace_filter
sh$ echo '*:mod:ipv6' > set_ftrace_filter
had the same results, but now first will result in:
sh$ echo '*:mod:ipv6:a' > set_ftrace_filter
-bash: echo: write error: Invalid argument
Also, I clarified ftrace_mod_callback code a little.
Link: http://lkml.kernel.org/r/1443545176-3215-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
[ converted 'if (ret == 0)' to 'if (!ret)' ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
__trace_sched_switch_state() is the last remaining PREEMPT_ACTIVE
user, move trace_sched_switch() from prepare_task_switch() to
__schedule() and propagate the @preempt argument.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To cover the common case of sorting an array of pointers, Daniel
Wagner recently modified the library sort() to use a specific swap
function for size==8, in addition to the size==4 case which was
already handled. Since sizeof(long) is either 4 or 8,
ftrace_swap_ips() is redundant and we can just let sort() pick an
appropriate and fast swap callback.
Link: http://lkml.kernel.org/r/1441834023-13130-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The kernel now has kstrdup_const/kfree_const for reusing .rodata
(typically string literals) when possible; there's no reason to
duplicate that logic in the tracing system. Moreover, as the comment
above core_kernel_data states, it may not always return true for
.rodata - that is for example the case on x86_64, where we thus end up
kstrdup'ing all the passed-in strings.
Arguably, testing for .rodata explicitly (as kstrdup_const does) is
also more correct: I don't think one is supposed to be able to change
the name after creating the event_subsystem by passing the address of
a static char (but non-const) array.
Link: http://lkml.kernel.org/r/1441833841-12955-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the tracer options to instances options directory as well. Only add the
options for tracers that are allowed to be enabled by an instance. But note,
that tracer options are global. That is, tracer options enabled in an
instance, also take affect at the top level and in other instances.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow instances to have their own options, at least for the core options
(non tracer specific ones). There are a few global options that should not
be added to instances, like enabling of trace_printk, and the sched comm
recording, which do not have a specific trace instance associated to them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for the multi buffer instances to have their own trace_flags,
the check in ftrace_trace_stack() needs to test the trace_array descriptor
flag that is for the current event, not the global_trace descriptor.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation of having the multi buffer instances having their own trace
option flags, the trace option files needs a way to not only pass in the
flag they represent, but also the trace_array descriptor.
A new field is added to the trace_array descriptor called trace_flags_index,
which is a 32 byte character array representing a bit. This array is simply
filled with the index of the array, where
index_array[n] = n;
Then the address of this array is passed to the file callbacks instead of
the index of the flag index. Then to retrieve both the flag index and the
trace_array descriptor:
data is the passed in argument.
index = *(unsigned char *)data;
data -= index;
/* Now data points to the address of the array in the trace_array */
tr = container_of(data, struct trace_array, trace_flags_index);
Suggested-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation to make trace options per instance, the global trace_flags
needs to be moved from being a global variable to a field within the trace
instance trace_array structure.
There's still more work to do, as there's some functions that use
trace_flags without passing in a way to get to the current_trace array. For
those, the global_trace is used directly (from trace.c). This includes
setting and clearing the trace_flags. This means that when a new instance is
created, it just gets the trace_flags of the global_trace and will not be
able to modify them. Depending on the functions that have access to the
trace_array, the flags of an instance may not affect parts of its trace,
where the global_trace is used. These will be fixed in future changes.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The sleep-time and graph-time options are only for the function graph tracer
and are not used by anything else. As tracer options are now visible when
the tracer is not activated, its better to move the function graph specific
tracer options into the function graph tracer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the effort to move the global trace_flags to the tracing instances, the
direct access to trace_flags must be removed from trace_printk.c
Instead, add a new trace_printk_enabled boolean that is set by a new access
function trace_printk_control(), that will enable or disable trace_printk.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a enum that denotes the last bit of the trace_flags and have a
BUILD_BUG_ON(last_bit > 32).
If we add more bits than we have in trace_flags, the kernel wont build.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are options that are unique to a specific tracer (like function and
function graph). Currently, these options are only visible in the options
directory when the tracer is enabled.
This has been a pain, especially for something like the func_stack_trace
option that if used inappropriately, could bring the system to a crawl. But
the only way to see it, is to enable the function tracer.
For example, if one had done:
# cd /sys/kernel/tracing
# echo __schedule > set_ftrace_filter
# echo 1 > options/func_stack_trace
# echo function > current_tracer
The __schedule call will be traced and a stack trace will also be recorded
there. Now when you were done, you may do...
# echo nop > current_tracer
# echo > set_ftrace_filter
But you forgot to disable the func_stack_trace. The only way to disable it
is to re-enable function tracing first. If you do not add a filter to
set_ftrace_filter and just do:
# echo function > current_tracer
Now you would be performing a stack trace on *every* function! On some
systems, that causes a live lock. Others may take a few minutes to fix your
mistake.
Having the func_stack_trace option visible allows you to check it and
disable it before enabling the funtion tracer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Only create the stacktrace trace option when CONFIG_STACKTRACE is
configured.
Cleaned up the ftrace_trace_stack() function call a little to allow better
encapsulation of the stacktrace trace flag.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the function tracer is not compiled in, do not create the option files
for it.
Fix up both the sched_wakeup and irqsoff tracers to handle the change.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use a cute little macro trick to keep the names of the trace flags file
guaranteed to match the corresponding masks.
The macro TRACE_FLAGS is defined as a serious of enum names followed by
the string name of the file that matches it. For example:
#define TRACE_FLAGS \
C(PRINT_PARENT, "print-parent"), \
C(SYM_OFFSET, "sym-offset"), \
C(SYM_ADDR, "sym-addr"), \
C(VERBOSE, "verbose"),
Now we can define the following:
#undef C
#define C(a, b) TRACE_ITER_##a##_BIT
enum trace_iterator_bits { TRACE_FLAGS };
The above creates:
enum trace_iterator_bits {
TRACE_ITER_PRINT_PARENT_BIT,
TRACE_ITER_SYM_OFFSET_BIT,
TRACE_ITER_SYM_ADDR_BIT,
TRACE_ITER_VERBOSE_BIT,
};
Then we can redefine C as:
#undef C
#define C(a, b) TRACE_ITER_##a = (1 << TRACE_ITER_##a##_BIT)
enum trace_iterator_flags { TRACE_FLAGS };
Which creates:
enum trace_iterator_flags {
TRACE_ITER_PRINT_PARENT = (1 << TRACE_ITER_PRINT_PARENT_BIT),
TRACE_ITER_SYM_OFFSET = (1 << TRACE_ITER_SYM_OFFSET_BIT),
TRACE_ITER_SYM_ADDR = (1 << TRACE_ITER_SYM_ADDR_BIT),
TRACE_ITER_VERBOSE = (1 << TRACE_ITER_VERBOSE_BIT),
};
Then finally we can create the list of file names:
#undef C
#define C(a, b) b
static const char *trace_options[] = {
TRACE_FLAGS
NULL
};
Which creates:
static const char *trace_options[] = {
"print-parent",
"sym-offset",
"sym-addr",
"verbose",
NULL
};
The importance of this is that the strings match the bit index.
trace_options[TRACE_ITER_SYM_ADDR_BIT] == "sym-addr"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using enums with FLAG_BIT and then defining a FLAG = (1 << FLAG_BIT), is a
bit more robust as we require that there are no bits out of order or skipped
to match the file names that represent the bits.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There was a time where the function tracing would disable interrupts unless
specifically told not to, where it would only disable preemption. With the
new lockless code, the function tracing never disalbes interrupts and just
uses disabling of preemption. Remove the option "ftrace_preempt" as it does
nothing anyway.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to facilitate making all tracer options visible even when the
tracer is not active, we need to get rid of duplicate options. Any option
that is shared between multiple tracers really should be a main option.
As the wakeup and irqsoff tracers both use the "display-graph" option, and
use it exactly the same way, move that option from the tracer options to the
main options and consolidate them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
seq_print_user_ip() is used in only one location in one file. Turn it into a
static function. We could inject its code into the caller, but that would
make the code a bit too complex. Keep the code separate.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
seq_print_userip_objs() is used only in one location, in one file. Instead
of having it as an external function, go one further than making it static,
but inject is code into its only user. It doesn't make the calling function
much more complex.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for having trace options be per instance, the trace_array
needs to be passed to the trace_buffer_unlock_commit(). The
trace_event_buffer_lock_reserve() already passes in the trace_event_file
where the trace_array can be derived from.
Also added a "__init" to the boot up test event plus function tracing
function function_test_events_call().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_trace_stack_regs() is used in only one place, and because that is
such a simple function, just move its code into the location that it was
used in (trace_buffer_unlock_commit_regs()).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch makes is_good_name return bool to improve readability
due to this particular function only using either one or zero as its
return value.
No functional change.
Link: http://lkml.kernel.org/r/1442929393-4753-2-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The changes with more meat are:
o Allowing the trace event filters to filter on CPU number and process ids
o Two new markers for trace output latency were added
(10 and 100 msec latencies)
o Have tracing_thresh filter function profiling time
I also worked on modifying the ring buffer code for some future
work, and moved the adding of the timestamp around. One of my changes
caused a regression, and since other changes were built on top of it
and already tested, I had to operate a revert of that change. Instead
of rebasing, this change set has the code that caused a regression
as well as the code to revert that change without touching the other
changes that were made on top of it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJV6aZEAAoJEEjnJuOKh9ldrR4H/A1RcQf1prLLoUibPP4w3lat
dmQcdpS1NY+cqyiKuKPAOkFDGQL7qWzRqZ8whcPSJIsHq57ufqNSLf+0bbQYPzg9
g3CgGL7OApmGi5ulj0sNxhadvc9TFm/SAN0nVJlNuUWdm8e1UWHLsrJZaMfopu2r
RDEtkOhg619mhDL4rktNdS6rk0B92Fhu2o2PwLZPVlUl1NNEt4WJU+ejitXUVO1A
Nb70/rTGGJKtyHbW+74on4LnEN5Uu0Viu6rMwGfYyIgRmC2otdBDvE4xfKMiTUKr
SzBjzrhIoMIRn4Vl0vElfulkpYaw7pcC2BdpZ4d9VpIOiLSlZs0x/TgCtpFEv5M=
=baZ3
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing update from Steven Rostedt:
"Mostly this is just clean ups and micro optimizations.
The changes with more meat are:
- Allowing the trace event filters to filter on CPU number and
process ids
- Two new markers for trace output latency were added (10 and 100
msec latencies)
- Have tracing_thresh filter function profiling time
I also worked on modifying the ring buffer code for some future work,
and moved the adding of the timestamp around. One of my changes
caused a regression, and since other changes were built on top of it
and already tested, I had to operate a revert of that change. Instead
of rebasing, this change set has the code that caused a regression as
well as the code to revert that change without touching the other
changes that were made on top of it"
* tag 'trace-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated"
tracing: Don't make assumptions about length of string on task rename
tracing: Allow triggers to filter for CPU ids and process names
ftrace: Format MCOUNT_ADDR address as type unsigned long
tracing: Introduce two additional marks for delay
ftrace: Fix function_graph duration spacing with 7-digits
ftrace: add tracing_thresh to function profile
tracing: Clean up stack tracing and fix fentry updates
ring-buffer: Reorganize function locations
ring-buffer: Make sure event has enough room for extend and padding
ring-buffer: Get timestamp after event is allocated
ring-buffer: Move the adding of the extended timestamp out of line
ring-buffer: Add event descriptor to simplify passing data
ftrace: correct the counter increment for trace_buffer data
tracing: Fix for non-continuous cpu ids
tracing: Prefer kcalloc over kzalloc with multiply
Pull networking updates from David Miller:
"Another merge window, another set of networking changes. I've heard
rumblings that the lightweight tunnels infrastructure has been voted
networking change of the year. But what do I know?
1) Add conntrack support to openvswitch, from Joe Stringer.
2) Initial support for VRF (Virtual Routing and Forwarding), which
allows the segmentation of routing paths without using multiple
devices. There are some semantic kinks to work out still, but
this is a reasonably strong foundation. From David Ahern.
3) Remove spinlock fro act_bpf fast path, from Alexei Starovoitov.
4) Ignore route nexthops with a link down state in ipv6, just like
ipv4. From Andy Gospodarek.
5) Remove spinlock from fast path of act_gact and act_mirred, from
Eric Dumazet.
6) Document the DSA layer, from Florian Fainelli.
7) Add netconsole support to bcmgenet, systemport, and DSA. Also
from Florian Fainelli.
8) Add Mellanox Switch Driver and core infrastructure, from Jiri
Pirko.
9) Add support for "light weight tunnels", which allow for
encapsulation and decapsulation without bearing the overhead of a
full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of
others.
10) Add Identifier Locator Addressing support for ipv6, from Tom
Herbert.
11) Support fragmented SKBs in iwlwifi, from Johannes Berg.
12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia.
13) Add BQL support to 3c59x driver, from Loganaden Velvindron.
14) Stop using a zero TX queue length to mean that a device shouldn't
have a qdisc attached, use an explicit flag instead. From Phil
Sutter.
15) Use generic geneve netdevice infrastructure in openvswitch, from
Pravin B Shelar.
16) Add infrastructure to avoid re-forwarding a packet in software
that was already forwarded by a hardware switch. From Scott
Feldman.
17) Allow AF_PACKET fanout function to be implemented in a bpf
program, from Willem de Bruijn"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits)
netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled
net: fec: clear receive interrupts before processing a packet
ipv6: fix exthdrs offload registration in out_rt path
xen-netback: add support for multicast control
bgmac: Update fixed_phy_register()
sock, diag: fix panic in sock_diag_put_filterinfo
flow_dissector: Use 'const' where possible.
flow_dissector: Fix function argument ordering dependency
ixgbe: Resolve "initialized field overwritten" warnings
ixgbe: Remove bimodal SR-IOV disabling
ixgbe: Add support for reporting 2.5G link speed
ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
ixgbe: support for ethtool set_rxfh
ixgbe: Avoid needless PHY access on copper phys
ixgbe: cleanup to use cached mask value
ixgbe: Remove second instance of lan_id variable
ixgbe: use kzalloc for allocating one thing
flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c
ixgbe: Remove unused PCI bus types
...
The commit a4543a2fa9 "ring-buffer: Get timestamp after event is
allocated" is needed for some future work. But after adding it, there is a
race somewhere that causes the saved timestamp to have a slight shift, and
get ahead of the actual timestamp and make it look like time goes backwards.
I'm still looking into why this happens, but in the mean time, this is
holding up other work to get in. I'm reverting the change for now (which
makes the problem go away), and will add it back after I know what is wrong
and fix it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull core block updates from Jens Axboe:
"This first core part of the block IO changes contains:
- Cleanup of the bio IO error signaling from Christoph. We used to
rely on the uptodate bit and passing around of an error, now we
store the error in the bio itself.
- Improvement of the above from myself, by shrinking the bio size
down again to fit in two cachelines on x86-64.
- Revert of the max_hw_sectors cap removal from a revision again,
from Jeff Moyer. This caused performance regressions in various
tests. Reinstate the limit, bump it to a more reasonable size
instead.
- Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
Most devices have huge trim limits, which can cause nasty latencies
when deleting files. Enable the admin to configure the size down.
We will look into having a more sane default instead of UINT_MAX
sectors.
- Improvement of the SGP gaps logic from Keith Busch.
- Enable the block core to handle arbitrarily sized bios, which
enables a nice simplification of bio_add_page() (which is an IO hot
path). From Kent.
- Improvements to the partition io stats accounting, making it
faster. From Ming Lei.
- Also from Ming Lei, a basic fixup for overflow of the sysfs pending
file in blk-mq, as well as a fix for a blk-mq timeout race
condition.
- Ming Lin has been carrying Kents above mentioned patches forward
for a while, and testing them. Ming also did a few fixes around
that.
- Sasha Levin found and fixed a use-after-free problem introduced by
the bio->bi_error changes from Christoph.
- Small blk cgroup cleanup from Viresh Kumar"
* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
blk: Fix bio_io_vec index when checking bvec gaps
block: Replace SG_GAPS with new queue limits mask
block: bump BLK_DEF_MAX_SECTORS to 2560
Revert "block: remove artifical max_hw_sectors cap"
blk-mq: fix race between timeout and freeing request
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
Documentation: update notes in biovecs about arbitrarily sized bios
block: remove bio_get_nr_vecs()
fs: use helper bio_add_page() instead of open coding on bi_io_vec
block: kill merge_bvec_fn() completely
md/raid5: get rid of bio_fits_rdev()
md/raid5: split bio for chunk_aligned_read
block: remove split code in blkdev_issue_{discard,write_same}
btrfs: remove bio splitting and merge_bvec_fn() calls
bcache: remove driver private bio splitting code
block: simplify bio_add_page()
block: make generic_make_request handle arbitrarily sized bios
blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
block: don't access bio->bi_error after bio_put()
block: shrink struct bio down to 2 cache lines again
...
Pull scheduler updates from Ingo Molnar:
"The biggest change in this cycle is the rewrite of the main SMP load
balancing metric: the CPU load/utilization. The main goal was to make
the metric more precise and more representative - see the changelog of
this commit for the gory details:
9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
It is done in a way that significantly reduces complexity of the code:
5 files changed, 249 insertions(+), 494 deletions(-)
and the performance testing results are encouraging. Nevertheless we
need to keep an eye on potential regressions, since this potentially
affects every SMP workload in existence.
This work comes from Yuyang Du.
Other changes:
- SCHED_DL updates. (Andrea Parri)
- Simplify architecture callbacks by removing finish_arch_switch().
(Peter Zijlstra et al)
- cputime accounting: guarantee stime + utime == rtime. (Peter
Zijlstra)
- optimize idle CPU wakeups some more - inspired by Facebook server
loads. (Mike Galbraith)
- stop_machine fixes and updates. (Oleg Nesterov)
- Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)
- sched/numa tweaks. (Srikar Dronamraju)
- misc fixes and small cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
sched/deadline: Fix comment in enqueue_task_dl()
sched/deadline: Fix comment in push_dl_tasks()
sched: Change the sched_class::set_cpus_allowed() calling context
sched: Make sched_class::set_cpus_allowed() unconditional
sched: Fix a race between __kthread_bind() and sched_setaffinity()
sched: Ensure a task has a non-normalized vruntime when returning back to CFS
sched/numa: Fix NUMA_DIRECT topology identification
tile: Reorganize _switch_to()
sched, sparc32: Update scheduler comments in copy_thread()
sched: Remove finish_arch_switch()
sched, tile: Remove finish_arch_switch
sched, sh: Fold finish_arch_switch() into switch_to()
sched, score: Remove finish_arch_switch()
sched, avr32: Remove finish_arch_switch()
sched, MIPS: Get rid of finish_arch_switch()
sched, arm: Remove finish_arch_switch()
sched/fair: Clean up load average references
sched/fair: Provide runnable_load_avg back to cfs_rq
sched/fair: Remove task and group entity load when they are dead
sched/fair: Init cfs_rq's sched_entity load average
...
generalize FETCH_FUNC_NAME(memory, string) into
strncpy_from_unsafe() and fix sparse warnings that were
present in original implementation.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By extending the filter rules by more generic fields
we can write triggers filters like
echo 'stacktrace if cpu == 1' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
or
echo 'stacktrace if comm == sshd' > \
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
CPU and COMM are not part of struct trace_entry. We could add the two
new fields to ftrace_common_field list and fix up all depending
sides. But that looks pretty ugly. Another thing I would like to
avoid that the 'format' file contents changes.
All this can be avoided by introducing another list which contains
non field members of struct trace_entry.
Link: http://lkml.kernel.org/r/1439210146-24707-1-git-send-email-daniel.wagner@bmw-carit.de
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to the perf_event_map_fd and index, the function
bpf_perf_event_read() can convert the corresponding map
value to the pointer to struct perf_event and return the
Hardware PMU counter value.
Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
By copying BPF related operation to uprobe processing path, this patch
allow users attach BPF programs to uprobes like what they are already
doing on kprobes.
After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
uprobe perf event. Which make it possible to profile user space programs
and kernel events together using BPF.
Because of this patch, CONFIG_BPF_EVENTS should be selected by
CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
KPROBE_EVENT is not set.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kaixu Xia <xiakaixu@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Mathieu reported that since 317f394160 ("sched: Move the second half
of ttwu() to the remote cpu") trace_sched_wakeup() can happen out of
context of the waker.
This is a problem when you want to analyse wakeup paths because it is
now very hard to correlate the wakeup event to whoever issued the
wakeup.
OTOH trace_sched_wakeup() is issued at the point where we set
p->state = TASK_RUNNING, which is right were we hand the task off to
the scheduler, so this is an important point when looking at
scheduling behaviour, up to here its been the wakeup path everything
hereafter is due to scheduler policy.
To bridge this gap, introduce a second tracepoint: trace_sched_waking.
It is guaranteed to be called in the waker context.
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Francis Giraldeau <francis.giraldeau@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150609091336.GQ3644@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 4104d326b6 ("ftrace: Remove global function list and call function
directly") simplified the ftrace code by removing the global_ops list with a
new design. But this cleanup also broke the filtering of PIDs that are added
to the set_ftrace_pid file.
Add back the proper hooks to have pid filtering working once again.
Cc: stable@vger.kernel.org # 3.16+
Reported-by: Matt Fleming <matt@console-pimps.org>
Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
Tested-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A fine granulity support for delay would be very useful when profiling
VM logics, such as page allocation including page reclaim and memory
compaction with function graph.
Thus, this patch adds two additional marks with two changes.
- An equal sign in mark selection function is removed to align code
behavior with comments and documentation.
- The function graph example related to delay in ftrace.txt is updated
to cover all supported marks.
Link: http://lkml.kernel.org/r/1436626300-1679-3-git-send-email-jungseoklee85@gmail.com
Cc: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Jungseok Lee noticed the following:
Currently, row's width of 7-digit duration numbers not aligned with
other cases like the following example.
3) $ 3999884 us | }
3) | finish_task_switch() {
3) 0.365 us | _raw_spin_unlock_irq();
3) 3.333 us | }
3) $ 3999976 us | }
3) $ 3999979 us | } /* schedule */
As adding a single white space in case of 7-digit numbers, the format
could be unified easily as follows.
3) $ 2237472 us | }
3) | finish_task_switch() {
3) 0.364 us | _raw_spin_unlock_irq();
3) 3.125 us | }
3) $ 2237556 us | }
3) $ 2237559 us | } /* schedule */
Instead of making a special case for 7-digit numbers, the logic
of the len and the space loop is slightly modified to make the
two cases have the same format.
Link: http://lkml.kernel.org/r/1436626300-1679-2-git-send-email-jungseoklee85@gmail.com
Reported-by: Jungseok Lee <jungseoklee85@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch extends tracing_thresh functionality to function profile tracer.
If tracing_thresh is set, print those entries only,
whose average is > tracing thresh.
Link: http://lkml.kernel.org/r/1434972488-8571-1-git-send-email-umesh.t@samsung.com
Signed-off-by: Umesh Tiwari <umesh.t@samsung.com>
[ Removed unnecessary 'moved' comment ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Akashi Takahiro was porting the stack tracer to arm64 and found some
issues with it. One was that it repeats the top function, due to the
stack frame added by the mcount caller and added by itself. This
was added when fentry came in, and before fentry created its own stack
frame. But x86's fentry now creates its own stack frame, and there's
no need to insert the function again.
This also cleans up the code a bit, where it doesn't need to do something
special for fentry, and doesn't include insertion of a duplicate
entry for the called function being traced.
Link: http://lkml.kernel.org/r/55A646EE.6030402@linaro.org
Some-suggestions-by: Jungseok Lee <jungseoklee85@gmail.com>
Some-suggestions-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Functions in ring-buffer.c have gotten interleaved between different
use cases. Move the functions around to get like functions closer
together. This may or may not help gcc keep cache locality, but it
makes it a little easier to work with the code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that events only add time extends after it is committed, in case
an event comes in before it can discard the allocated event, the time
extend needs to be stored within the event. If the event is bigger
than then size needed for the time extend, padding must be added.
The minimum padding size is 8 bytes. Thus if the event is 12 bytes
(size of time extend + 4), there will not be enough room to add both
the time extend and padding. Make sure all events are either 8 bytes
or 16 or more bytes.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the capturing of the timestamp to after an event is allocated.
If the event is not a commit (where it is an event that preempted
another event), then no timestamp is needed, because the delta of
nested events is always zero.
If the event starts on a new page, no delta needs to be calculated
as the full timestamp will be added to the page header, and the
event will have a delta of zero.
Now if the event requires a time extend (the delta does not fit
in the 27 bit delta slot in the header), then the event is discarded,
the length is extended to hold the TIME_EXTEND event that allows for
a 59 bit delta, and the commit is tried again.
If the event can't be discarded (another event came in after it),
then the TIME_EXTEND is added directly to the allocated event and
the rest of the event is given padding.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Requiring a extended time stamp is an uncommon occurrence, and it is
best to do it out of line when needed.
Add a noinline function that handles the extended timestamp and
have it called with an unlikely to completely move it out of the
fast path.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add rb_event_info descriptor to pass event info to functions a bit
easier than using a bunch of parameters. This will also allow for
changing the code around a bit to find better fast paths.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In ftrace_dump, for disabling buffer, iter.tr->trace_buffer.data is used.
But for enabling, iter.trace_buffer->data is used.
Even though, both point to same buffer, for readability, same convention
should be used.
Link: http://lkml.kernel.org/r/1434972306-20043-1-git-send-email-umesh.t@samsung.com
Signed-off-by: Umesh Tiwari <umesh.t@samsung.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently exception occures due to access beyond buffer_iter
range while using index of cpu bigger than num_possible_cpus().
Below there is an example for such exception when we use
cpus 0,1,16,17.
In order to fix buffer allocation size for non-continuous cpu ids
we allocate according to the max cpu id and not according to the
amount of possible cpus.
Example:
$ cat /sys/kernel/debug/tracing/per_cpu/cpu1/trace
Path: /bin/busybox
CPU: 0 PID: 82 Comm: cat Not tainted 4.0.0 #29
task: 80734c80 ti: 80012000 task.ti: 80012000
[ECR ]: 0x00220100 => Invalid Read @ 0x00000000 by insn @ 0x800abafc
[EFA ]: 0x00000000
[BLINK ]: ring_buffer_read_finish+0x24/0x64
[ERET ]: rb_check_pages+0x20/0x188
[STAT32]: 0x00001a00 :
BTA: 0x800abafc SP: 0x80013f0c FP: 0x57719cf8
LPS: 0x200036b4 LPE: 0x200036b8 LPC: 0x00000000
r00: 0x8002aca0 r01: 0x00001606 r02: 0x00000000
r03: 0x00000001 r04: 0x00000000 r05: 0x804b4954
r06: 0x00030003 r07: 0x8002a260 r08: 0x00000286
r09: 0x00080002 r10: 0x00001006 r11: 0x807351a4
r12: 0x00000001
Stack Trace:
rb_check_pages+0x20/0x188
ring_buffer_read_finish+0x24/0x64
tracing_release+0x4e/0x170
__fput+0x62/0x158
task_work_run+0xa2/0xd4
do_notify_resume+0x52/0x7c
resume_user_mode_begin+0xdc/0xe0
Link: http://lkml.kernel.org/r/1433835155-6894-3-git-send-email-gilf@ezchip.com
Signed-off-by: Noam Camus <noamc@ezchip.com>
Signed-off-by: Gil Fruchter <gilf@ezchip.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use kcalloc for allocating an array instead of kzalloc with multiply,
as that is what kcalloc is used for.
Found with checkpatch.
Link: http://lkml.kernel.org/r/1433835155-6894-2-git-send-email-gilf@ezchip.com
Signed-off-by: Gil Fruchter <gilf@ezchip.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fengguang Wu's tests triggered a bug in the branch tracer's start up
test when CONFIG_DEBUG_PREEMPT set. This was because that config
adds some debug logic in the per cpu field, which calls back into
the branch tracer.
The branch tracer has its own recursive checks, but uses a per cpu
variable to implement it. If retrieving the per cpu variable calls
back into the branch tracer, you can see how things will break.
Instead of using a per cpu variable, use the trace_recursion field
of the current task struct. Simply set a bit when entering the
branch tracing and clear it when leaving. If the bit is set on
entry, just don't do the tracing.
There's also the case with lockdep, as the local_irq_save() called
before the recursion can also trigger code that can call back into
the function. Changing that to a raw_local_irq_save() will protect
that as well.
This prevents the recursion and the inevitable crash that follows.
Link: http://lkml.kernel.org/r/20150630141803.GA28071@wfg-t540p.sh.intel.com
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
"monitonic raw". Also some enhancements to make the ring buffer even
faster. But the biggest and most noticeable change is the renaming of
the ftrace* files, structures and variables that have to deal with
trace events.
Over the years I've had several developers tell me about their confusion
with what ftrace is compared to events. Technically, "ftrace" is the
infrastructure to do the function hooks, which include tracing and also
helps with live kernel patching. But the trace events are a separate
entity altogether, and the files that affect the trace events should
not be named "ftrace". These include:
include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h
Also, functions that are specific for trace events have also been renamed:
ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled()-> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()
Structures have been renamed:
ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call
And a few various variables and flags have also been updated.
This has been sitting in linux-next for some time, and I have not heard
a single complaint about this rename breaking anything. Mostly because
these functions, variables and structures are mostly internal to the
tracing system and are seldom (if ever) used by anything external to that.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJViYhVAAoJEEjnJuOKh9ldcJ0IAI+mytwoMAN/CWDE8pXrTrgs
aHlcr1zorSzZ0Lq6lKsWP+V0VGVhP8KWO16vl35HaM5ZB9U+cDzWiGobI8JTHi/3
eeTAPTjQdgrr/L+ZO1ApzS1jYPhN3Xi5L7xublcYMJjKfzU+bcYXg/x8gRt0QbG3
S9QN/kBt0JIIjT7McN64m5JVk2OiU36LxXxwHgCqJvVCPHUrriAdIX7Z5KRpEv13
zxgCN4d7Jiec/FsMW8dkO0vRlVAvudZWLL7oDmdsvNhnLy8nE79UOeHos2c1qifQ
LV4DeQ+2Hlu7w9wxixHuoOgNXDUEiQPJXzPc/CuCahiTL9N/urQSGQDoOVMltR4=
=hkdz
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This patch series contains several clean ups and even a new trace
clock "monitonic raw". Also some enhancements to make the ring buffer
even faster. But the biggest and most noticeable change is the
renaming of the ftrace* files, structures and variables that have to
deal with trace events.
Over the years I've had several developers tell me about their
confusion with what ftrace is compared to events. Technically,
"ftrace" is the infrastructure to do the function hooks, which include
tracing and also helps with live kernel patching. But the trace
events are a separate entity altogether, and the files that affect the
trace events should not be named "ftrace". These include:
include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h
Also, functions that are specific for trace events have also been renamed:
ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled() -> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()
Structures have been renamed:
ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call
And a few various variables and flags have also been updated.
This has been sitting in linux-next for some time, and I have not
heard a single complaint about this rename breaking anything. Mostly
because these functions, variables and structures are mostly internal
to the tracing system and are seldom (if ever) used by anything
external to that"
* tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
ring_buffer: Allow to exit the ring buffer benchmark immediately
ring-buffer-benchmark: Fix the wrong type
ring-buffer-benchmark: Fix the wrong param in module_param
ring-buffer: Add enum names for the context levels
ring-buffer: Remove useless unused tracing_off_permanent()
ring-buffer: Give NMIs a chance to lock the reader_lock
ring-buffer: Add trace_recursive checks to ring_buffer_write()
ring-buffer: Allways do the trace_recursive checks
ring-buffer: Move recursive check to per_cpu descriptor
ring-buffer: Add unlikelys to make fast path the default
tracing: Rename ftrace_get_offsets_##call() to trace_event_get_offsets_##call()
tracing: Rename ftrace_define_fields_##call() to trace_event_define_fields_##call()
tracing: Rename ftrace_event_type_funcs_##call to trace_event_type_funcs_##call
tracing: Rename ftrace_data_offset_##call to trace_event_data_offset_##call
tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##call
tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled()
tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_*
tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dir
tracing: Rename ftrace_event_name() to trace_event_name()
tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAX
...
I could not come up with a situation where the operand counter (cnt)
could go below zero, so I added a WARN_ON_ONCE(cnt < 0). Vince was
able to trigger that warn on with his fuzzer test, but didn't have
a filter input that caused it.
Later, Sasha Levin was able to trigger that same warning, and was
able to give me the filter string that triggered it. It was simply
a single operation ">".
I wrapped the filtering code in a userspace program such that I could
single step through the logic. With a single operator the operand
counter can legitimately go below zero, and should be reported to the
user as an error, but should not produce a kernel warning. The
WARN_ON_ONCE(cnt < 0) should be just a "if (cnt < 0) break;" and the
code following it will produce the error message for the user.
While debugging this, I found that there was another bug that let
the pointer to the filter string go beyond the filter string.
This too was fixed.
Finally, there was a typo in a stub function that only gets compiled
if trace events is disabled but tracing is enabled (I'm not even sure
that's possible).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVjWh2AAoJEEjnJuOKh9ldOn0IANHPW82++0O87U1pEe3hHnKv
gSTKiNPVNC3GBt9DHnawA0EuyPfPa+Wj5X2xgrstWA+KRADZErZzdWpzbh/iHosJ
0kaUFqFcaKBheOSqhHfz3WQshD16pb1lQYbV7vbdzMjpcIpYT3VcuKQq3zQVb5Pr
njPmgZXK+I9ITYQ8E+DysnTg0+Mo+l/2P/tqnBoIkAVmuZitfJS5okTtVw1GNzyR
7VRMGBE3G0GxB++57T/xILXjFc9sSGQH5lZgLHQhEh36YgWuDvc0R2FfxDKROmeq
b/xw68uCO1Hv8oEng6r/UceVtUoaXhf+JamSJqxztBTsjsqR/iXCV78Jac1vnPY=
=cmr8
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This isn't my 4.2 pull request (yet). I found a few more bugs that I
would have sent to fix 4.1, but since 4.1 is already out, I'm sending
this before sending my 4.2 request (which is ready to go).
After fixing the previous filter issue reported by Vince Weaver, I
could not come up with a situation where the operand counter (cnt)
could go below zero, so I added a WARN_ON_ONCE(cnt < 0). Vince was
able to trigger that warn on with his fuzzer test, but didn't have a
filter input that caused it.
Later, Sasha Levin was able to trigger that same warning, and was able
to give me the filter string that triggered it. It was simply a
single operation ">".
I wrapped the filtering code in a userspace program such that I could
single step through the logic. With a single operator the operand
counter can legitimately go below zero, and should be reported to the
user as an error, but should not produce a kernel warning. The
WARN_ON_ONCE(cnt < 0) should be just a "if (cnt < 0) break;" and the
code following it will produce the error message for the user.
While debugging this, I found that there was another bug that let the
pointer to the filter string go beyond the filter string. This too
was fixed.
Finally, there was a typo in a stub function that only gets compiled
if trace events is disabled but tracing is enabled (I'm not even sure
that's possible)"
* tag 'trace-fixes-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix typo from "static inlin" to "static inline"
tracing/filter: Do not allow infix to exceed end of string
tracing/filter: Do not WARN on operand count going below zero
Part of the disassembly of do_blk_trace_setup:
231b: e8 00 00 00 00 callq 2320 <do_blk_trace_setup+0x50>
231c: R_X86_64_PC32 strlen+0xfffffffffffffffc
2320: eb 0a jmp 232c <do_blk_trace_setup+0x5c>
2322: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
2328: 48 83 c3 01 add $0x1,%rbx
232c: 48 39 d8 cmp %rbx,%rax
232f: 76 47 jbe 2378 <do_blk_trace_setup+0xa8>
2331: 41 80 3c 1c 2f cmpb $0x2f,(%r12,%rbx,1)
2336: 75 f0 jne 2328 <do_blk_trace_setup+0x58>
2338: 41 c6 04 1c 5f movb $0x5f,(%r12,%rbx,1)
233d: 4c 89 e7 mov %r12,%rdi
2340: e8 00 00 00 00 callq 2345 <do_blk_trace_setup+0x75>
2341: R_X86_64_PC32 strlen+0xfffffffffffffffc
2345: eb e1 jmp 2328 <do_blk_trace_setup+0x58>
Yep, that's right: gcc isn't smart enough to realize that replacing '/' by
'_' cannot change the strlen(), so we call it again and again (at least
when a '/' is found). Even if gcc were that smart, this construction
would still loop over the string twice, once for the initial strlen() call
and then the open-coded loop.
Let's simply use strreplace() instead.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Liked-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no point in starting over every time we see a ','...
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The trace.h header when called without CONFIG_EVENT_TRACING enabled
(seldom done), will not compile because of a typo in the protocol
of trace_event_enum_update().
Cc: stable@vger.kernel.org # 4.1+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While debugging a WARN_ON() for filtering, I found that it is possible
for the filter string to be referenced after its end. With the filter:
# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
The filter_parse() function can call infix_get_op() which calls
infix_advance() that updates the infix filter pointers for the cnt
and tail without checking if the filter is already at the end, which
will put the cnt to zero and the tail beyond the end. The loop then calls
infix_next() that has
ps->infix.cnt--;
return ps->infix.string[ps->infix.tail++];
The cnt will now be below zero, and the tail that is returned is
already passed the end of the filter string. So far the allocation
of the filter string usually has some buffer that is zeroed out, but
if the filter string is of the exact size of the allocated buffer
there's no guarantee that the charater after the nul terminating
character will be zero.
Luckily, only root can write to the filter.
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When testing the fix for the trace filter, I could not come up with
a scenario where the operand count goes below zero, so I added a
WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
that it can happen (although the filter would be wrong).
# echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter
That is, a single operation without any operands will hit the path
where the WARN_ON_ONCE() can trigger. Although this is harmless,
and the filter is reported as a error. But instead of spitting out
a warning to the kernel dmesg, just fail nicely and report it via
the proper channels.
Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
When the following filter is used it causes a warning to trigger:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: No error
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
Modules linked in: bnep lockd grace bluetooth ...
CPU: 3 PID: 1223 Comm: bash Tainted: G W 4.1.0-rc3-test+ #450
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
Call Trace:
[<ffffffff816ed4f9>] dump_stack+0x4c/0x6e
[<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0
[<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80
[<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20
[<ffffffff81159065>] replace_preds+0x3c5/0x990
[<ffffffff811596b2>] create_filter+0x82/0xb0
[<ffffffff81159944>] apply_event_filter+0xd4/0x180
[<ffffffff81152bbf>] event_filter_write+0x8f/0x120
[<ffffffff811db2a8>] __vfs_write+0x28/0xe0
[<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0
[<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0
[<ffffffff811dc408>] vfs_write+0xb8/0x1b0
[<ffffffff811dc72f>] SyS_write+0x4f/0xb0
[<ffffffff816f5217>] system_call_fastpath+0x12/0x6a
---[ end trace e11028bd95818dcd ]---
Worse yet, reading the error message (the filter again) it says that
there was no error, when there clearly was. The issue is that the
code that checks the input does not check for balanced ops. That is,
having an op between a closed parenthesis and the next token.
This would only cause a warning, and fail out before doing any real
harm, but it should still not caues a warning, and the error reported
should work:
# cd /sys/kernel/debug/tracing
# echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
# cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: Meaningless filter expression
And give no kernel warning.
Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: stable@vger.kernel.org # 2.6.31+
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It's useful to do per-cpu histograms.
Suggested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note, it's DEBUG ONLY helper. If it's used in the program,
the kernel will print warning banner to make sure users don't use
it in production.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:
u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid
u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid
bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf
They can be used from the programs attached to TC as well to classify packets
based on current task fields.
Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It takes a while until the ring_buffer_benchmark module is removed
when the ring buffer hammer is running. It is because it takes
few seconds and kthread_should_stop() is not being checked.
This patch adds the check for kthread termination into the producer.
It uses the existing @kill_test flag to finish the kthreads as
cleanly as possible.
It disables printing the "ERROR" message when the kthread is going.
It makes sure that producer does not go into the 10sec sleep
when it is being killed.
Finally, it does not call wait_to_die() when kthread_should_stop()
already returns true.
Link: http://lkml.kernel.org/r/20150615155428.GD3135@pathway.suse.cz
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The macro 'module_param' shows that the type of the
variable disable_reader and write_iteration is unsigned
integer. so, we change their type form int to unsigned int.
Link: http://lkml.kernel.org/r/1433923927-67782-1-git-send-email-long.wanglong@huawei.com
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The {producer|consumer}_{nice|fifo} parameters are integer
type, we should use 'int' as the second param in module_param.
For example(consumer_fifo):
the default value of consumer_fifo is -1.
Without this patch:
# cat /sys/module/ring_buffer_benchmark/parameters/consumer_fifo
4294967295
With this patch:
# cat /sys/module/ring_buffer_benchmark/parameters/consumer_fifo
-1
Link: http://lkml.kernel.org/r/1433923873-67712-1-git-send-email-long.wanglong@huawei.com
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As this is already exported from tracing side via commit d9847d310a
("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might
as well want to move it to the core, so also networking users can make
use of it, e.g. to measure diffs for certain flows from ingress/egress.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tracing_off_permanent() call is a way to disable all ring_buffers.
Nothing uses it and nothing should use it, as tracing_off() and
friends are better, as they disable the ring buffers related to
tracing. The tracing_off_permanent() even disabled non tracing
ring buffers. This is a bit drastic, and was added to handle NMIs
doing outputs that could corrupt the ring buffer when only tracing
used them. It is now obsolete and adds a little overhead, it should
be removed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, if an NMI does a dump of a ring buffer, it disables
all ring buffers from ever doing any writes again. This is because
it wont take the locks for the cpu_buffer and this can cause
corruption if it preempted a read, or a read happens on another
CPU for the current cpu buffer. This is a bit overkill.
First, it should at least try to take the lock, and if it fails
then disable it. Also, there's no need to disable all ring
buffers, even those that are unrelated to what is being read.
Only disable the per cpu ring buffer that is being read if
it can not get the lock for it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring_buffer_write() function isn't protected by the trace recursive
writes. Luckily, this function is not used as much and is unlikely
to ever recurse. But it should still have the protection, because
even a call to ring_buffer_lock_reserve() could cause ring buffer
corruption if called when ring_buffer_write() is being used.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the trace_recursive checks are only done if CONFIG_TRACING
is enabled. That was because there use to be a dependency with tracing
for the recursive checks (it used the task_struct trace recursive
variable). But now it uses its own variable and there is no dependency.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of using a global per_cpu variable to perform the recursive
checks into the ring buffer, use the already existing per_cpu descriptor
that is part of the ring buffer itself.
Not only does this simplify the code, it also allows for one ring buffer
to be used within the guts of the use of another ring buffer. For example
trace_printk() can now be used within the ring buffer to record changes
done by an instance into the main ring buffer. The recursion checks
will prevent the trace_printk() itself from causing recursive issues
with the main ring buffer (it is just ignored), but the recursive
checks wont prevent the trace_printk() from recording other ring buffers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases:
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The ftrace_raw_##call structures are built
by macros for trace events. They have nothing to do with function tracing.
Rename them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The ftrace_trigger_soft_disabled() tests if a
trace_event is soft disabled (called but not traced), and returns true if
it is. It has nothing to do with function tracing and should be renamed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The FTRACE_EVENT_FL_* flags are flags to
do with the trace_event files in the tracefs directory. They are not related
to function tracing. Rename them to a more descriptive name.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The structure ftrace_subsystem_dir holds
the information about trace event subsystems. It should not be named
ftrace, rename it to trace_subsystem_dir.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. ftrace_event_name() returns the name of
an event tracepoint, has nothing to do with function tracing. Rename it
to trace_event_name().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. Rename the max trace_event type size to
something more descriptive and appropriate.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The ftrace_output_*() and ftrace_raw_output_*()
functions represent the trace_event code. Rename them to just trace_output
or trace_raw_output.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The ftrace_event_buffer functions and data
structures are for trace_events and not for function hooks. Rename them
to trace_event_buffer*.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The structures ftrace_event_call and
ftrace_event_class have nothing to do with the function hooks, and are
really trace_event structures. Rename ftrace_event_* to trace_event_*.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The structure ftrace_event_file is really
about trace events and not "ftrace". Rename it to trace_event_file.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The functions (un)register_ftrace_event() is
really about trace_events, and the name should be register_trace_event()
instead.
Also renamed ftrace_event_reg() to trace_event_reg() for the same reason.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name "ftrace" really refers to the function hook infrastructure. It
is not about the trace_events. The functions ftrace_print_*() are not part of
the function infrastructure, and the names can be confusing. Rename them
to be trace_print_*().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The term "ftrace" is really the infrastructure of the function hooks,
and not the trace events. Rename ftrace_event.h to trace_events.h to
represent the trace_event infrastructure and decouple the term ftrace
from it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Expose the NMI safe accessor to the monotonic raw clock to the
tracer. The mono clock was added with commit
1b3e5c0936. The advantage of the
monotonic raw clock is that it will advance more constantly than the
monotonic clock.
Imagine someone is trying to optimize a particular program to reduce
instructions executed for a given workload while minimizing the effect
on runtime. Also suppose that NTP is running and potentially making
larger adjustments to the monotonic clock. If NTP is adjusting the
monotonic clock to advance more rapidly, the program will appear to
use fewer instructions per second but run longer than if the monotonic
raw clock had been used. The total number of instructions observed
would be the same regardless of the clock source used, but how it's
attributed to time would be affected.
Conversely if NTP is adjusting the monotonic clock to advance more
slowly, the program will appear to use more instructions per second
but run more quickly. Of course there are many sources that can cause
jitter in performance measurements on modern processors, but let's
remove NTP from the list.
The monotonic raw clock can also be useful for tracing early boot,
e.g. when debugging issues with NTP.
Link: http://lkml.kernel.org/r/20150508143037.GB1276@dreric01-Precision-T1650
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Drew Richardson <drew.richardson@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Critical tracepoint hooks should never call anything that takes a lock,
so they are unable to call getrawmonotonic() or ktime_get().
Export the rest of the tracing clock functions so can be used in
tracepoint hooks.
Background: We have a customer that adds their own module and registers
a tracepoint hook to sched_wakeup. They were using ktime_get() for a
time source, but it grabs a seq lock and caused a deadlock to occur.
Link: http://lkml.kernel.org/r/1430406624-22609-1-git-send-email-jsnitsel@redhat.com
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only caller to this function (__print_array) was getting it wrong by
passing the array length instead of buffer length. As the element size
was already being passed for other reasons it seems reasonable to push
the calculation of buffer length into the function.
Link: http://lkml.kernel.org/r/1430320727-14582-1-git-send-email-alex.bennee@linaro.org
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
The first is a bug when ftrace_dump_on_oops is triggered in atomic context
and function graph tracer is the tracer that is being reported.
The second fix is bad parsing of the trace_events from the kernel
command line, where it would ignore specific events if the system
name is used with defining the event(it enables all events within the
system).
The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was missing
to see if the ptr was incremented to the end of the string, but the loop
increments it again and can miss the nul delimiter to stop processing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVN7g7AAoJEEjnJuOKh9ldLCkIAJj6wsj59wR8aIxtxnD5bJZZ
Y9dKkas6BOaUCGp0MVBCkpm3uMgh0uPO10jOuthgc3LgQy0piwKJrIzGkI5gcRuA
JvZw2X08/jCSNu8BHydes2A0XMkZobMuWFAeWz3CSzNBfbI3sDDqgnRQ9eyMM66R
+sPV7ZELQLK/ZFs93gFoaZe/OKGYJcamhMtG2v7p9h1qBJZUDakRFo478bAwu5SB
Kh6xpXA76WnmkeQekAAsWGdqIQzoNy3IoVePmdhVpZvoiKLUrf1JWVYFoNkO39Ik
kC1gqJro0EmHQFOo5rt8yQmqXjvSU0sS/sAW3HZ0Szl5OtiUNnag+Wt3ku7lr8U=
=VORa
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This adds three fixes for the tracing code.
The first is a bug when ftrace_dump_on_oops is triggered in atomic
context and function graph tracer is the tracer that is being
reported.
The second fix is bad parsing of the trace_events from the kernel
command line, where it would ignore specific events if the system name
is used with defining the event(it enables all events within the
system).
The last one is a fix to the TRACE_DEFINE_ENUM(), where a check was
missing to see if the ptr was incremented to the end of the string,
but the loop increments it again and can miss the nul delimiter to
stop processing"
* tag 'trace-v4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix possible out of bounds memory access when parsing enums
tracing: Fix incorrect enabling of trace events by boot cmdline
tracing: Handle ftrace_dump() atomic context in graph_trace_open()
The code that replaces the enum names with the enum values in the
tracepoints' format files could possible miss the end of string nul
character. This was caused by processing things like backslashes, quotes
and other tokens. After processing the tokens, a check for the nul
character needed to be done before continuing the loop, because the loop
incremented the pointer before doing the check, which could bypass the nul
character.
Link: http://lkml.kernel.org/r/552E661D.5060502@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com> # via KASan
Tested-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Fixes: 0c564a538a "tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is a problem that trace events are not properly enabled with
boot cmdline. The problem is that if we pass "trace_event=kmem:mm_page_alloc"
to the boot cmdline, it enables all kmem trace events, and not just
the page_alloc event.
This is caused by the parsing mechanism. When we parse the cmdline, the buffer
contents is modified due to tokenization. And, if we use this buffer
again, we will get the wrong result.
Unfortunately, this buffer is be accessed three times to set trace events
properly at boot time. So, we need to handle this situation.
There is already code handling ",", but we need another for ":".
This patch adds it.
Link: http://lkml.kernel.org/r/1429159484-22977-1-git-send-email-iamjoonsoo.kim@lge.com
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[ added missing return ret; ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The seq_printf return value, because it's frequently misused,
will eventually be converted to void.
See: commit 1f33c41c03 ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")
Miscellanea:
o Remove unused return value from trace_lookup_stack
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
relayfs and tracefs are dealing with inodes of their own;
those two act as filesystem drivers
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull perf changes from Ingo Molnar:
"Core kernel changes:
- One of the more interesting features in this cycle is the ability
to attach eBPF programs (user-defined, sandboxed bytecode executed
by the kernel) to kprobes.
This allows user-defined instrumentation on a live kernel image
that can never crash, hang or interfere with the kernel negatively.
(Right now it's limited to root-only, but in the future we might
allow unprivileged use as well.)
(Alexei Starovoitov)
- Another non-trivial feature is per event clockid support: this
allows, amongst other things, the selection of different clock
sources for event timestamps traced via perf.
This feature is sought by people who'd like to merge perf generated
events with external events that were measured with different
clocks:
- cluster wide profiling
- for system wide tracing with user-space events,
- JIT profiling events
etc. Matching perf tooling support is added as well, available via
the -k, --clockid <clockid> parameter to perf record et al.
(Peter Zijlstra)
Hardware enablement kernel changes:
- x86 Intel Processor Trace (PT) support: which is a hardware tracer
on steroids, available on Broadwell CPUs.
The hardware trace stream is directly output into the user-space
ring-buffer, using the 'AUX' data format extension that was added
to the perf core to support hardware constraints such as the
necessity to have the tracing buffer physically contiguous.
This patch-set was developed for two years and this is the result.
A simple way to make use of this is to use BTS tracing, the PT
driver emulates BTS output - available via the 'intel_bts' PMU.
More explicit PT specific tooling support is in the works as well -
will probably be ready by 4.2.
(Alexander Shishkin, Peter Zijlstra)
- x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
feature of Intel Xeon CPUs that allows the measurement and
allocation/partitioning of caches to individual workloads.
These kernel changes expose the measurement side as a new PMU
driver, which exposes various QoS related PMU events. (The
partitioning change is work in progress and is planned to be merged
as a cgroup extension.)
(Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
Waskiewicz Jr)
- x86 Intel Haswell LBR call stack support: this is a new Haswell
feature that allows the hardware recording of call chains, plus
tooling support. To activate this feature you have to enable it
via the new 'lbr' call-graph recording option:
perf record --call-graph lbr
perf report
or:
perf top --call-graph lbr
This hardware feature is a lot faster than stack walk or dwarf
based unwinding, but has some limitations:
- It reuses the current LBR facility, so LBR call stack and
branch record can not be enabled at the same time.
- It is only available for user-space callchains.
(Yan, Zheng)
- x86 Intel Broadwell CPU support and various event constraints and
event table fixes for earlier models.
(Andi Kleen)
- x86 Intel HT CPUs event scheduling workarounds. This is a complex
CPU bug affecting the SNB,IVB,HSW families that results in counter
value corruption. The mitigation code is automatically enabled and
is transparent.
(Maria Dimakopoulou, Stephane Eranian)
The perf tooling side had a ton of changes in this cycle as well, so
I'm only able to list the user visible changes here, in addition to
the tooling changes outlined above:
User visible changes affecting all tools:
- Improve support of compressed kernel modules (Jiri Olsa)
- Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
- Bash completion for subcommands (Yunlong Song)
- Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
- Support missing -f to override perf.data file ownership. (Yunlong Song)
- Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)
User visible changes in individual tools:
'perf data':
New tool for converting perf.data to other formats, initially
for the CTF (Common Trace Format) from LTTng (Jiri Olsa,
Sebastian Siewior)
'perf diff':
Add --kallsyms option (David Ahern)
'perf list':
Allow listing events with 'tracepoint' prefix (Yunlong Song)
Sort the output of the command (Yunlong Song)
'perf kmem':
Respect -i option (Jiri Olsa)
Print big numbers using thousands' group (Namhyung Kim)
Allow -v option (Namhyung Kim)
Fix alignment of slab result table (Namhyung Kim)
'perf probe':
Support multiple probes on different binaries on the same command line (Masami Hiramatsu)
Support unnamed union/structure members data collection. (Masami Hiramatsu)
Check kprobes blacklist when adding new events. (Masami Hiramatsu)
'perf record':
Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)
Support recording running/enabled time (Andi Kleen)
'perf sched':
Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song)
'perf report' and 'perf top':
Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo)
Indicate which callchain entries are annotated in the
TUI hists browser (Arnaldo Carvalho de Melo)
Add pid/tid filtering to 'report' and 'script' commands (David Ahern)
Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one
cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT
events (Arnaldo Carvalho de Melo)
'perf stat':
Report unsupported events properly (Suzuki K. Poulose)
Output running time and run/enabled ratio in CSV mode (Andi Kleen)
'perf trace':
Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)
Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)
Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)
Dump stack on segfaults (Arnaldo Carvalho de Melo)
No need to explicitely enable evsels for workload started from perf, let it
be enabled via perf_event_attr.enable_on_exec, removing some events that take
place in the 'perf trace' before a workload is really started by it.
(Arnaldo Carvalho de Melo)
Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo)
There's also been a ton of infrastructure work done, such as the
split-out of perf's build system into tools/build/ and other changes -
see the shortlog and changelog for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
perf evlist: Fix type for references to data_head/tail
perf probe: Check the orphaned -x option
perf probe: Support multiple probes on different binaries
perf buildid-list: Fix segfault when show DSOs with hits
perf tools: Fix cross-endian analysis
perf tools: Fix error path to do closedir() when synthesizing threads
perf tools: Fix synthesizing fork_event.ppid for non-main thread
perf tools: Add 'I' event modifier for exclude_idle bit
perf report: Don't call map__kmap if map is NULL.
perf tests: Fix attr tests
perf probe: Fix ARM 32 building error
perf tools: Merge all perf_event_attr print functions
perf record: Add clockid parameter
perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
perf sched replay: Support using -f to override perf.data file ownership
perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
...
of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.
Tracepoints have helper functions for the TP_printk() called
__print_symbolic() and __print_flags() that lets a numeric number be
displayed as a a human comprehensible text. What is placed in the
TP_printk() is also shown in the tracepoint format file such that
user space tools like perf and trace-cmd can parse the binary data
and express the values too. Unfortunately, the way the TRACE_EVENT()
macro works, anything placed in the TP_printk() will be shown pretty
much exactly as is. The problem arises when enums are used. That's
because unlike macros, enums will not be changed into their values
by the C pre-processor. Thus, the enum string is exported to the
format file, and this makes it useless for user space tools.
The TRACE_DEFINE_ENUM() solves this by converting the enum strings
in the TP_printk() format into their number, and that is what is
shown to user space. For example, the tracepoint tlb_flush currently
has this in its format file:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
After adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
Its format file will contain this:
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVLBTuAAoJEEjnJuOKh9ldjHMIALdRS755TXCZGOf0r7O2akOR
wMPeum7C+ae1mH+jCsJKUC0/jUfQKaMt/UxoHlipDgcGg8kD2jtGnGCw4Xlwvdsr
y4rFmcTRSl1mo0zDSsg6ujoupHlVYN0+JPjrd7S3cv/llJoY49zcanNLF7S2XLeM
dZCtWRLWYpBiWO68ai6AqJTnE/eGFIqBI048qb5Eg8dbK243SSeSIf9Ywhb+VsA+
aq6F7cWI/H6j4tbeza8tAN19dcwenDro5EfCDY8ARQHJu1f6Y3+DLf2imjkd6Aiu
JVAoGIjHIpI+djwCZC1u4gi4urjfOqYartrM3Q54tb3YWYqHeNqP2ASI2a4EpYk=
=Ixwt
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Some clean ups and small fixes, but the biggest change is the addition
of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.
Tracepoints have helper functions for the TP_printk() called
__print_symbolic() and __print_flags() that lets a numeric number be
displayed as a a human comprehensible text. What is placed in the
TP_printk() is also shown in the tracepoint format file such that user
space tools like perf and trace-cmd can parse the binary data and
express the values too. Unfortunately, the way the TRACE_EVENT()
macro works, anything placed in the TP_printk() will be shown pretty
much exactly as is. The problem arises when enums are used. That's
because unlike macros, enums will not be changed into their values by
the C pre-processor. Thus, the enum string is exported to the format
file, and this makes it useless for user space tools.
The TRACE_DEFINE_ENUM() solves this by converting the enum strings in
the TP_printk() format into their number, and that is what is shown to
user space. For example, the tracepoint tlb_flush currently has this
in its format file:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
After adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
Its format file will contain this:
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })"
* tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
tracing: Add enum_map file to show enums that have been mapped
writeback: Export enums used by tracepoint to user space
v4l: Export enums used by tracepoints to user space
SUNRPC: Export enums in tracepoints to user space
mm: tracing: Export enums in tracepoints to user space
irq/tracing: Export enums in tracepoints to user space
f2fs: Export the enums in the tracepoints to userspace
net/9p/tracing: Export enums in tracepoints to userspace
x86/tlb/trace: Export enums in used by tlb_flush tracepoint
tracing/samples: Update the trace-event-sample.h with TRACE_DEFINE_ENUM()
tracing: Allow for modules to convert their enums to values
tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values
tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentation
tracing: Give system name a pointer
brcmsmac: Move each system tracepoints to their own header
iwlwifi: Move each system tracepoints to their own header
mac80211: Move message tracepoints to their own header
tracing: Add TRACE_SYSTEM_VAR to xhci-hcd
tracing: Add TRACE_SYSTEM_VAR to kvm-s390
tracing: Add TRACE_SYSTEM_VAR to intel-sst
...
more than one release, as I had it ready for the 4.0 merge window, but
a last minute thing that needed to go into Linux first had to be done.
That was that perf hard coded the file system number when reading
/sys/kernel/debugfs/tracing directory making sure that the path had
the debugfs mount # before it would parse the tracing file. This broke
other use cases of perf, and the check is removed.
Now when mounting /sys/kernel/debug, tracefs is automatically mounted
in /sys/kernel/debug/tracing such that old tools will still see that
path as expected. But now system admins can mount tracefs directly
and not need to mount debugfs, which can expose security issues.
A new directory is created when tracefs is configured such that
system admins can now mount it separately (/sys/kernel/tracing).
This branch is based off of Al Viro's vfs debugfs_automount branch
at commit 163f9eb95a
debugfs: Provide a file creation function that also takes an initial size
to get the debugfs_create_automount() operation.
I just noticed that Al rebased the pull to add his Signed-off-by to
that commit, and the commit is now e59b4e9187.
I did a git diff of those two and see they are the same. Only the
latter has Al's SOB.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVLA6YAAoJEEjnJuOKh9ldv6AH/1JUINDQwV+M0VTwzbLogloo
Sco0byLhskmx5KLVD7Vs8BJAGrgHTdit32kzBGmLGJvVCKBa+c8lwmRw6rnXg3uX
K4kGp7BIyn1/geXoGpCmDKaLGXhDcw49hRzejKDg/OqFtxKTsSeQtG8fo29ps9Do
0VaF6UDp8gYplC2N2BfpB59LVndrITQ3mSsBBeFPvS7IxFJXAhDBOq2yi0aI6HyJ
ICo2L/bA9HLxMuceWrXbsun+RP68+AQlnFfAtok7AcuBzUYPCKY0shT2VMOUtpTt
1dGxMxq6q1ACfmY7gbp47WMX9aKjWcSEr0V+IYx/xex6Maf0Xsujsy99bKYUWvs=
=OcgU
-----END PGP SIGNATURE-----
Merge tag 'trace-4.1-tracefs' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracefs from Steven Rostedt:
"This adds the new tracefs file system.
This has been in linux-next for more than one release, as I had it
ready for the 4.0 merge window, but a last minute thing that needed to
go into Linux first had to be done. That was that perf hard coded the
file system number when reading /sys/kernel/debugfs/tracing directory
making sure that the path had the debugfs mount # before it would
parse the tracing file. This broke other use cases of perf, and the
check is removed.
Now when mounting /sys/kernel/debug, tracefs is automatically mounted
in /sys/kernel/debug/tracing such that old tools will still see that
path as expected. But now system admins can mount tracefs directly
and not need to mount debugfs, which can expose security issues. A
new directory is created when tracefs is configured such that system
admins can now mount it separately (/sys/kernel/tracing)"
* tag 'trace-4.1-tracefs' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have mkdir and rmdir be part of tracefs
tracefs: Add directory /sys/kernel/tracing
tracing: Automatically mount tracefs on debugfs/tracing
tracing: Convert the tracing facility over to use tracefs
tracefs: Add new tracefs file system
tracing: Create cmdline tracer options on tracing fs init
tracing: Only create tracer options files if directory exists
debugfs: Provide a file creation function that also takes an initial size
Add a enum_map file in the tracing directory to see what enums have been
saved to convert in the print fmt files.
As this requires the enum mapping to be persistent in memory, it is only
created if the new config option CONFIG_TRACE_ENUM_MAP_FILE is enabled.
This is for debugging and will increase the persistent memory footprint
of the kernel.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Update the infrastructure such that modules that declare TRACE_DEFINE_ENUM()
will have those enums converted into their values in the tracepoint
print fmt strings.
Link: http://lkml.kernel.org/r/87vbhjp74q.fsf@rustcorp.com.au
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Several tracepoints use the helper functions __print_symbolic() or
__print_flags() and pass in enums that do the mapping between the
binary data stored and the value to print. This works well for reading
the ASCII trace files, but when the data is read via userspace tools
such as perf and trace-cmd, the conversion of the binary value to a
human string format is lost if an enum is used, as userspace does not
have access to what the ENUM is.
For example, the tracepoint trace_tlb_flush() has:
__print_symbolic(REC->reason,
{ TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
{ TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
{ TLB_LOCAL_SHOOTDOWN, "local shootdown" },
{ TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })
Which maps the enum values to the strings they represent. But perf and
trace-cmd do no know what value TLB_LOCAL_MM_SHOOTDOWN is, and would
not be able to map it.
With TRACE_DEFINE_ENUM(), developers can place these in the event header
files and ftrace will convert the enums to their values:
By adding:
TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);
$ cat /sys/kernel/debug/tracing/events/tlb/tlb_flush/format
[...]
__print_symbolic(REC->reason,
{ 0, "flush on task switch" },
{ 1, "remote shootdown" },
{ 2, "local shootdown" },
{ 3, "local mm shootdown" })
The above is what userspace expects to see, and tools do not need to
be modified to parse them.
Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org
Cc: Guilherme Cox <cox@computer.org>
Cc: Tony Luck <tony.luck@gmail.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Dynamically allocated trampolines call ftrace_ops_get_func to get the
function which they should call. For dynamic fops (FTRACE_OPS_FL_DYNAMIC
flag is set) ftrace_ops_list_func is always returned. This is reasonable
for static trampolines but goes against the main advantage of dynamic
ones, that is avoidance of going through the list of all registered
callbacks for functions that are only being traced by a single callback.
We can fix it by returning ops->func (or recursion safe version) from
ftrace_ops_get_func whenever it is possible for dynamic trampolines.
Note that dynamic trampolines are not allowed for dynamic fops if
CONFIG_PREEMPT=y.
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1501291023000.25445@pobox.suse.cz
Link: http://lkml.kernel.org/r/1424357773-13536-1-git-send-email-mbenes@suse.cz
Reported-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only
dependency, it also depends on the tracing infrastructure and on kprobes,
without which it will fail to build with:
In file included from kernel/trace/bpf_trace.c:14:0:
kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’:
kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’
unsigned int val = current->trace_recursion;
[...]
It took quite some time to trigger this build failure, because right now
BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL
more configurable, not just under CONFIG_EXPERT.
If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing
gateway as well.
We might want to make this an interactive option later on, although
I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of
an indicator that the user wants BPF support.
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Debugging of BPF programs needs some form of printk from the
program, so let programs call limited trace_printk() with %d %u
%x %p modifiers only.
Similar to kernel modules, during program load verifier checks
whether program is calling bpf_trace_printk() and if so, kernel
allocates trace_printk buffers and emits big 'this is debug
only' banner.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
bpf_ktime_get_ns() is used by programs to compute time delta
between events or as a timestamp
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};
event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
add TRACE_EVENT_FL_KPROBE flag to differentiate kprobe type of
tracepoints, since bpf programs can only be attached to kprobe
type of PERF_TYPE_TRACEPOINT perf events.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A clean up of the recursive protection code changed
val = this_cpu_read(current_context);
val--;
val &= this_cpu_read(current_context);
to
val = this_cpu_read(current_context);
val &= val & (val - 1);
Which has a duplicate use of '&' as the above is the same as
val = val & (val - 1);
Actually, it would be best to remove that line altogether and
just add it to where it is used.
And Christoph even mentioned that it can be further compacted to
just a single line:
__this_cpu_and(current_context, __this_cpu_read(current_context) - 1);
Link: http://lkml.kernel.org/alpine.DEB.2.11.1503271423580.23114@gentwo.org
Suggested-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The commit that added a check for this to checkpatch says:
"Using weak declarations can have unintended link defects. The __weak on
the declaration causes non-weak definitions to become weak."
In this case, when a PowerPC kernel is built with CONFIG_KPROBE_EVENT
but not CONFIG_UPROBE_EVENT, it generates the following warning:
WARNING: 1 bad relocations
c0000000014f2190 R_PPC64_ADDR64 uprobes_fetch_type_table
This is fixed by passing the fetch_table arrays to
traceprobe_parse_probe_arg() which also means that they can never be NULL.
Link: http://lkml.kernel.org/r/20150312165834.4482cb48@canb.auug.org.au
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
TRACE_EVENT_FL_USE_CALL_FILTER flag in ftrace:functon event can be
removed. This flag was first introduced in commit
f306cc82a9 ("tracing: Update event filters for multibuffer").
Now, the only place uses this flag is ftrace:function, but the filter of
ftrace:function has a different code path with events/syscalls and
events/tracepoints. It uses ftrace_filter_write() and perf's
ftrace_profile_set_filter() to set the filter, the functionality of file
'tracing/events/ftrace/function/filter' is bypassed in function
init_pred(), in which case, neither call->filter nor file->filter is
used.
So we can safely remove TRACE_EVENT_FL_USE_CALL_FILTER flag from
ftrace:function events.
Link: http://lkml.kernel.org/r/1425367294-27852-1-git-send-email-hekuang@huawei.com
Signed-off-by: He Kuang <hekuang@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It has come to my attention that this_cpu_read/write are horrible on
architectures other than x86. Worse yet, they actually disable
preemption or interrupts! This caused some unexpected tracing results
on ARM.
101.356868: preempt_count_add <-ring_buffer_lock_reserve
101.356870: preempt_count_sub <-ring_buffer_lock_reserve
The ring_buffer_lock_reserve has recursion protection that requires
accessing a per cpu variable. But since preempt_disable() is traced, it
too got traced while accessing the variable that is suppose to prevent
recursion like this.
The generic version of this_cpu_read() and write() are:
#define this_cpu_generic_read(pcp) \
({ typeof(pcp) ret__; \
preempt_disable(); \
ret__ = *this_cpu_ptr(&(pcp)); \
preempt_enable(); \
ret__; \
})
#define this_cpu_generic_to_op(pcp, val, op) \
do { \
unsigned long flags; \
raw_local_irq_save(flags); \
*__this_cpu_ptr(&(pcp)) op val; \
raw_local_irq_restore(flags); \
} while (0)
Which is unacceptable for locations that know they are within preempt
disabled or interrupt disabled locations.
Paul McKenney stated that __this_cpu_() versions produce much better code on
other architectures than this_cpu_() does, if we know that the call is done in
a preempt disabled location.
I also changed the recursive_unlock() to use two local variables instead
of accessing the per_cpu variable twice.
Link: http://lkml.kernel.org/r/20150317114411.GE3589@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20150317104038.312e73d1@gandalf.local.home
Cc: stable@vger.kernel.org
Acked-by: Christoph Lameter <cl@linux.com>
Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Tested-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some archs (specifically PowerPC), are sensitive with the ordering of
the enabling of the calls to function tracing and setting of the
function to use to be traced.
That is, update_ftrace_function() sets what function the ftrace_caller
trampoline should call. Some archs require this to be set before
calling ftrace_run_update_code().
Another bug was discovered, that ftrace_startup_sysctl() called
ftrace_run_update_code() directly. If the function the ftrace_caller
trampoline changes, then it will not be updated. Instead a call
to ftrace_startup_enable() should be called because it tests to see
if the callback changed since the code was disabled, and will
tell the arch to update appropriately. Most archs do not need this
notification, but PowerPC does.
The problem could be seen by the following commands:
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo function > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# cat /sys/kernel/debug/tracing/trace
The trace will show that function tracing was not active.
Cc: stable@vger.kernel.org # 2.6.27+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When ftrace is enabled globally through the proc interface, we must check if
ftrace_graph_active is set. If it is set, then we should also pass the
FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when
ftrace is disabled globally through the proc interface, we must check if
ftrace_graph_active is set. If it is set, then we should also pass the
FTRACE_STOP_FUNC_RET command to ftrace_run_update_code().
Consider the following situation.
# echo 0 > /proc/sys/kernel/ftrace_enabled
After this ftrace_enabled = 0.
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never
called.
# echo 1 > /proc/sys/kernel/ftrace_enabled
Now ftrace_enabled will be set to true, but still
ftrace_enable_ftrace_graph_caller() will not be called, which is not
desired.
Further if we execute the following after this:
# echo nop > /sys/kernel/debug/tracing/current_tracer
Now since ftrace_enabled is set it will call
ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on
the ARM platform.
On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called,
it checks whether the old instruction is a nop or not. If it's not a nop,
then it returns an error. If it is a nop then it replaces instruction at
that address with a branch to ftrace_graph_caller.
ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore,
if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller()
or ftrace_disable_ftrace_graph_caller() consecutively two times in a row,
then it will return an error, which will cause the generic ftrace code to
raise a warning.
Note, x86 does not have an issue with this because the architecture
specific code for ftrace_enable_ftrace_graph_caller() and
ftrace_disable_ftrace_graph_caller() does not check the previous state,
and calling either of these functions twice in a row has no ill effect.
Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com
Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Pratyush Anand <panand@redhat.com>
[
removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0
if CONFIG_FUNCTION_GRAPH_TRACER is not set.
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When /proc/sys/kernel/ftrace_enabled is set to zero, all function
tracing is disabled. But the records that represent the functions
still hold information about the ftrace_ops that are hooked to them.
ftrace_ops may request "REGS" (have a full set of pt_regs passed to
the callback), or "TRAMP" (the ops has its own trampoline to use).
When the record is updated to represent the state of the ops hooked
to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback
points to the correct trampoline (REGS has its own trampoline).
When ftrace_enabled is set to zero, all ftrace locations are a nop,
so they do not point to any trampoline. But the _EN flags are still
set. This can cause the accounting to go wrong when ftrace_enabled
is cleared and an ops that has a trampoline is registered or unregistered.
For example, the following will cause ftrace to crash:
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
# echo 0 > /proc/sys/kernel/ftrace_enabled
# echo nop > /sys/kernel/debug/tracing/current_tracer
# echo 1 > /proc/sys/kernel/ftrace_enabled
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
As function_graph uses a trampoline, when ftrace_enabled is set to zero
the updates to the record are not done. When enabling function_graph
again, the record will still have the TRAMP_EN flag set, and it will
look for an op that has a trampoline other than the function_graph
ops, and fail to find one.
Cc: stable@vger.kernel.org # 3.17+
Reported-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
printk and friends can now format bitmaps using '%*pb[l]'. cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways to
make trace events. Lots of features have been added since the sample
code was made, and these features are mostly unknown. Developers
have been making their own hacks to do things that are already available.
o Performance improvements. Most notably, I found a performance bug where
a waiter that is waiting for a full page from the ring buffer will
see that a full page is not available, and go to sleep. The sched
event caused by it going to sleep would cause it to wake up again.
It would see that there was still not a full page, and go back to sleep
again, and that would wake it up again, until finally it would see a
full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU3M+GAAoJEEjnJuOKh9ldpWQIAJTUzeVXlU0cf3bVn768VW7e
XS41WHF34l1tNevmKTh6fCPiw8+U0UMGLQt5WKtyaaARsZn2MlefLVuvHPKFlK2w
+qcI4OEVHH97Qgf9HWJSsYgnZaOnOE+TENqnokEgXMimRMuVcd/S4QaGxwJVDcjm
iBF5j2TaG4aGbx4a3J7KueoZ3K+39r3ut15hIGi/IZBZldQ1pt26ytafD/KA3CU3
BLRM2HLttAMsV1ds0EDLgZjSGICVetFcdOmI5Gwj7Qr3KrOTRPYJMNc8NdDL7Js9
v8VhujhFGvcCrhO/IKpVvd9yluz3RCF+Z7ihc+D/+1B3Nsm0PTwN3Fl5J+f89AA=
=u2Mm
-----END PGP SIGNATURE-----
Merge tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The updates included in this pull request for ftrace are:
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways
to make trace events. Lots of features have been added since the
sample code was made, and these features are mostly unknown.
Developers have been making their own hacks to do things that are
already available.
o Performance improvements. Most notably, I found a performance bug
where a waiter that is waiting for a full page from the ring buffer
will see that a full page is not available, and go to sleep. The
sched event caused by it going to sleep would cause it to wake up
again. It would see that there was still not a full page, and go
back to sleep again, and that would wake it up again, until finally
it would see a full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths"
* tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Do not wake up a splice waiter when page is not full
tracing: Fix unmapping loop in tracing_mark_write
tracing: Add samples of DECLARE_EVENT_CLASS() and DEFINE_EVENT()
tracing: Add TRACE_EVENT_FN example
tracing: Add TRACE_EVENT_CONDITION sample
tracing: Update the TRACE_EVENT fields available in the sample code
tracing: Separate out initializing top level dir from instances
tracing: Make tracing_init_dentry_tr() static
trace: Use 64-bit timekeeping
tracing: Add array printing helper
tracing: Remove newline from trace_printk warning banner
tracing: Use IS_ERR() check for return value of tracing_init_dentry()
tracing: Remove unneeded includes of debugfs.h and fs.h
tracing: Remove taking of trace_types_lock in pipe files
tracing: Add ref count to tracer for when they are being read by pipe
Pull s390 updates from Martin Schwidefsky:
- The remaining patches for the z13 machine support: kernel build
option for z13, the cache synonym avoidance, SMT support,
compare-and-delay for spinloops and the CES5S crypto adapater.
- The ftrace support for function tracing with the gcc hotpatch option.
This touches common code Makefiles, Steven is ok with the changes.
- The hypfs file system gets an extension to access diagnose 0x0c data
in user space for performance analysis for Linux running under z/VM.
- The iucv hvc console gets wildcard spport for the user id filtering.
- The cacheinfo code is converted to use the generic infrastructure.
- Cleanup and bug fixes.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
s390/process: free vx save area when releasing tasks
s390/hypfs: Eliminate hypfs interval
s390/hypfs: Add diagnose 0c support
s390/cacheinfo: don't use smp_processor_id() in preemptible context
s390/zcrypt: fixed domain scanning problem (again)
s390/smp: increase maximum value of NR_CPUS to 512
s390/jump label: use different nop instruction
s390/jump label: add sanity checks
s390/mm: correct missing space when reporting user process faults
s390/dasd: cleanup profiling
s390/dasd: add locking for global_profile access
s390/ftrace: hotpatch support for function tracing
ftrace: let notrace function attribute disable hotpatching if necessary
ftrace: allow architectures to specify ftrace compile options
s390: reintroduce diag 44 calls for cpu_relax()
s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
s390/zcrypt: Number of supported ap domains is not retrievable.
s390/spinlock: add compare-and-delay to lock wait loops
s390/tape: remove redundant if statement
s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
...
When an application connects to the ring buffer via splice, it can only
read full pages. Splice does not work with partial pages. If there is
not enough data to fill a page, the splice command will either block
or return -EAGAIN (if set to nonblock).
Code was added where if the page is not full, to just sleep again.
The problem is, it will get woken up again on the next event. That
is, when something is written into the ring buffer, if there is a waiter
it will wake it up. The waiter would then check the buffer, see that
it still does not have enough data to fill a page and go back to sleep.
To make matters worse, when the waiter goes back to sleep, it could
cause another event, which would wake it back up again to see it
doesn't have enough data and sleep again. This produces a tremendous
overhead and fills the ring buffer with noise.
For example, recording sched_switch on an idle system for 10 seconds
produces 25,350,475 events!!!
Create another wait queue for those waiters wanting full pages.
When an event is written, it only wakes up waiters if there's a full
page of data. It does not wake up the waiter if the page is not yet
full.
After this change, recording sched_switch on an idle system for 10
seconds produces only 800 events. Getting rid of 25,349,675 useless
events (99.9969% of events!!), is something to take seriously.
Cc: stable@vger.kernel.org # 3.16+
Cc: Rabin Vincent <rabin@rab.in>
Fixes: e30f53aad2 "tracing: Do not busy wait in buffer splice"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Rework of the core ACPI resources parsing code to fix issues
in it and make using resource offsets more convenient and
consolidation of some resource-handing code in a couple of places
that have grown analagous data structures and code to cover the
the same gap in the core (Jiang Liu, Thomas Gleixner, Lv Zheng).
- ACPI-based IOAPIC hotplug support on top of the resources handling
rework (Jiang Liu, Yinghai Lu).
- ACPICA update to upstream release 20150204 including an interrupt
handling rework that allows drivers to install raw handlers for
ACPI GPEs which then become entirely responsible for the given GPE
and the ACPICA core code won't touch it (Lv Zheng, David E Box,
Octavian Purdila).
- ACPI EC driver rework to fix several concurrency issues and other
problems related to events handling on top of the ACPICA's new
support for raw GPE handlers (Lv Zheng).
- New ACPI driver for AMD SoCs analogous to the LPSS (Low-Power
Subsystem) driver for Intel chips (Ken Xue).
- Two minor fixes of the ACPI LPSS driver (Heikki Krogerus,
Jarkko Nikula).
- Two new blacklist entries for machines (Samsung 730U3E/740U3E and
510R) where the native backlight interface doesn't work correctly
while the ACPI one does (Hans de Goede).
- Rework of the ACPI processor driver's handling of idle states
to make the code more straightforward and less bloated overall
(Rafael J Wysocki).
- Assorted minor fixes related to ACPI and SFI (Andreas Ruprecht,
Andy Shevchenko, Hanjun Guo, Jan Beulich, Rafael J Wysocki,
Yaowei Bai).
- PCI core power management modification to avoid resuming (some)
runtime-suspended devices during system suspend if they are in
the right states already (Rafael J Wysocki).
- New SFI-based cpufreq driver for Intel platforms using SFI
(Srinidhi Kasagar).
- cpufreq core fixes, cleanups and simplifications (Viresh Kumar,
Doug Anderson, Wolfram Sang).
- SkyLake CPU support and other updates for the intel_pstate driver
(Kristen Carlson Accardi, Srinivas Pandruvada).
- cpufreq-dt driver cleanup (Markus Elfring).
- Init fix for the ARM big.LITTLE cpuidle driver (Sudeep Holla).
- Generic power domains core code fixes and cleanups (Ulf Hansson).
- Operating Performance Points (OPP) core code cleanups and kernel
documentation update (Nishanth Menon).
- New dabugfs interface to make the list of PM QoS constraints
available to user space (Nishanth Menon).
- New devfreq driver for Tegra Activity Monitor (Tomeu Vizoso).
- New devfreq class (devfreq_event) to provide raw utilization data
to devfreq governors (Chanwoo Choi).
- Assorted minor fixes and cleanups related to power management
(Andreas Ruprecht, Krzysztof Kozlowski, Rickard Strandqvist,
Pavel Machek, Todd E Brandt, Wonhong Kwon).
- turbostat updates (Len Brown) and cpupower Makefile improvement
(Sriram Raghunathan).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJU2neOAAoJEILEb/54YlRx51QP/jrv1Wb5eMaemzMksPIWI5Zn
I8IbxzToxu7wDDsrTBRv+LuyllMPrnppFOHHvB35gUYu7Y6I066s3ErwuqeFlbmy
+VicmyGMahv3yN74qg49MXzWtaJZa8hrFXn8ItujiUIcs08yELi0vBQFlZImIbTB
PdQngO88VfiOVjDvmKkYUU//9Sc9LCU0ZcdUQXSnA1oNOxuUHjiARz98R03hhSqu
BWR+7M0uaFbu6XeK+BExMXJTpKicIBZ1GAF6hWrS8V4aYg+hH1cwjf2neDAzZkcU
UkXieJlLJrCq+ZBNcy7WEhkWQkqJNWei5WYiy6eoQeQpNoliY2V+2OtSMJaKqDye
PIiMwXstyDc5rgyULN0d1UUzY6mbcUt2rOL0VN2bsFVIJ1HWCq8mr8qq689pQUYv
tcH18VQ2/6r2zW28sTO/ByWLYomklD/Y6bw2onMhGx3Knl0D8xYJKapVnTGhr5eY
d4k41ybHSWNKfXsZxdJc+RxndhPwj9rFLfvY/CZEhLcW+2pAiMarRDOPXDoUI7/l
aJpmPzy/6mPXGBnTfr6jKDSY3gXNazRIvfPbAdiGayKcHcdRM4glbSbNH0/h1Iq6
HKa8v9Fx87k1X5r4ZbhiPdABWlxuKDiM7725rfGpvjlWC3GNFOq7YTVMOuuBA225
Mu9PRZbOsZsnyNkixBpX
=zZER
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
"We have a few new features this time, including a new SFI-based
cpufreq driver, a new devfreq driver for Tegra Activity Monitor, a new
devfreq class for providing its governors with raw utilization data
and a new ACPI driver for AMD SoCs.
Still, the majority of changes here are reworks of existing code to
make it more straightforward or to prepare it for implementing new
features on top of it. The primary example is the rework of ACPI
resources handling from Jiang Liu, Thomas Gleixner and Lv Zheng with
support for IOAPIC hotplug implemented on top of it, but there is
quite a number of changes of this kind in the cpufreq core, ACPICA,
ACPI EC driver, ACPI processor driver and the generic power domains
core code too.
The most active developer is Viresh Kumar with his cpufreq changes.
Specifics:
- Rework of the core ACPI resources parsing code to fix issues in it
and make using resource offsets more convenient and consolidation
of some resource-handing code in a couple of places that have grown
analagous data structures and code to cover the the same gap in the
core (Jiang Liu, Thomas Gleixner, Lv Zheng).
- ACPI-based IOAPIC hotplug support on top of the resources handling
rework (Jiang Liu, Yinghai Lu).
- ACPICA update to upstream release 20150204 including an interrupt
handling rework that allows drivers to install raw handlers for
ACPI GPEs which then become entirely responsible for the given GPE
and the ACPICA core code won't touch it (Lv Zheng, David E Box,
Octavian Purdila).
- ACPI EC driver rework to fix several concurrency issues and other
problems related to events handling on top of the ACPICA's new
support for raw GPE handlers (Lv Zheng).
- New ACPI driver for AMD SoCs analogous to the LPSS (Low-Power
Subsystem) driver for Intel chips (Ken Xue).
- Two minor fixes of the ACPI LPSS driver (Heikki Krogerus, Jarkko
Nikula).
- Two new blacklist entries for machines (Samsung 730U3E/740U3E and
510R) where the native backlight interface doesn't work correctly
while the ACPI one does (Hans de Goede).
- Rework of the ACPI processor driver's handling of idle states to
make the code more straightforward and less bloated overall (Rafael
J Wysocki).
- Assorted minor fixes related to ACPI and SFI (Andreas Ruprecht,
Andy Shevchenko, Hanjun Guo, Jan Beulich, Rafael J Wysocki, Yaowei
Bai).
- PCI core power management modification to avoid resuming (some)
runtime-suspended devices during system suspend if they are in the
right states already (Rafael J Wysocki).
- New SFI-based cpufreq driver for Intel platforms using SFI
(Srinidhi Kasagar).
- cpufreq core fixes, cleanups and simplifications (Viresh Kumar,
Doug Anderson, Wolfram Sang).
- SkyLake CPU support and other updates for the intel_pstate driver
(Kristen Carlson Accardi, Srinivas Pandruvada).
- cpufreq-dt driver cleanup (Markus Elfring).
- Init fix for the ARM big.LITTLE cpuidle driver (Sudeep Holla).
- Generic power domains core code fixes and cleanups (Ulf Hansson).
- Operating Performance Points (OPP) core code cleanups and kernel
documentation update (Nishanth Menon).
- New dabugfs interface to make the list of PM QoS constraints
available to user space (Nishanth Menon).
- New devfreq driver for Tegra Activity Monitor (Tomeu Vizoso).
- New devfreq class (devfreq_event) to provide raw utilization data
to devfreq governors (Chanwoo Choi).
- Assorted minor fixes and cleanups related to power management
(Andreas Ruprecht, Krzysztof Kozlowski, Rickard Strandqvist, Pavel
Machek, Todd E Brandt, Wonhong Kwon).
- turbostat updates (Len Brown) and cpupower Makefile improvement
(Sriram Raghunathan)"
* tag 'pm+acpi-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (151 commits)
tools/power turbostat: relax dependency on APERF_MSR
tools/power turbostat: relax dependency on invariant TSC
Merge branch 'pci/host-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into acpi-resources
tools/power turbostat: decode MSR_*_PERF_LIMIT_REASONS
tools/power turbostat: relax dependency on root permission
ACPI / video: Add disable_native_backlight quirk for Samsung 510R
ACPI / PM: Remove unneeded nested #ifdef
USB / PM: Remove unneeded #ifdef and associated dead code
intel_pstate: provide option to only use intel_pstate with HWP
ACPI / EC: Add GPE reference counting debugging messages
ACPI / EC: Add query flushing support
ACPI / EC: Refine command storm prevention support
ACPI / EC: Add command flushing support.
ACPI / EC: Introduce STARTED/STOPPED flags to replace BLOCKED flag
ACPI: add AMD ACPI2Platform device support for x86 system
ACPI / table: remove duplicate NULL check for the handler of acpi_table_parse()
ACPI / EC: Update revision due to raw handler mode.
ACPI / EC: Reduce ec_poll() by referencing the last register access timestamp.
ACPI / EC: Fix several GPE handling issues by deploying ACPI_GPE_DISPATCH_RAW_HANDLER mode.
ACPICA: Events: Enable APIs to allow interrupt/polling adaptive request based GPE handling model
...
Commit 6edb2a8a38 introduced
an array map_pages that contains the addresses returned by
kmap_atomic. However, when unmapping those pages, map_pages[0]
is unmapped before map_pages[1], breaking the nesting requirement
as specified in the documentation for kmap_atomic/kunmap_atomic.
This was caught by the highmem debug code present in kunmap_atomic.
Fix the loop to do the unmapping properly.
Link: http://lkml.kernel.org/r/1418871056-6614-1-git-send-email-markivx@codeaurora.org
Cc: stable@vger.kernel.org # 3.5+
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Reported-by: Lime Yang <limey@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing "instances" directory can create sub tracing buffers
with mkdir, and remove them with rmdir. As a mkdir will also create
all the files and directories that control the sub buffer the inode
mutexes need to be released before this is done, to avoid deadlocks.
It is better to let the tracing system unlock the inode mutexes before
calling the functions that create the files within the new directory
(or deletes the files from the one being destroyed).
Now that tracing has been converted over to tracefs, the tracefs file
system can be modified to accommodate this feature. It still releases
the locks, but the filesystem itself can take care of the ugly
business and let the user just do what it needs.
The tracing system now attaches a descriptor to the directory dentry
that can have userspace create or remove sub directories. If this
descriptor does not exist for a dentry, then that dentry can not be
used to create other directories. This descriptor holds a mkdir and
rmdir method that only takes a character string as an argument.
The tracefs file system will first make a copy of the dentry name
before releasing the locks. Then it will pass the copied name to the
methods. It is up to the tracing system that supplied the methods to
handle races with duplicate names and such as all the inode mutexes
would be released when the functions are called.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As tools currently rely on the tracing directory in debugfs, we can not
just created a tracefs infrastructure and expect sysadmins to mount
the new tracefs to have their old tools work.
Instead, the debugfs tracing directory is still created and the tracefs
file system is mounted there when the debugfs filesystem is mounted.
No longer does the tracing infrastructure update the debugfs file system,
but instead interacts with the tracefs file system. But now, it still
appears to the user like nothing changed, except you also have the feature
of mounting just the tracing system without needing all of debugfs!
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
debugfs was fine for the tracing facility as a quick way to get
an interface. Now that tracing has matured, it should separate itself
from debugfs such that it can be mounted separately without needing
to mount all of debugfs with it. That is, users resist using tracing
because it requires mounting debugfs. Having tracing have its own file
system lets users get the features of tracing without needing to bring
in the rest of the kernel's debug infrastructure.
Another reason for tracefs is that debubfs does not support mkdir.
Currently, to create instances, one does a mkdir in the tracing/instance
directory. This is implemented via a hack that forces debugfs to do
something it is not intended on doing. By converting over to tracefs, this
hack can be removed and mkdir can be properly implemented. This patch does
not address this yet, but it lays the ground work for that to be done.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The options for cmdline tracers are not created if the debugfs system
is not ready yet. If tracing has started before debugfs is up, then the
option files for the tracer are not created. Create them when creating
the tracing directory if the current tracer requires option files.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Do not bother creating tracer options if no tracing directory
exists. If a tracer is enabled via the command line, and is
started before the tracing directory is created, then it wont have
its tracer specific options created.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull in Al Viro's changes to debugfs that implement the new primitive:
debugfs_create_automount(), that creates a directory in debugfs that will
safely mount another file system automatically when debugfs is mounted.
This will let tracefs automount itself on top of debugfs/tracing directory.
The top level trace array is treated a little different than the
instances, as it has to deal with more of the general tracing.
The tr->dir is the tracing directory, which is an immutable
dentry, where as the tr->dir of instances are the dentry that
was created, and can be destroyed later. These should have different
functions accessing them.
As only tracing_init_dentry() deals with the top level array, fold
the code for it into that function, and remove the trace_init_dentry_tr()
that was also used by the instances to get their directory dentry.
Add a tracing_get_dentry() to just get the tracing dir entry for
instances as well as the top level array.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Export the suspend_resume tracepoint so it can be used
in loadable modules.
Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the kernel is compiled with function tracer support the -pg compile option
is passed to gcc to generate extra code into the prologue of each function.
This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE
makefile variable which architectures can override if a different option
should be used for code generation.
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The ring_buffer_producer uses 'struct timeval' to measure
its start and end times. 'struct timeval' on 32-bit systems
will have its tv_sec value overflow in year 2038 and beyond.
This patch replaces struct timeval with 'ktime_t' which uses
64-bit representation for nanoseconds.
Link: http://lkml.kernel.org/r/20150128141611.GA2701@tinar
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tina Ruchandani <ruchandani.tina@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a trace event contains an array, there is currently no standard
way to format this for text output. Drivers are currently hacking
around this by a) local hacks that use the trace_seq functionailty
directly, or b) just not printing that information. For fixed size
arrays, formatting of the elements can be open-coded, but this gets
cumbersome for arrays of non-trivial size.
These approaches result in non-standard content of the event format
description delivered to userspace, so userland tools needs to be
taught to understand and parse each array printing method
individually.
This patch implements a __print_array() helper that tracepoint
implementations can use instead of reinventing it. A simple C-style
syntax is used to delimit the array and its elements {like,this}.
So that the helpers can be used with large static arrays as well as
dynamic arrays, they take a pointer and element count: they can be
used with __get_dynamic_array() for use with dynamic arrays.
Link: http://lkml.kernel.org/r/1422449335-8289-2-git-send-email-javi.merino@arm.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove the output-confusing newline below:
[ 0.191328]
**********************************************************
[ 0.191493] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 0.191586] ** **
...
Link: http://lkml.kernel.org/r/1422375440-31970-1-git-send-email-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
[ added an extra '\n' by itself, to keep what it was suppose to do ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_init_dentry() will soon return NULL as a valid pointer for the
top level tracing directroy. NULL can not be used as an error value.
Instead, switch to ERR_PTR() and check the return status with
IS_ERR().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The creation of tracing files and directories is for the most part
encapsulated in helper functions in trace.c. Other files do not need to
include debugfs.h or fs.h, as they may have needed to in the past.
Remove them from the files that do not need them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
the mixture of function graph tracing and kprobes.
When jprobes and function graph tracing is enabled at the same time
it will crash the system.
# modprobe jprobe_example
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
After the first fork (jprobe_example probes it), the system will crash.
This is due to the way jprobes copies the stack frame and does not
do a normal function return. This messes up with the function graph
tracing accounting which hijacks the return address from the stack
and replaces it with a hook function. It saves the return addresses in
a separate stack to put back the correct return address when done.
But because the jprobe functions do not do a normal return, their
stack addresses are not put back until the function they probe is called,
which means that the probed function will get the return address of
the jprobe handler instead of its own.
The simple fix here was to disable function graph tracing while the
jprobe handler is being called.
While debugging this I found two minor bugs with the function graph
tracing.
The first was about the function graph tracer sharing its function hash
with the function tracer (they both get filtered by the same input).
The changing of the set_ftrace_filter would not sync the function recording
records after a change if the function tracer was disabled but the
function graph tracer was enabled. This was due to the update only checking
one of the ops instead of the shared ops to see if they were enabled and
should perform the sync. This caused the ftrace accounting to break and
a ftrace_bug() would be triggered, disabling ftrace until a reboot.
The second was that the check to update records only checked one of the
filter hashes. It needs to test both the "filter" and "notrace" hashes.
The "filter" hash determines what functions to trace where as the "notrace"
hash determines what functions not to trace (trace all but these).
Both hashes need to be passed to the update code to find out what change
is being done during the update. This also broke the ftrace record
accounting and triggered a ftrace_bug().
This patch set also include two more fixes that were reported separately
from the kprobe issue.
One was that init_ftrace_syscalls() was called twice at boot up.
This is not a major bug, but that call performed a rather large kmalloc
(NR_syscalls * sizeof(*syscalls_metadata)). The second call made the first
one a memory leak, and wastes memory.
The other fix is a regression caused by an update in the v3.19 merge window.
The moving to enable events early, moved the enabling before PID 1 was
created. The syscall events require setting the TIF_SYSCALL_TRACEPOINT
for all tasks. But for_each_process_thread() does not include the swapper
task (PID 0), and ended up being a nop. A suggested fix was to add
the init_task() to have its flag set, but I didn't really want to mess
with PID 0 for this minor bug. Instead I disable and re-enable events again
at early_initcall() where it use to be enabled. This also handles any other
event that might have its own reg function that could break at early
boot up.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUt9vmAAoJEEjnJuOKh9ldLHEIAJ9XrPW2xMIY5yI69jT1F7pv
PkSRqENnOK0l4UulD52SvIBecQTTBcEEjao4yVGkc7DCJBOws/1LZ5gW8OfNlKjq
rMB8yaosL1tXJ1ARVPMjcQVy+228zkgTXznwEZCjku1g7LuScQ28qyXsXO7B6yiK
xKoHqKjygmM/a2aVn+8tdiVKiDp6jdmkbYicbaFT4xP7XB5DaMmIiXRHxdvW6xdR
azKrVfYiMyJqTZNt/EVSWUk2WjeaYhoXyNtvgPx515wTo/llCnzhjcsocXBtH2P/
YOtwl+1L7Z89ukV9oXqrtrUJZ6Ps7+g7I1flJuL7/1FlNGnklcP9JojD+t6HeT8=
=vkec
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fixes from Steven Rostedt:
"This holds a few fixes to the ftrace infrastructure as well as the
mixture of function graph tracing and kprobes.
When jprobes and function graph tracing is enabled at the same time it
will crash the system:
# modprobe jprobe_example
# echo function_graph > /sys/kernel/debug/tracing/current_tracer
After the first fork (jprobe_example probes it), the system will
crash.
This is due to the way jprobes copies the stack frame and does not do
a normal function return. This messes up with the function graph
tracing accounting which hijacks the return address from the stack and
replaces it with a hook function. It saves the return addresses in a
separate stack to put back the correct return address when done. But
because the jprobe functions do not do a normal return, their stack
addresses are not put back until the function they probe is called,
which means that the probed function will get the return address of
the jprobe handler instead of its own.
The simple fix here was to disable function graph tracing while the
jprobe handler is being called.
While debugging this I found two minor bugs with the function graph
tracing.
The first was about the function graph tracer sharing its function
hash with the function tracer (they both get filtered by the same
input). The changing of the set_ftrace_filter would not sync the
function recording records after a change if the function tracer was
disabled but the function graph tracer was enabled. This was due to
the update only checking one of the ops instead of the shared ops to
see if they were enabled and should perform the sync. This caused the
ftrace accounting to break and a ftrace_bug() would be triggered,
disabling ftrace until a reboot.
The second was that the check to update records only checked one of
the filter hashes. It needs to test both the "filter" and "notrace"
hashes. The "filter" hash determines what functions to trace where as
the "notrace" hash determines what functions not to trace (trace all
but these). Both hashes need to be passed to the update code to find
out what change is being done during the update. This also broke the
ftrace record accounting and triggered a ftrace_bug().
This patch set also include two more fixes that were reported
separately from the kprobe issue.
One was that init_ftrace_syscalls() was called twice at boot up. This
is not a major bug, but that call performed a rather large kmalloc
(NR_syscalls * sizeof(*syscalls_metadata)). The second call made the
first one a memory leak, and wastes memory.
The other fix is a regression caused by an update in the v3.19 merge
window. The moving to enable events early, moved the enabling before
PID 1 was created. The syscall events require setting the
TIF_SYSCALL_TRACEPOINT for all tasks. But for_each_process_thread()
does not include the swapper task (PID 0), and ended up being a nop.
A suggested fix was to add the init_task() to have its flag set, but I
didn't really want to mess with PID 0 for this minor bug. Instead I
disable and re-enable events again at early_initcall() where it use to
be enabled. This also handles any other event that might have its own
reg function that could break at early boot up"
* tag 'trace-fixes-v3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix enabling of syscall events on the command line
tracing: Remove extra call to init_ftrace_syscalls()
ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
ftrace: Check both notrace and filter for old hash
ftrace: Fix updating of filters for shared global_ops filters
Commit 5f893b2639 "tracing: Move enabling tracepoints to just after
rcu_init()" broke the enabling of system call events from the command
line. The reason was that the enabling of command line trace events
was moved before PID 1 started, and the syscall tracepoints require
that all tasks have the TIF_SYSCALL_TRACEPOINT flag set. But the
swapper task (pid 0) is not part of that. Since the swapper task is the
only task that is running at this early in boot, no task gets the
flag set, and the tracepoint never gets reached.
Instead of setting the swapper task flag (there should be no reason to
do that), re-enabled trace events again after the init thread (PID 1)
has been started. It requires disabling all command line events and
re-enabling them, as just enabling them again will not reset the logic
to set the TIF_SYSCALL_TRACEPOINT flag, as the syscall tracepoint will
be fooled into thinking that it was already set, and wont try setting
it again. For this reason, we must first disable it and re-enable it.
Link: http://lkml.kernel.org/r/1421188517-18312-1-git-send-email-mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20150115040506.216066449@goodmis.org
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_init() calls init_ftrace_syscalls() and then calls trace_event_init()
which also calls init_ftrace_syscalls(). It makes more sense to only
call it from trace_event_init().
Calling it twice wastes memory, as it allocates the syscall events twice,
and loses the first copy of it.
Link: http://lkml.kernel.org/r/54AF53BD.5070303@huawei.com
Link: http://lkml.kernel.org/r/20150115040505.930398632@goodmis.org
Reported-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using just the filter for checking for trampolines or regs is not enough
when updating the code against the records that represent all functions.
Both the filter hash and the notrace hash need to be checked.
To trigger this bug (using trace-cmd and perf):
# perf probe -a do_fork
# trace-cmd start -B foo -e probe
# trace-cmd record -p function_graph -n do_fork sleep 1
The trace-cmd record at the end clears the filter before it disables
function_graph tracing and then that causes the accounting of the
ftrace function records to become incorrect and causes ftrace to bug.
Link: http://lkml.kernel.org/r/20150114154329.358378039@goodmis.org
Cc: stable@vger.kernel.org
[ still need to switch old_hash_ops to old_ops_hash ]
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the set_ftrace_filter affects both the function tracer as well as the
function graph tracer, the ops that represent each have a shared
ftrace_ops_hash structure. This allows both to be updated when the filter
files are updated.
But if function graph is enabled and the global_ops (function tracing) ops
is not, then it is possible that the filter could be changed without the
update happening for the function graph ops. This will cause the changes
to not take place and may even cause a ftrace_bug to occur as it could mess
with the trampoline accounting.
The solution is to check if the ops uses the shared global_ops filter and
if the ops itself is not enabled, to check if there's another ops that is
enabled and also shares the global_ops filter. In that case, the
modification still needs to be executed.
Link: http://lkml.kernel.org/r/20150114154329.055980438@goodmis.org
Cc: stable@vger.kernel.org # 3.17+
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both Linus (most recent) and Steve (a while ago) reported that perf
related callbacks have massive stack bloat.
The problem is that software events need a pt_regs in order to
properly report the event location and unwind stack. And because we
could not assume one was present we allocated one on stack and filled
it with minimal bits required for operation.
Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.
This patch addresses the problem by observing that software events
have well defined nesting semantics, therefore we can use static
per-cpu storage instead of on-stack.
Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.
We have a scheduler specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to assertain which __perf_regs
element to use).
One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Javi Merino <javi.merino@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cleanups
kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE
Fixes
kgdb/kdb: Allow access on a single core, if a CPU round up is deemed
impossible, which will allow inspection of the now "trashed" kernel
kdb: Add enable mask for the command groups
kdb: access controls to restrict sensitive commands
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUrq8WAAoJEIciOldedpOj+C8P/AjSUVBZdBLWzCU2VG150sQ0
UacwFVLve9heoColHBF7VqIDCRkZokIKJmCbHUBPZTbs22auLRpNI+D6CY5lZD17
jEHxrkKY4ragRRc/W3Y1MSc3aeGnS0i5AR8PJermMWxyUBfN3FBxgFHzTaLB2ZTT
8A+tvmwiG4mHue52gSiYZPCl/52WWOh+NjDe7T9OZ+mNmQKwZ5ssQZmmyUkxrs3b
LKXVXVtTUXxfEgB2x+lYTYAztcTsM5h+NbkT74FpSmwPjvU/p81Ptqveh+3JTdmX
H+Jz/SqD1/NfxC1Eenh5Mc++p/UVxeRbBulV9jwqjOyJqDjw3qHs1cjm8tZZj1qG
J3LODKi3GWhujMCfwdu5EJRnrFxgHCPiWInc2708oLbRi5SyOe6P6hNQ3K3Y4JtF
VkYa62wSaI0fDNQUFRc3bXUOUdMOCXjuzw3BtTi93tcUNcQwCXuYCmWtVvBgmK1h
LTrFCJmzbopiwpomxCwZ4BQm8id9HxP5pod95ypYb8K5aheXHCuSgibqj0nswWMm
ix0YTd4UNTn79r6p4d0fXFjOOYpXZA80ojeVI27D9zW7dBYc5CGVA1IDNH0ZfiPo
qySPUNUMXIjiTSOGZdUehByEC7tliLZczelRPnNh/9fmhJkJ745S7zs3DNQ7Ypg4
xDKthlRGNjn6cXOPl7gX
=cf1c
-----END PGP SIGNATURE-----
Merge tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull kgdb/kdb fixes from Jason Wessel:
"These have been around since 3.17 and in kgdb-next for the last 9
weeks and some will go back to -stable.
Summary of changes:
Cleanups
- kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE
Fixes
- kgdb/kdb: Allow access on a single core, if a CPU round up is
deemed impossible, which will allow inspection of the now "trashed"
kernel
- kdb: Add enable mask for the command groups
- kdb: access controls to restrict sensitive commands"
* tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kernel/debug/debug_core.c: Logging clean-up
kgdb: timeout if secondary CPUs ignore the roundup
kdb: Allow access to sensitive commands to be restricted by default
kdb: Add enable mask for groups of commands
kdb: Categorize kdb commands (similar to SysRq categorization)
kdb: Remove KDB_REPEAT_NONE flag
kdb: Use KDB_REPEAT_* values as flags
kdb: Rename kdb_register_repeat() to kdb_register_flags()
kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags
kdb: Remove currently unused kdbtab_t->cmd_flags
Taking the global mutex "trace_types_lock" in the trace_pipe files
causes a bottle neck as most the pipe files can be read per cpu
and there's no reason to serialize them.
The current_trace variable was given a ref count and it can not
change when the ref count is not zero. Opening the trace_pipe
files will up the ref count (and decremented on close), so that
the lock no longer needs to be taken when accessing the
current_trace variable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When one of the trace pipe files are being read (by either the trace_pipe
or trace_pipe_raw), do not allow the current_trace to change. By adding
a ref count that is incremented when the pipe files are opened, will
prevent the current_trace from being changed.
This will allow for the removal of the global trace_types_lock from
reading the pipe buffers (which is currently a bottle neck).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Fix a regression in leds-gpio introduced by a recent commit that
inadvertently changed the name of one of the properties used by
the driver (Fabio Estevam).
- Fix a regression in the ACPI backlight driver introduced by a
recent fix that missed one special case that had to be taken
into account (Aaron Lu).
- Drop the level of some new kernel messages from the ACPI core
introduced by a recent commit to KERN_DEBUG which they should
have used from the start and drop some other unuseful KERN_ERR
messages printed by ACPI (Rafael J Wysocki).
- Revert an incorrect commit modifying the cpupower tool
(Prarit Bhargava).
- Fix two regressions introduced by recent commits in the OPP
library and clean up some existing minor issues in that code
(Viresh Kumar).
- Continue to replace CONFIG_PM_RUNTIME with CONFIG_PM throughout
the tree (or drop it where that can be done) in order to make
it possible to eliminate CONFIG_PM_RUNTIME (Rafael J Wysocki,
Ulf Hansson, Ludovic Desroches). There will be one more
"CONFIG_PM_RUNTIME removal" batch after this one, because some
new uses of it have been introduced during the current merge
window, but that should be sufficient to finally get rid of it.
- Make the ACPI EC driver more robust against race conditions
related to GPE handler installation failures (Lv Zheng).
- Prevent the ACPI device PM core code from attempting to
disable GPEs that it has not enabled which confuses ACPICA
and makes it report errors unnecessarily (Rafael J Wysocki).
- Add a "force" command line switch to the intel_pstate driver
to make it possible to override the blacklisting of some
systems in that driver if needed (Ethan Zhao).
- Improve intel_pstate code documentation and add a MAINTAINERS
entry for it (Kristen Carlson Accardi).
- Make the ACPI fan driver create cooling device interfaces
witn names that reflect the IDs of the ACPI device objects
they are associated with, except for "generic" ACPI fans
(PNP ID "PNP0C0B"). That's necessary for user space thermal
management tools to be able to connect the fans with the
parts of the system they are supposed to be cooling properly.
From Srinivas Pandruvada.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJUk0IDAAoJEILEb/54YlRx7fgP/3+yF/0TnEW93j2ALDAQFiLF
tSv2A2vQC8vtMJjjWx0z/HqPh86gfaReEFZmUJD/Q/e2LXEnxNZJ+QMjcekPVkDM
mTvcIMc2MR8vOA/oMkgxeaKregrrx7RkCfojd+NWZhVukkjl+mvBHgAnYjXRL+NZ
unDWGlbHG97vq/3kGjPYhDS00nxHblw8NHFBu5HL5RxwABdWoeZJITwqxXWyuPLw
nlqNWlOxmwvtSbw2VMKz0uof1nFHyQLykYsMG0ZsyayCRdWUZYkEqmE7GGpCLkLu
D6yfmlpen6ccIOsEAae0eXBt50IFY9Tihk5lovx1mZmci2SNRg29BqMI105wIn0u
8b8Ej7MNHp7yMxRpB5WfU90p/y7ioJns9guFZxY0CKaRnrI2+BLt3RscMi3MPI06
Cu2/WkSSa09fhDPA+pk+VDYsmWgyVawigesNmMP5/cvYO/yYywVRjOuO1k77qQGp
4dSpFYEHfpxinejZnVZOk2V9MkvSLoSMux6wPV0xM0IE1iD0ulVpHjTJrwp80ph4
+bfUFVr/vrD1y7EKbf1PD363ZKvJhWhvQWDgETsM1vgLf21PfWO7C2kflIAsWsdQ
1ukD5nCBRlP4K73hG7bdM6kRztXhUdR0SHg85/t0KB/ExiVqtcXIzB60D0G1lENd
QlKbq3O4lim1WGuhazQY
=5fo2
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
"These are regression fixes (leds-gpio, ACPI backlight driver,
operating performance points library, ACPI device enumeration
messages, cpupower tool), other bug fixes (ACPI EC driver, ACPI device
PM), some cleanups in the operating performance points (OPP)
framework, continuation of CONFIG_PM_RUNTIME elimination, a couple of
minor intel_pstate driver changes, a new MAINTAINERS entry for it and
an ACPI fan driver change needed for better support of thermal
management in user space.
Specifics:
- Fix a regression in leds-gpio introduced by a recent commit that
inadvertently changed the name of one of the properties used by the
driver (Fabio Estevam).
- Fix a regression in the ACPI backlight driver introduced by a
recent fix that missed one special case that had to be taken into
account (Aaron Lu).
- Drop the level of some new kernel messages from the ACPI core
introduced by a recent commit to KERN_DEBUG which they should have
used from the start and drop some other unuseful KERN_ERR messages
printed by ACPI (Rafael J Wysocki).
- Revert an incorrect commit modifying the cpupower tool (Prarit
Bhargava).
- Fix two regressions introduced by recent commits in the OPP library
and clean up some existing minor issues in that code (Viresh
Kumar).
- Continue to replace CONFIG_PM_RUNTIME with CONFIG_PM throughout the
tree (or drop it where that can be done) in order to make it
possible to eliminate CONFIG_PM_RUNTIME (Rafael J Wysocki, Ulf
Hansson, Ludovic Desroches).
There will be one more "CONFIG_PM_RUNTIME removal" batch after this
one, because some new uses of it have been introduced during the
current merge window, but that should be sufficient to finally get
rid of it.
- Make the ACPI EC driver more robust against race conditions related
to GPE handler installation failures (Lv Zheng).
- Prevent the ACPI device PM core code from attempting to disable
GPEs that it has not enabled which confuses ACPICA and makes it
report errors unnecessarily (Rafael J Wysocki).
- Add a "force" command line switch to the intel_pstate driver to
make it possible to override the blacklisting of some systems in
that driver if needed (Ethan Zhao).
- Improve intel_pstate code documentation and add a MAINTAINERS entry
for it (Kristen Carlson Accardi).
- Make the ACPI fan driver create cooling device interfaces witn
names that reflect the IDs of the ACPI device objects they are
associated with, except for "generic" ACPI fans (PNP ID "PNP0C0B").
That's necessary for user space thermal management tools to be able
to connect the fans with the parts of the system they are supposed
to be cooling properly. From Srinivas Pandruvada"
* tag 'pm+acpi-3.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
MAINTAINERS: add entry for intel_pstate
ACPI / video: update the skip case for acpi_video_device_in_dod()
power / PM: Eliminate CONFIG_PM_RUNTIME
NFC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
SCSI / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
ACPI / EC: Fix unexpected ec_remove_handlers() invocations
Revert "tools: cpupower: fix return checks for sysfs_get_idlestate_count()"
tracing / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
x86 / PM: Replace CONFIG_PM_RUNTIME in io_apic.c
PM: Remove the SET_PM_RUNTIME_PM_OPS() macro
mmc: atmel-mci: use SET_RUNTIME_PM_OPS() macro
PM / Kconfig: Replace PM_RUNTIME with PM in dependencies
ARM / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
sound / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
phy / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
video / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
tty / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM
spi: Replace CONFIG_PM_RUNTIME with CONFIG_PM
ACPI / PM: Do not disable wakeup GPEs that have not been enabled
ACPI / utils: Drop error messages from acpi_evaluate_reference()
...
as I thought it might be. I'm pushing this in now.
This will allow Thomas to debug his irq work for 3.20.
This adds two new features:
1) Allow traceopoints to be enabled right after mm_init(). By passing
in the trace_event= kernel command line parameter, tracepoints can be
enabled at boot up. For debugging things like the initialization of
interrupts, it is needed to have tracepoints enabled very early. People
have asked about this before and this has been on my todo list. As it
can be helpful for Thomas to debug his upcoming 3.20 IRQ work, I'm
pushing this now. This way he can add tracepoints into the IRQ set up
and have users enable them when things go wrong.
2) Have the tracepoints printed via printk() (the console) when they
are triggered. If the irq code locks up or reboots the box, having the
tracepoint output go into the kernel ring buffer is useless for
debugging. But being able to add the tp_printk kernel command line
option along with the trace_event= option will have these tracepoints
printed as they occur, and that can be really useful for debugging
early lock up or reboot problems.
This code is not that intrusive and it passed all my tests. Thomas tried
them out too and it works for his needs.
Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUjv3kAAoJEEjnJuOKh9ldLNsIANAe5EmDCBw0WjR72n+G3qOH
NC8calXfkjqHU0bv8Q3dRv20KH4MHOy6l4+EiV9/ovt71LOF3NEyUJ3HuShf9a8b
sWcUhYbX3D1hViQe5sOzv9AWhBCFlKQGoNmQnydX9xa8ivRsBaTGJIGktWlHcwBE
jF1i3fj3l3vRQSS8qZFXp3bzreunlGyPoSHcT6eWQeos+utj4sKwQWTLXTLQeM+6
oQtFKRx7E5yX04qO1qFczS8qIEC6JH2C2jIRYEKUGepaELlnGkb8O7jQV/RaLF4/
6P8VhZFG9YLS7fn7vWu0SnAN+Zwz5LzgjXAZt0FhGtIhLc18Oj8ouHH1UORsdQM=
=Z4Un
-----END PGP SIGNATURE-----
Merge tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"As the merge window is still open, and this code was not as complex as
I thought it might be. I'm pushing this in now.
This will allow Thomas to debug his irq work for 3.20.
This adds two new features:
1) Allow traceopoints to be enabled right after mm_init().
By passing in the trace_event= kernel command line parameter,
tracepoints can be enabled at boot up. For debugging things like
the initialization of interrupts, it is needed to have tracepoints
enabled very early. People have asked about this before and this
has been on my todo list. As it can be helpful for Thomas to debug
his upcoming 3.20 IRQ work, I'm pushing this now. This way he can
add tracepoints into the IRQ set up and have users enable them when
things go wrong.
2) Have the tracepoints printed via printk() (the console) when they
are triggered.
If the irq code locks up or reboots the box, having the tracepoint
output go into the kernel ring buffer is useless for debugging.
But being able to add the tp_printk kernel command line option
along with the trace_event= option will have these tracepoints
printed as they occur, and that can be really useful for debugging
early lock up or reboot problems.
This code is not that intrusive and it passed all my tests. Thomas
tried them out too and it works for his needs.
Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org"
* tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add tp_printk cmdline to have tracepoints go to printk()
tracing: Move enabling tracepoints to just after rcu_init()
Add the kernel command line tp_printk option that will have tracepoints
that are active sent to printk() as well as to the trace buffer.
Passing "tp_printk" will activate this. To turn it off, the sysctl
/proc/sys/kernel/tracepoint_printk can have '0' echoed into it. Note,
this only works if the cmdline option is used. Echoing 1 into the sysctl
file without the cmdline option will have no affect.
Note, this is a dangerous option. Having high frequency tracepoints send
their data to printk() can possibly cause a live lock. This is another
reason why this is only active if the command line option is used.
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Enabling tracepoints at boot up can be very useful. The tracepoint
can be initialized right after RCU has been. There's no need to
wait for the early_initcall() to be called. That's too late for some
things that can use tracepoints for debugging. Move the logic to
enable tracepoints out of the initcalls and into init/main.c to
right after rcu_init().
This also allows trace_printk() to be used early too.
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos
Link: http://lkml.kernel.org/r/20141214164104.307127356@goodmis.org
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull block driver core update from Jens Axboe:
"This is the pull request for the core block IO changes for 3.19. Not
a huge round this time, mostly lots of little good fixes:
- Fix a bug in sysfs blktrace interface causing a NULL pointer
dereference, when enabled/disabled through that API. From Arianna
Avanzini.
- Various updates/fixes/improvements for blk-mq:
- A set of updates from Bart, mostly fixing buts in the tag
handling.
- Cleanup/code consolidation from Christoph.
- Extend queue_rq API to be able to handle batching issues of IO
requests. NVMe will utilize this shortly. From me.
- A few tag and request handling updates from me.
- Cleanup of the preempt handling for running queues from Paolo.
- Prevent running of unmapped hardware queues from Ming Lei.
- Move the kdump memory limiting check to be in the correct
location, from Shaohua.
- Initialize all software queues at init time from Takashi. This
prevents a kobject warning when CPUs are brought online that
weren't online when a queue was registered.
- Single writeback fix for I_DIRTY clearing from Tejun. Queued with
the core IO changes, since it's just a single fix.
- Version X of the __bio_add_page() segment addition retry from
Maurizio. Hope the Xth time is the charm.
- Documentation fixup for IO scheduler merging from Jan.
- Introduce (and use) generic IO stat accounting helpers for non-rq
drivers, from Gu Zheng.
- Kill off artificial limiting of max sectors in a request from
Christoph"
* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
bio: modify __bio_add_page() to accept pages that don't start a new segment
blk-mq: Fix uninitialized kobject at CPU hotplugging
blktrace: don't let the sysfs interface remove trace from running list
blk-mq: Use all available hardware queues
blk-mq: Micro-optimize bt_get()
blk-mq: Fix a race between bt_clear_tag() and bt_get()
blk-mq: Avoid that __bt_get_word() wraps multiple times
blk-mq: Fix a use-after-free
blk-mq: prevent unmapped hw queue from being scheduled
blk-mq: re-check for available tags after running the hardware queue
blk-mq: fix hang in bt_get()
blk-mq: move the kdump check to blk_mq_alloc_tag_set
blk-mq: cleanup tag free handling
blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
blk: introduce generic io stat accounting help function
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
genhd: check for int overflow in disk_expand_part_tbl()
blk-mq: add blk_mq_free_hctx_request()
blk-mq: export blk_mq_free_request()
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
...
After commit b2b49ccbdd (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
selected) PM_RUNTIME is always set if PM is set, so files that are
build conditionally if CONFIG_PM_RUNTIME is set may now be build
if CONFIG_PM is set.
Replace CONFIG_PM_RUNTIME with CONFIG_PM in kernel/trace/Makefile
for this reason.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org.
clean ups from that branch.
This code solves the issue of performing stack dumps from NMI context.
The issue is that printk() is not safe from NMI context as if the NMI
were to trigger when a printk() was being performed, the NMI could
deadlock from the printk() internal locks. This has been seen in practice.
With lots of review from Petr Mladek, this code went through several
iterations, and we feel that it is now at a point of quality to be
accepted into mainline.
Here's what is contained in this patch set:
o Creates a "seq_buf" generic buffer utility that allows a descriptor
to be passed around where functions can write their own "printk()"
formatted strings into it. The generic version was pulled out of
the trace_seq() code that was made specifically for tracing.
o The seq_buf code was change to model the seq_file code. I have
a patch (not included for 3.19) that converts the seq_file.c code
over to use seq_buf.c like the trace_seq.c code does. This was done
to make sure that seq_buf.c is compatible with seq_file.c. I may
try to get that patch in for 3.20.
o The seq_buf.c file was moved to lib/ to remove it from being dependent
on CONFIG_TRACING.
o The printk() was updated to allow for a per_cpu "override" of
the internal calls. That is, instead of writing to the console, a call
to printk() may do something else. This made it easier to allow the
NMI to change what printk() does in order to call dump_stack() without
needing to update that code as well.
o Finally, the dump_stack from all CPUs via NMI code was converted to
use the seq_buf code. The caller to trigger the NMI code would wait
till all the NMIs finished, and then it would print the seq_buf
data to the console safely from a non NMI context.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhbrnAAoJEEjnJuOKh9ldsCoIAJ3sKIJ5B3jxJJTCHPAx/lZD
GVbV1J1mu4kTAZuhJZOAxW8D6PZGZMyEjg0y6ScDEnBGcjAZ9gTiWCdakPktf9EX
GfaPPqwiL9dZ18J9Qc6uR+7M1Ffpzzwbcc6lJrpoTcjRgkoH9wCiLS9ozFQyYzWb
/7m5UbUM/PIk9WAjLYXPW6UUVtPTPT0RdEQKofMGTeah+vgqj4TXCOROdlxsXXWF
77vqBvPd5TUPWFH9ftzJGDtZS8SroXVKCu3fZIqHgzAU0yqwVtH/JzDTy9u2UYhX
GzDEPeAIdp6m6Uyc406VuIf1QW0gfBgmA0ir80vFoP27uFMM6j5HlF7azgQfx34=
=YBgA
-----END PGP SIGNATURE-----
Merge tag 'trace-seq-buf-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull nmi-safe seq_buf printk update from Steven Rostedt:
"This code is a fork from the trace-3.19 pull as it needed the
trace_seq clean ups from that branch.
This code solves the issue of performing stack dumps from NMI context.
The issue is that printk() is not safe from NMI context as if the NMI
were to trigger when a printk() was being performed, the NMI could
deadlock from the printk() internal locks. This has been seen in
practice.
With lots of review from Petr Mladek, this code went through several
iterations, and we feel that it is now at a point of quality to be
accepted into mainline.
Here's what is contained in this patch set:
- Creates a "seq_buf" generic buffer utility that allows a descriptor
to be passed around where functions can write their own "printk()"
formatted strings into it. The generic version was pulled out of
the trace_seq() code that was made specifically for tracing.
- The seq_buf code was change to model the seq_file code. I have a
patch (not included for 3.19) that converts the seq_file.c code
over to use seq_buf.c like the trace_seq.c code does. This was
done to make sure that seq_buf.c is compatible with seq_file.c. I
may try to get that patch in for 3.20.
- The seq_buf.c file was moved to lib/ to remove it from being
dependent on CONFIG_TRACING.
- The printk() was updated to allow for a per_cpu "override" of the
internal calls. That is, instead of writing to the console, a call
to printk() may do something else. This made it easier to allow
the NMI to change what printk() does in order to call dump_stack()
without needing to update that code as well.
- Finally, the dump_stack from all CPUs via NMI code was converted to
use the seq_buf code. The caller to trigger the NMI code would
wait till all the NMIs finished, and then it would print the
seq_buf data to the console safely from a non NMI context
One added bonus is that this code also makes the NMI dump stack work
on PREEMPT_RT kernels. As printk() includes sleeping locks on
PREEMPT_RT, printk() only writes to console if the console does not
use any rt_mutex converted spin locks. Which a lot do"
* tag 'trace-seq-buf-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
x86/nmi: Fix use of unallocated cpumask_var_t
printk/percpu: Define printk_func when printk is not defined
x86/nmi: Perform a safe NMI stack trace on all CPUs
printk: Add per_cpu printk func to allow printk to be diverted
seq_buf: Move the seq_buf code to lib/
seq-buf: Make seq_buf_bprintf() conditional on CONFIG_BINARY_PRINTF
tracing: Add seq_buf_get_buf() and seq_buf_commit() helper functions
tracing: Have seq_buf use full buffer
seq_buf: Add seq_buf_can_fit() helper function
tracing: Add paranoid size check in trace_printk_seq()
tracing: Use trace_seq_used() and seq_buf_used() instead of len
tracing: Clean up tracing_fill_pipe_page()
seq_buf: Create seq_buf_used() to find out how much was written
tracing: Add a seq_buf_clear() helper and clear len and readpos in init
tracing: Convert seq_buf fields to be like seq_file fields
tracing: Convert seq_buf_path() to be like seq_path()
tracing: Create seq_buf layer in trace_seq
to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to the
seq_file code as well in another tree.
Some of the other goodies include:
o Added some "!" (NOT) logic to the tracing filter.
o Fixed the frame pointer logic to the x86_64 mcount trampolines
o Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated
and be called directly by functions that only have a single hook
to them.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhbLGAAoJEEjnJuOKh9ldRV4H/3NcLbgGB2iu96la1zdYE6pG
Q7cDJMxXK80YIIL70h9G0IItcD4t62LMb72lfBnMGRj3msgFb3AgISW57EuI0Pxk
xk24wuIPoTG2S7v9sc3SboNFwO8qbtIjxD2OBmqIUrGo2sZIiGjyj3gX7mCY3uzL
WB2bUOSFz/22OgaANinR5EELHA3pZZCf54Vz1K9ndmtK0xp0j1a7xJShD6TrMdYv
mZ3zH5ViIhW4A3mdcMceh6fy2JLQAiEKF0uPTvcMMz7NlVul0mxyL/+10P7AE/3R
Ehw4fzmm4NDshPDtBOkKH0LsppgXzuItFuQUTpact3JlqTg++bV6onSsrkt1hlY=
=Z7Cm
-----END PGP SIGNATURE-----
Merge tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"There was a lot of clean ups and minor fixes. One of those clean ups
was to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to
the seq_file code as well in another tree.
Some of the other goodies include:
- Added some "!" (NOT) logic to the tracing filter.
- Fixed the frame pointer logic to the x86_64 mcount trampolines
- Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated and be
called directly by functions that only have a single hook to them"
* tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits)
tracing: Truncated output is better than nothing
tracing: Add additional marks to signal very large time deltas
Documentation: describe trace_buf_size parameter more accurately
tracing: Allow NOT to filter AND and OR clauses
tracing: Add NOT to filtering logic
ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter
ftrace/x86: Get rid of ftrace_caller_setup
ftrace/x86: Have save_mcount_regs macro also save stack frames if needed
ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs
ftrace/x86: Simplify save_mcount_regs on getting RIP
ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter
ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments
ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file
ftrace/x86: Have static tracing also use ftrace_caller_setup
ftrace/x86: Have static function tracing always test for function graph
kprobes: Add IPMODIFY flag to kprobe_ftrace_ops
ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict
kprobes/ftrace: Recover original IP if pre_handler doesn't change it
tracing/trivial: Fix typos and make an int into a bool
tracing: Deletion of an unnecessary check before iput()
...
Currently, blktrace can be started/stopped via its ioctl-based interface
(used by the userspace blktrace tool) or via its ftrace interface. The
function blk_trace_remove_queue(), called each time an "enable" tunable
of the ftrace interface transitions to zero, removes the trace from the
running list, even if no function from the sysfs interface adds it to
such a list. This leads to a null pointer dereference. This commit
changes the blk_trace_remove_queue() function so that it does not remove
the blk_trace from the running list.
v2:
- Now the patch removes the invocation of list_del() instead of
adding an useless if branch, as suggested by Namhyung Kim.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The initial reason for this patch is that I noticed that:
if (len > TRACE_BUF_SIZE)
is off by one. In this code, if len == TRACE_BUF_SIZE, then it means we
have truncated the last character off the output string. If we truncate
two or more characters then we exit without printing.
After some discussion, we decided that printing truncated data is better
than not printing at all so we should just use vscnprintf() and remove
the test entirely. Also I have updated memcpy() to copy the NUL char
instead of setting the NUL in a separate step.
Link: http://lkml.kernel.org/r/20141127155752.GA21914@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, function graph tracer prints "!" or "+" just before
function execution time to signal a function overhead, depending
on the time. And some tracers tracing latency also print "!" or
"+" just after time to signal overhead, depending on the interval
between events. Even it is usually enough to do that, we sometimes
need to signal for bigger execution time than 100 micro seconds.
For example, I used function graph tracer to detect if there is
any case that exit_mm() takes too much time. I did following steps
in /sys/kernel/debug/tracing. It was easier to detect very large
excution time with patched kernel than with original kernel.
$ echo exit_mm > set_graph_function
$ echo function_graph > current_tracer
$ echo > trace
$ cat trace_pipe > $LOGFILE
... (do something and terminate logging)
$ grep "\\$" $LOGFILE
3) $ 22082032 us | } /* kernel_map_pages */
3) $ 22082040 us | } /* free_pages_prepare */
3) $ 22082113 us | } /* free_hot_cold_page */
3) $ 22083455 us | } /* free_hot_cold_page_list */
3) $ 22083895 us | } /* release_pages */
3) $ 22177873 us | } /* free_pages_and_swap_cache */
3) $ 22178929 us | } /* unmap_single_vma */
3) $ 22198885 us | } /* unmap_vmas */
3) $ 22206949 us | } /* exit_mmap */
3) $ 22207659 us | } /* mmput */
3) $ 22207793 us | } /* exit_mm */
And then, it was easy to find out that a schedule-out occured by
sub_preempt_count() within kernel_map_pages().
To detect very large function exection time caused by either problematic
function implementation or scheduling issues, this patch can be useful.
Link: http://lkml.kernel.org/r/1416789259-24038-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add support to allow not "!" for and (&&) and (||). That is:
!(field1 == X && field2 == Y)
Where the value of the full clause will be notted.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Ted noticed that he could not filter on an event for a bit being cleared.
That's because the filtering logic only tests event fields with a limited
number of comparisons which, for bit logic, only include "&", which can
test if a bit is set, but there's no good way to see if a bit is clear.
This adds a way to do: !(field & 2048)
Which returns true if the bit is not set, and false otherwise.
Note, currently !(field1 == 10 && field2 == 15) is not supported.
That is, the 'not' only works for direct comparisons, not for the
AND and OR logic.
Link: http://lkml.kernel.org/r/20141202021912.GA29096@thunk.org
Link: http://lkml.kernel.org/r/20141202120430.71979060@gandalf.local.home
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Introduce FTRACE_OPS_FL_IPMODIFY to avoid conflict among
ftrace users who may modify regs->ip to change the execution
path. If two or more users modify the regs->ip on the same
function entry, one of them will be broken. So they must add
IPMODIFY flag and make sure that ftrace_set_filter_ip() succeeds.
Note that ftrace doesn't allow ftrace_ops which has IPMODIFY
flag to have notrace hash, and the ftrace_ops must have a
filter hash (so that the ftrace_ops can hook only specific
entries), because it strongly depends on the address and
must be allowed for only few selected functions.
Link: http://lkml.kernel.org/r/20141121102516.11844.27829.stgit@localhost.localdomain
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Vojtech Pavlik <vojtech@suse.cz>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ fixed up some of the comments ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix up a few typos in comments and convert an int into a bool in
update_traceon_count().
Link: http://lkml.kernel.org/r/546DD445.5080108@hitachi.com
Suggested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The seq_buf functions are rather useful outside of tracing. Instead
of having it be dependent on CONFIG_TRACING, move the code into lib/
and allow other users to have access to it even when tracing is not
configured.
The seq_buf utility is similar to the seq_file utility, but instead of
writing sending data back up to userland, it writes it into a buffer
defined at seq_buf_init(). This allows us to send a descriptor around
that writes printf() formatted strings into it that can be retrieved
later.
It is currently used by the tracing facility for such things like trace
events to convert its binary saved data in the ring buffer into an
ASCII human readable context to be displayed in /sys/kernel/debug/trace.
It can also be used for doing NMI prints safely from NMI context into
the seq_buf and retrieved later and dumped to printk() safely. Doing
printk() from an NMI context is dangerous because an NMI can preempt
a current printk() and deadlock on it.
Link: http://lkml.kernel.org/p/20140619213952.058255809@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function bstr_printf() from lib/vsprnintf.c is only available if
CONFIG_BINARY_PRINTF is defined. This is due to the only user currently
being the tracing infrastructure, which needs to select this config
when tracing is configured. Until there is another user of the binary
printf formats, this will continue to be the case.
Since seq_buf.c is now lives in lib/ and is compiled even without
tracing, it must encompass its use of bstr_printf() which is used
by seq_buf_printf(). This too is only used by the tracing infrastructure
and is still encapsulated by the CONFIG_BINARY_PRINTF.
Link: http://lkml.kernel.org/r/20141104160222.969013383@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add two helper functions; seq_buf_get_buf() and seq_buf_commit() that
are used by seq_buf_path(). This makes the code similar to the
seq_file: seq_path() function, and will help to be able to consolidate
the functions between seq_file and trace_seq.
Link: http://lkml.kernel.org/r/20141104160222.644881406@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.977571447@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently seq_buf is full when all but one byte of the buffer is
filled. Change it so that the seq_buf is full when all of the
buffer is filled.
Some of the functions would fill the buffer completely and report
everything was fine. This was inconsistent with the max of size - 1.
Changing this to be max of size makes all functions consistent.
Link: http://lkml.kernel.org/r/20141104160222.502133196@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.811957882@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a seq_buf_can_fit() helper function that removes the possible mistakes
of comparing the seq_buf length plus added data compared to the size of
the buffer.
Link: http://lkml.kernel.org/r/20141118164025.GL23958@pathway.suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To be really paranoid about writing out of bound data in
trace_printk_seq(), add another check of len compared to size.
Link: http://lkml.kernel.org/r/20141119144004.GB2332@dhcp128.suse.cz
Suggested-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the seq_buf->len will soon be +1 size when there's an overflow, we
must use trace_seq_used() or seq_buf_used() methods to get the real
length. This will prevent buffer overflow issues if just the len
of the seq_buf descriptor is used to copy memory.
Link: http://lkml.kernel.org/r/20141114121911.09ba3d38@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing_fill_pipe_page() logic is a little confusing with the
use of count saving the seq.len and reusing it.
Instead of subtracting a number that is calculated from the saved
value of the seq.len from seq.len, just save the seq.len at the start
and if we need to reset it, just assign it again.
When the seq_buf overflow is len == size + 1, the current logic will
break. Changing it to use a saved length for resetting back to the
original value is more robust and will work when we change the way
seq_buf sets the overflow.
Link: http://lkml.kernel.org/r/20141118161546.GJ23958@pathway.suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Rewrite seq_buf_path() like it is done in seq_path() and allow
it to accept any escape character instead of just "\n".
Making seq_buf_path() like seq_path() will help prevent problems
when converting seq_file to use the seq_buf logic.
Link: http://lkml.kernel.org/r/20141104160222.048795666@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.338523371@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Create a seq_buf layer that trace_seq sits on. The seq_buf will not
be limited to page size. This will allow other usages of seq_buf
instead of a hard set PAGE_SIZE one that trace_seq has.
Link: http://lkml.kernel.org/r/20141104160221.864997179@goodmis.org
Link: http://lkml.kernel.org/r/20141114011412.170377300@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/5468F875.7080907@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the trace_seq of ftrace_raw_output_prep() is full this function
returns TRACE_TYPE_PARTIAL_LINE, otherwise it returns zero.
The problem is that TRACE_TYPE_PARTIAL_LINE happens to be zero!
The thing is, the caller of ftrace_raw_output_prep() expects a
success to be zero. Change that to expect it to be
TRACE_TYPE_HANDLED.
Link: http://lkml.kernel.org/r/20141114112522.GA2988@dhcp128.suse.cz
Reminded-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_seq_printf() and friends are used to store strings into a buffer
that can be passed around from function to function. If the trace_seq buffer
fills up, it will not print any more. The return values were somewhat
inconsistant and using trace_seq_has_overflowed() was a better way to know
if the write to the trace_seq buffer succeeded or not.
Now that all users have removed reading the return value of the printf()
type functions, they can safely return void and keep future users of them
from reading the inconsistent values as well.
Link: http://lkml.kernel.org/r/20141114011411.992510720@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will not be returning values
soon and will be void functions. To know if they succeeded or not, the
functions trace_seq_has_overflowed() and trace_handle_return() should be
used instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will soon no longer have
return values. Using trace_seq_has_overflowed() and trace_handle_return()
should be used instead.
Link: http://lkml.kernel.org/r/20141114011411.693008134@goodmis.org
Link: http://lkml.kernel.org/r/20141115050602.333705855@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions trace_seq_printf() and friends will soon not have a return
value and will only be a void function. Use trace_seq_has_overflowed()
instead to know if the trace_seq operations succeeded or not.
Link: http://lkml.kernel.org/r/20141114011411.530216306@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The return values for trace_seq_printf() and friends are going to be
removed and they will become void functions. The mmio tracer checked
their return and even did so incorrectly.
Some of the funtions which returned the values were never checked
themselves. Removing all the checks simplifies the code.
Use trace_seq_has_overflowed() and trace_handle_return() where
necessary instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of checking the return value of trace_seq_printf() and friends
for overflowing of the buffer, use the trace_seq_has_overflowed() helper
function.
This cleans up the code quite a bit and also takes us a step closer to
changing the return values of trace_seq_printf() and friends to void.
Link: http://lkml.kernel.org/r/20141114011411.181812785@goodmis.org
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of doing individual checks all over the place that makes the code
very messy. Just check trace_seq_has_overflowed() at the end or in
strategic places.
This makes the code much cleaner and also helps with getting closer
to removing the return values of trace_seq_printf() and friends.
Link: http://lkml.kernel.org/r/20141114011410.987913836@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The branch tracer should not be checking the trace_seq_printf() return value
as that will soon be void. There's a new trace_handle_return() helper function
that will return TRACE_TYPE_PARTIAL_LINE if the trace_seq overflowed
and TRACE_TYPE_HANDLED otherwise.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove checking the return value of all trace_seq_puts(). It was wrong
anyway as only the last return value mattered. But as the trace_seq_puts()
is going to be a void function in the future, we should not be checking
the return value of it anyway.
Just return !trace_seq_has_overflowed() instead.
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Checking the return code of every trace_seq_printf() operation and having
to return early if it overflowed makes the code messy.
Using the new trace_seq_has_overflowed() and trace_handle_return() functions
allows us to clean up the code.
In the future, trace_seq_printf() and friends will be turning into void
functions and not returning a value. The trace_seq_has_overflowed() is to
be used instead. This cleanup allows that change to take place.
Cc: Jens Axboe <axboe@fb.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding a trace_seq_has_overflowed() which returns true if the trace_seq
had too much written into it allows us to simplify the code.
Instead of checking the return value of every call to trace_seq_printf()
and friends, they can all be called normally, and at the end we can
return !trace_seq_has_overflowed() instead.
Several functions also return TRACE_TYPE_PARTIAL_LINE when the trace_seq
overflowed and TRACE_TYPE_HANDLED otherwise. Another helper function
was created called trace_handle_return() which takes a trace_seq and
returns these enums. Using this helper function also simplifies the
code.
This change also makes it possible to remove the return values of
trace_seq_printf() and friends. They should instead just be
void functions.
Link: http://lkml.kernel.org/r/20141114011410.365183157@goodmis.org
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In trace_seq_bitmask() it calls bitmap_scnprintf() not from the current
position of the trace_seq buffer (s->buffer + s->len), but instead from
the beginning of the buffer (s->buffer).
Luckily, the only user of this "ipi_raise tracepoint" uses it as the
first parameter, and as such, the start of the temp buffer in
include/trace/ftrace.h (see __get_bitmask()).
Reported-by: Petr Mladek <pmladek@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Stack traces that happen from function tracing check if the address
on the stack is a __kernel_text_address(). That is, is the address
kernel code. This calls core_kernel_text() which returns true
if the address is part of the builtin kernel code. It also calls
is_module_text_address() which returns true if the address belongs
to module code.
But what is missing is ftrace dynamically allocated trampolines.
These trampolines are allocated for individual ftrace_ops that
call the ftrace_ops callback functions directly. But if they do a
stack trace, the code checking the stack wont detect them as they
are neither core kernel code nor module address space.
Adding another field to ftrace_ops that also stores the size of
the trampoline assigned to it we can create a new function called
is_ftrace_trampoline() that returns true if the address is a
dynamically allocate ftrace trampoline. Note, it ignores trampolines
that are not dynamically allocated as they will return true with
the core_kernel_text() function.
Link: http://lkml.kernel.org/r/20141119034829.497125839@goodmis.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function probe counting for traceon and traceoff suffered a race
condition where if the probe was executing on two or more CPUs at the
same time, it could decrement the counter by more than one when
disabling (or enabling) the tracer only once.
The way the traceon and traceoff probes are suppose to work is that
they disable (or enable) tracing once per count. If a user were to
echo 'schedule:traceoff:3' into set_ftrace_filter, then when the
schedule function was called, it would disable tracing. But the count
should only be decremented once (to 2). Then if the user enabled tracing
again (via tracing_on file), the next call to schedule would disable
tracing again and the count would be decremented to 1.
But if multiple CPUS called schedule at the same time, it is possible
that the count would be decremented more than once because of the
simple "count--" used.
By reading the count into a local variable and using memory barriers
we can guarantee that the count would only be decremented once per
disable (or enable).
The stack trace probe had a similar race, but here the stack trace will
decrement for each time it is called. But this had the read-modify-
write race, where it could stack trace more than the number of times
that was specified. This case we use a cmpxchg to stack trace only the
number of times specified.
The dump probes can still use the old "update_count()" function as
they only run once, and that is controlled by the dump logic
itself.
Link: http://lkml.kernel.org/r/20141118134643.4b550ee4@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Usually, "msecs" notation means milli-seconds, and "usecs" notation
means micro-seconds. Since the unit used in the code is micro-seconds,
the notation should be replaced from msecs to usecs.
Link: http://lkml.kernel.org/r/1415171926-9782-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On the function_graph tracer, the print_graph_irq() function prints a
trace line with the flag ==========> on an irq handler entry, and the
flag <========== on an irq handler return.
But when the latency-format is enable, it is not printing the
latency-format flags, causing the following error in the trace output:
0) ==========> |
0) d... | smp_apic_timer_interrupt() {
This patch fixes this issue by printing the latency-format flags when
it is enable.
Link: http://lkml.kernel.org/r/7c2e226dac20c940b6242178fab7f0e3c9b5ce58.1415233316.git.bristot@redhat.com
Reviewed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Printing a single character to a seqfile might as well be done with
seq_putc instead of seq_puts; this avoids a strlen() call and a memory
access. It also shaves another few bytes off the generated code.
Link: http://lkml.kernel.org/r/1415479332-25944-4-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Consecutive seq_puts calls with literal strings can be merged to a
single call. This reduces the size of the generated code, and can also
lead to slight .rodata reduction (because of fewer nul and padding
bytes). It should also shave a off a few clock cycles.
Link: http://lkml.kernel.org/r/1415479332-25944-3-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using seq_printf to print a simple string or a single character is a
lot more expensive than it needs to be, since seq_puts and seq_putc
exist.
These patches do
seq_printf(m, s) -> seq_puts(m, s)
seq_printf(m, "%s", s) -> seq_puts(m, s)
seq_printf(m, "%c", c) -> seq_putc(m, c)
Subsequent patches will simplify further.
Link: http://lkml.kernel.org/r/1415479332-25944-2-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently kdb's ftdump command will livelock by constantly printk'ing
the empty string at KERN_EMERG level if it run when the ftrace system is
not in use. This occurs because trace_empty() never returns false when
the ring buffers are left at the start of a non-consuming read [launched
by ring_buffer_read_start()].
This patch changes the loop exit condition to use the result of
trace_find_next_entry_inc(). Effectively this switches the non-consuming
kdb dumper to follow the approach of the non-consuming userspace
interface [s_next()] rather than the consuming ftrace_dump().
Link: http://lkml.kernel.org/r/1415277716-19419-3-git-send-email-daniel.thompson@linaro.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently kdb's ftdump command unconditionally crashes due to a null
pointer de-reference whenever the command is run. This in turn causes
the kernel to panic.
The abridged stacktrace (gathered with ARCH=arm) is:
--- cut here ---
[<c09535ac>] (panic) from [<c02132dc>] (die+0x264/0x440)
[<c02132dc>] (die) from [<c0952eb8>]
(__do_kernel_fault.part.11+0x74/0x84)
[<c0952eb8>] (__do_kernel_fault.part.11) from [<c021f954>]
(do_page_fault+0x1d0/0x3c4)
[<c021f954>] (do_page_fault) from [<c020846c>] (do_DataAbort+0x48/0xac)
[<c020846c>] (do_DataAbort) from [<c0213c58>] (__dabt_svc+0x38/0x60)
Exception stack(0xc0deba88 to 0xc0debad0)
ba80: e8c29180 00000001 e9854304 e9854300 c0f567d8
c0df2580
baa0: 00000000 00000000 00000000 c0f117b8 c0e3a3c0 c0debb0c 00000000
c0debad0
bac0: 0000672e c02f4d60 60000193 ffffffff
[<c0213c58>] (__dabt_svc) from [<c02f4d60>] (kdb_ftdump+0x1e4/0x3d8)
[<c02f4d60>] (kdb_ftdump) from [<c02ce328>] (kdb_parse+0x2b8/0x698)
[<c02ce328>] (kdb_parse) from [<c02ceef0>] (kdb_main_loop+0x52c/0x784)
[<c02ceef0>] (kdb_main_loop) from [<c02d1b0c>] (kdb_stub+0x238/0x490)
--- cut here ---
The NULL deref occurs due to the initialized use of struct trace_iter's
buffer_iter member.
This is a regression, albeit a fairly elderly one. It was introduced
by commit 6d158a813e ("tracing: Remove NR_CPUS array from
trace_iterator").
This patch solves this by providing a collection of ring_buffer_iter(s)
and using this to initialize buffer_iter. Note that static allocation
is used solely because the trace_iter itself is also static allocated.
Static allocation also means that we have to NULL-ify the pointer during
cleanup to avoid use-after-free problems.
Link: http://lkml.kernel.org/r/1415277716-19419-2-git-send-email-daniel.thompson@linaro.org
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to the documentation, adding "traceoff_on_warning" to the boot
command line should be enough to enable the feature. But right now it is
necessary to specify "traceoff_on_warning=". Along with fixing that, also
verify if the value passed, if any, is either "0" or "off".
Link: http://lkml.kernel.org/r/20141112231400.GL12281@uudg.org
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new logic, if only a single user of ftrace function hooks is
used, it will get its own trampoline assigned to it.
The problem is that the control_ops is an indirect ops that perf ops
uses. What that means is that when perf registers its ops with
register_ftrace_function(), it has the CONTROL flag set and gets added
to the control list instead of the global ftrace list. The control_ops
gets added to that instead and the mcount trampoline calls the control_ops
function. The control_ops function will iterate the control list and
call the ops functions that are attached to it.
But currently the trampoline is added to the perf ops and not the
control ops, and when ftrace tries to find a trampoline hook for it,
it fails to find one and gives the following splat:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 10133 at kernel/trace/ftrace.c:2033 ftrace_get_addr_new+0x6f/0xc0()
Modules linked in: [...]
CPU: 0 PID: 10133 Comm: perf Tainted: P 3.18.0-rc1-test+ #388
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
00000000000007f1 ffff8800c2643bc8 ffffffff814fca6e ffff88011ea0ed01
0000000000000000 ffff8800c2643c08 ffffffff81041ffd 0000000000000000
ffffffff810c388c ffffffff81a5a350 ffff880119b00000 ffffffff810001c8
Call Trace:
[<ffffffff814fca6e>] dump_stack+0x46/0x58
[<ffffffff81041ffd>] warn_slowpath_common+0x81/0x9b
[<ffffffff810c388c>] ? ftrace_get_addr_new+0x6f/0xc0
[<ffffffff810001c8>] ? 0xffffffff810001c8
[<ffffffff81042031>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810c388c>] ftrace_get_addr_new+0x6f/0xc0
[<ffffffff8102e938>] ftrace_replace_code+0xd6/0x334
[<ffffffff810c4116>] ftrace_modify_all_code+0x41/0xc5
[<ffffffff8102eba6>] arch_ftrace_update_code+0x10/0x19
[<ffffffff810c293c>] ftrace_run_update_code+0x21/0x42
[<ffffffff810c298f>] ftrace_startup_enable+0x32/0x34
[<ffffffff810c3049>] ftrace_startup+0x14e/0x15a
[<ffffffff810c307c>] register_ftrace_function+0x27/0x40
[<ffffffff810dc118>] perf_ftrace_event_register+0x3e/0xee
[<ffffffff810dbfbe>] perf_trace_init+0x29d/0x2a9
[<ffffffff810eb422>] perf_tp_event_init+0x27/0x3a
[<ffffffff810f18bc>] perf_init_event+0x9e/0xed
[<ffffffff810f1ba4>] perf_event_alloc+0x299/0x330
[<ffffffff810f236b>] SYSC_perf_event_open+0x3ee/0x816
[<ffffffff8115a066>] ? mntput+0x2d/0x2f
[<ffffffff81142b00>] ? __fput+0xa7/0x1b2
[<ffffffff81091300>] ? do_gettimeofday+0x22/0x3a
[<ffffffff810f279c>] SyS_perf_event_open+0x9/0xb
[<ffffffff81502a92>] system_call_fastpath+0x12/0x17
---[ end trace 81a53565150e4982 ]---
Bad trampoline accounting at: ffffffff810001c8 (run_init_process+0x0/0x2d) (10000001)
Update the control_ops trampoline instead of the perf ops one.
Reported-by: lkp@01.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only code that references tracing_sched_switch_trace() and
tracing_sched_wakeup_trace() is the wakeup latency tracer. Those
two functions use to belong to the sched_switch tracer which has
long been removed. These functions were left behind because the
wakeup latency tracer used them. But since the wakeup latency tracer
is the only one to use them, they should be static functions inside
that code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After the previous patch it is clear that "tracer_enabled" can never be
true, we can remove the "if (tracer_enabled)" code in probe_sched_switch()
and probe_sched_wakeup(). Plus we can obviously remove tracer_enabled,
ctx_trace, and sched_stopped as well.
Link: http://lkml.kernel.org/p/20140723193503.GA30217@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_{start,stop}_sched_switch_record() have no callers since
87d80de280 "tracing: Remove obsolete sched_switch tracer".
The last caller of tracing_sched_switch_assign_trace() was removed
by 30dbb20e68 "tracing: Remove boot tracer".
Link: http://lkml.kernel.org/p/20140723193501.GA30214@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the introduction of the dynamic trampolines, it is useful that if
things go wrong that ftrace_bug() produces more information about what
the current state is. This can help debug issues that may arise.
Ftrace has lots of checks to make sure that the state of the system it
touchs is exactly what it expects it to be. When it detects an abnormality
it calls ftrace_bug() and disables itself to prevent any further damage.
It is crucial that ftrace_bug() produces sufficient information that
can be used to debug the situation.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the static ftrace_ops (like function tracer) enables tracing, and it
is the only callback that is referencing a function, a trampoline is
dynamically allocated to the function that calls the callback directly
instead of calling a loop function that iterates over all the registered
ftrace ops (if more than one ops is registered).
But when it comes to dynamically allocated ftrace_ops, where they may be
freed, on a CONFIG_PREEMPT kernel there's no way to know when it is safe
to free the trampoline. If a task was preempted while executing on the
trampoline, there's currently no way to know when it will be off that
trampoline.
But this is not true when it comes to !CONFIG_PREEMPT. The current method
of calling schedule_on_each_cpu() will force tasks off the trampoline,
becaues they can not schedule while on it (kernel preemption is not
configured). That means it is safe to free a dynamically allocated
ftrace ops trampoline when CONFIG_PREEMPT is not configured.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch introduces several new flags to collect kdb commands into
groups (later allowing them to be optionally disabled).
This follows similar prior art to enable/disable magic sysrq
commands.
The commands have been categorized as follows:
Always on: go (w/o args), env, set, help, ?, cpu (w/o args), sr,
dmesg, disable_nmi, defcmd, summary, grephelp
Mem read: md, mdr, mdp, mds, ef, bt (with args), per_cpu
Mem write: mm
Reg read: rd
Reg write: go (with args), rm
Inspect: bt (w/o args), btp, bta, btc, btt, ps, pid, lsmod
Flow ctrl: bp, bl, bph, bc, be, bd, ss
Signal: kill
Reboot: reboot
All: cpu, kgdb, (and all of the above), nmi_console
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Since we now treat KDB_REPEAT_* as flags, there is no need to
pass KDB_REPEAT_NONE. It's just the default behaviour when no
flags are specified.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
We're about to add more options for commands behaviour, so let's give
a more generic name to the low-level kdb command registration function.
There are just various renames, no functional changes.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
If the read loop in trace_buffers_splice_read() keeps failing due to
memory allocation failures without reading even a single page then this
function will keep busy looping.
Remove the risk for that by exiting the function if memory allocation
failures are seen.
Link: http://lkml.kernel.org/r/1415309167-2373-2-git-send-email-rabin@rab.in
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
lockup:
# trace-cmd record -e raw_syscalls:* -F false
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
...
Call Trace:
[<ffffffff8105b580>] ? __wake_up_common+0x90/0x90
[<ffffffff81092e25>] wait_on_pipe+0x35/0x40
[<ffffffff810936e3>] tracing_buffers_splice_read+0x2e3/0x3c0
[<ffffffff81093300>] ? tracing_stats_read+0x2a0/0x2a0
[<ffffffff812d10ab>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff810dc87b>] ? do_read_fault+0x21b/0x290
[<ffffffff810de56a>] ? handle_mm_fault+0x2ba/0xbd0
[<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
[<ffffffff810951e2>] ? trace_buffer_lock_reserve+0x22/0x60
[<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
[<ffffffff8112415d>] do_splice_to+0x6d/0x90
[<ffffffff81126971>] SyS_splice+0x7c1/0x800
[<ffffffff812d1edd>] tracesys_phase2+0xd3/0xd8
The problem is this: tracing_buffers_splice_read() calls
ring_buffer_wait() to wait for data in the ring buffers. The buffers
are not empty so ring_buffer_wait() returns immediately. But
tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
meaning it only wants to read a full page. When the full page is not
available, tracing_buffers_splice_read() tries to wait again with
ring_buffer_wait(), which again returns immediately, and so on.
Fix this by adding a "full" argument to ring_buffer_wait() which will
make ring_buffer_wait() wait until the writer has left the reader's
page, i.e. until full-page reads will succeed.
Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in
Cc: stable@vger.kernel.org # 3.16+
Fixes: b1169cc69b ("tracing: Remove mock up poll wait function")
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The file /sys/kernel/debug/tracing/eneabled_functions is used to debug
ftrace function hooks. Add to the output what function is being called
by the trampoline if the arch supports it.
Add support for this feature in x86_64.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current method of handling multiple function callbacks is to register
a list function callback that calls all the other callbacks based on
their hash tables and compare it to the function that the callback was
called on. But this is very inefficient.
For example, if you are tracing all functions in the kernel and then
add a kprobe to a function such that the kprobe uses ftrace, the
mcount trampoline will switch from calling the function trace callback
to calling the list callback that will iterate over all registered
ftrace_ops (in this case, the function tracer and the kprobes callback).
That means for every function being traced it checks the hash of the
ftrace_ops for function tracing and kprobes, even though the kprobes
is only set at a single function. The kprobes ftrace_ops is checked
for every function being traced!
Instead of calling the list function for functions that are only being
traced by a single callback, we can call a dynamically allocated
trampoline that calls the callback directly. The function graph tracer
already uses a direct call trampoline when it is being traced by itself
but it is not dynamically allocated. It's trampoline is static in the
kernel core. The infrastructure that called the function graph trampoline
can also be used to call a dynamically allocated one.
For now, only ftrace_ops that are not dynamically allocated can have
a trampoline. That is, users such as function tracer or stack tracer.
kprobes and perf allocate their ftrace_ops, and until there's a safe
way to free the trampoline, it can not be used. The dynamically allocated
ftrace_ops may, although, use the trampoline if the kernel is not
compiled with CONFIG_PREEMPT. But that will come later.
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls. If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.
# trace-cmd record -e raw_syscalls:* true && trace-cmd report
...
true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
true-653 [000] 384.675988: sys_exit: NR 983045 = 0
...
# trace-cmd record -e syscalls:* true
[ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
[ 17.289590] pgd = 9e71c000
[ 17.289696] [aaaaaace] *pgd=00000000
[ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 17.290169] Modules linked in:
[ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
[ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
[ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
[ 17.290866] LR is at syscall_trace_enter+0x124/0x184
Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
Commit cd0980fc8a "tracing: Check invalid syscall nr while tracing syscalls"
added the check for less than zero, but it should have also checked
for greater than NR_syscalls.
Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in
Fixes: cd0980fc8a "tracing: Check invalid syscall nr while tracing syscalls"
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When modifying code, ftrace has several checks to make sure things
are being done correctly. One of them is to make sure any code it
modifies is exactly what it expects it to be before it modifies it.
In order to do so with the new trampoline logic, it must be able
to find out what trampoline a function is hooked to in order to
see if the code that hooks to it is what's expected.
The logic to find the trampoline from a record (accounting descriptor
for a function that is hooked) needs to only look at the "old_hash"
of an ops that is being modified. The old_hash is the list of function
an ops is hooked to before its update. Since a record would only be
pointing to an ops that is being modified if it was already hooked
before.
Currently, it can pick a modified ops based on its new functions it
will be hooked to, and this picks the wrong trampoline and causes
the check to fail, disabling ftrace.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace: squash into ordering of ops for modification
The code that checks for trampolines when modifying function hooks
tests against a modified ops "old_hash". But the ops old_hash pointer
is not being updated before the changes are made, making it possible
to not find the right hash to the callback and possibly causing
ftrace to break in accounting and disable itself.
Have the ops set its old_hash before the modifying takes place.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)
- various sched/deadline fixes
... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
code screem nasty warnings:
WARNING: CPU: 0 PID: 91 at kernel/sched/core.c:7253 __might_sleep+0x9a/0x378()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d79b511>] event_test_thread+0x48/0x93
Modules linked in:
CPU: 0 PID: 91 Comm: test-events Not tainted 3.17.0-rc7-00109-g2f85d18 #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
0000000000000000 ffff880010ec3c80 ffffffff8c696943 ffff880010ec3cb8
ffffffff8be7cae5 ffffffff8bead236 0000000000000001 ffff88001161fa01
0000000000000001 0000000000000000 ffff880010ec3d20 ffffffff8be7cb46
Call Trace:
[<ffffffff8c696943>] dump_stack+0x19/0x1b
[<ffffffff8be7cae5>] warn_slowpath_common+0x8f/0xa8
[<ffffffff8bead236>] ? __might_sleep+0x9a/0x378
[<ffffffff8be7cb46>] warn_slowpath_fmt+0x48/0x50
[<ffffffff8be0dd55>] ? sched_clock+0x9/0xd
[<ffffffff8d79b511>] ? event_test_thread+0x48/0x93
[<ffffffff8d79b511>] ? event_test_thread+0x48/0x93
[<ffffffff8bead236>] __might_sleep+0x9a/0x378
[<ffffffff8c6a0227>] down_read+0x26/0x98
[<ffffffff8be8f503>] exit_signals+0x27/0x1c2
[<ffffffff8be7fedd>] do_exit+0x193/0x10bd
[<ffffffff8bfd1969>] ? kfree+0x4a0/0x4d7
[<ffffffff8d79b4c9>] ? event_trace_self_tests+0x6d7/0x6d7
[<ffffffff8d79b4c9>] ? event_trace_self_tests+0x6d7/0x6d7
[<ffffffff8bea4b65>] kthread+0x156/0x156
[<ffffffff8c69c0f8>] ? wait_for_common+0x3e/0x224
[<ffffffff8bea4a0f>] ? insert_kthread_work+0xe7/0xe7
[<ffffffff8c6a353a>] ret_from_fork+0x7a/0xb0
[<ffffffff8bea4a0f>] ? insert_kthread_work+0xe7/0xe7
---[ end trace 14d02ef17adbc114 ]---
These are triggered by some self tests that run at start up when
configure in. Although the code is technically correct, they are a little
sloppy and not very robust. They work now because it runs at boot up
and the tests do not call anything that might trigger a spurious
wake up. But that doesn't mean those tests wont change in the future.
It's best to clean them now to make sure the tests used to test the
internal workings of the system don't cause breakage themselves.
This also quiets the warnings made by the new checks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUNrcAAAoJEKQekfcNnQGu+oYH/3NaLEKOwQU+x0aL/rfSFB86
qtIq3X4iHGekGjrlN38N2Z36izI9AoYuGrWYReMFy1VcvnRxPAl1mc0y0dZfdW/C
KRLwKTAu0t78Ab8vzyXVDxS+Bs/zEi6V/8HykBFbCthiDz5IbTvxCoeS19O/X9CU
ptVKllUlywjKQD5UMiJwk7eOB5GspOeBgNu9MOh61ZfbYBVsl1hPqmD0gEaSH2Me
wLyDlIyc0P9dfeYeaqYblkiBaXLk2urZDU2Enffi1aueEwwWuN5x+DPGc6d6nGQW
fnworqoiYzz+maQoASwaLdCfJAP3cX5Ye7qWQk7QEtp4Ypdh5j7EacAf9pKEJg8=
=goKt
-----END PGP SIGNATURE-----
Merge tag 'trace-3.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Seems that Peter Zijlstra added a new check that is making old code
scream nasty warnings:
WARNING: CPU: 0 PID: 91 at kernel/sched/core.c:7253 __might_sleep+0x9a/0x378()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d79b511>] event_test_thread+0x48/0x93
Call Trace:
__might_sleep+0x9a/0x378
down_read+0x26/0x98
exit_signals+0x27/0x1c2
do_exit+0x193/0x10bd
kthread+0x156/0x156
ret_from_fork+0x7a/0xb0
These are triggered by some self tests that run at start up when
configure in. Although the code is technically correct, they are a
little sloppy and not very robust. They work now because it runs at
boot up and the tests do not call anything that might trigger a
spurious wake up. But that doesn't mean those tests wont change in
the future.
It's best to clean them now to make sure the tests used to test the
internal workings of the system don't cause breakage themselves.
This also quiets the warnings made by the new checks"
* tag 'trace-3.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Clean up scheduling in trace_wakeup_test_thread()
tracing: Robustify wait loop
of the trampoline logic.
The trampoline logic of 3.17 required a descriptor for every function
that is registered to be traced and uses a trampoline. Currently,
only the function graph tracer uses a trampoline, but if you were
to trace all 32,000 (give or take a few thousand) functions with the
function graph tracer, it would create 32,000 descriptors to let us
know that there's a trampoline associated with it. This takes up a bit
of memory when there's a better way to do it.
The redesign now reuses the ftrace_ops' (what registers the function graph
tracer) hash tables. The hash tables tell ftrace what the tracer
wants to trace or doesn't want to trace. There's two of them: one
that tells us what to trace, the other tells us what not to trace.
If the first one is empty, it means all functions should be traced,
otherwise only the ones that are listed should be. The second hash table
tells us what not to trace, and if it is empty, all functions may be
traced, and if there's any listed, then those should not be traced
even if they exist in the first hash table.
It took a bit of massaging, but now these hashes can be used to
keep track of what has a trampoline and what does not, and allows
the ftrace accounting to work. Now we can trace all functions when using
the function graph trampoline, and avoid needing to create any special
descriptors to hold all the functions that are being traced.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUMwp6AAoJEKQekfcNnQGuIoAIAIsqvTYAnULyzKKCweEZYUfb
XJzz6cN5FPGSXkoeda1ZvnfOlHjFRrWNXzXMB0jZYR2hU++pe3xjtghaNzvbRcyV
wlwDUTsnz235OcOuFEspIwBamhtah96Coiwf/2z/2q6srXlHd/1TrqXB+Fpj1tEK
BkAViGDUEdq/eLZX7nGen36cTb5gpNqV9NjY1CVAK6bSkU/xXk/ArqFy1qy0MPnc
z/9bXdIf+Z6VnG/IzwRc2rwiMFuD1+lpjLuHEqagoHp1D4teCjWPSJl1EKCVAS40
GaCOTUZi92zIVgx8Bb28TglSla9MN65CO3E8dw6hlXUIsNz1p0eavpctnC6ac/Y=
=vDpP
-----END PGP SIGNATURE-----
Merge tag 'trace-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This set has a few minor updates, but the big change is the redesign
of the trampoline logic.
The trampoline logic of 3.17 required a descriptor for every function
that is registered to be traced and uses a trampoline. Currently,
only the function graph tracer uses a trampoline, but if you were to
trace all 32,000 (give or take a few thousand) functions with the
function graph tracer, it would create 32,000 descriptors to let us
know that there's a trampoline associated with it. This takes up a
bit of memory when there's a better way to do it.
The redesign now reuses the ftrace_ops' (what registers the function
graph tracer) hash tables. The hash tables tell ftrace what the
tracer wants to trace or doesn't want to trace. There's two of them:
one that tells us what to trace, the other tells us what not to trace.
If the first one is empty, it means all functions should be traced,
otherwise only the ones that are listed should be. The second hash
table tells us what not to trace, and if it is empty, all functions
may be traced, and if there's any listed, then those should not be
traced even if they exist in the first hash table.
It took a bit of massaging, but now these hashes can be used to keep
track of what has a trampoline and what does not, and allows the
ftrace accounting to work. Now we can trace all functions when using
the function graph trampoline, and avoid needing to create any special
descriptors to hold all the functions that are being traced"
* tag 'trace-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Only disable ftrace_enabled to test buffer in selftest
ftrace: Add sanity check when unregistering last ftrace_ops
kernel: trace_syscalls: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
tracing: generate RCU warnings even when tracepoints are disabled
ftrace: Replace tramp_hash with old_*_hash to save space
ftrace: Annotate the ops operation on update
ftrace: Grab any ops for a rec for enabled_functions output
ftrace: Remove freeing of old_hash from ftrace_hash_move()
ftrace: Set callback to ftrace_stub when no ops are registered
ftrace: Add helper function ftrace_ops_get_func()
ftrace: Add separate function for non recursive callbacks
Peter's new debugging tool triggers when tasks exit with !TASK_RUNNING.
The code in trace_wakeup_test_thread() also has a single schedule() call
that should be encompassed by a loop.
This cleans up the code a little to make it a bit more robust and
also makes the return exit properly with TASK_RUNNING.
Link: http://lkml.kernel.org/p/20141008135216.76142204@gandalf.local.home
Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infreadead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The pending nested sleep debugging triggered on the potential stale
TASK_INTERRUPTIBLE in this code.
While there, fix the loop such that we won't revert to a while(1)
yield() 'spin' loop if we ever get a spurious wakeup.
And fix the actual issue by properly terminating the 'wait' loop by
setting TASK_RUNNING.
Link: http://lkml.kernel.org/p/20141008165110.GA14547@worktop.programming.kicks-ass.net
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 651e22f270 "ring-buffer: Always reset iterator to reader page"
fixed one bug but in the process caused another one. The reset is to
update the header page, but that fix also changed the way the cached
reads were updated. The cache reads are used to test if an iterator
needs to be updated or not.
A ring buffer iterator, when created, disables writes to the ring buffer
but does not stop other readers or consuming reads from happening.
Although all readers are synchronized via a lock, they are only
synchronized when in the ring buffer functions. Those functions may
be called by any number of readers. The iterator continues down when
its not interrupted by a consuming reader. If a consuming read
occurs, the iterator starts from the beginning of the buffer.
The way the iterator sees that a consuming read has happened since
its last read is by checking the reader "cache". The cache holds the
last counts of the read and the reader page itself.
Commit 651e22f270 changed what was saved by the cache_read when
the rb_iter_reset() occurred, making the iterator never match the cache.
Then if the iterator calls rb_iter_reset(), it will go into an
infinite loop by checking if the cache doesn't match, doing the reset
and retrying, just to see that the cache still doesn't match! Which
should never happen as the reset is suppose to set the cache to the
current value and there's locks that keep a consuming reader from
having access to the data.
Fixes: 651e22f270 "ring-buffer: Always reset iterator to reader page"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Tasks get their end of stack set to STACK_END_MAGIC with the
aim to catch stack overruns. Currently this feature does not
apply to init_task. This patch removes this restriction.
Note that a similar patch was posted by Prarit Bhargava
some time ago but was never merged:
http://marc.info/?l=linux-kernel&m=127144305403241&w=2
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ftrace_enabled variable is set to zero in the self tests to keep
delayed functions from being traced and messing with the checks. This
only needs to be done when the checks are being performed, otherwise,
if ftrace_enabled is off when calls back to the utility that is being
tested, it can cause errors to happen and the tests can fail with
false positives.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the last ftrace_ops is unregistered, all the function records should
have a zeroed flags value. Make sure that is the case when the last ftrace_ops
is unregistered.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The uses of "rcu_assign_pointer()" are NULLing out the pointers.
According to RCU_INIT_POINTER()'s block comment:
"1. This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.
The following Coccinelle semantic patch was used:
@@
@@
- rcu_assign_pointer
+ RCU_INIT_POINTER
(..., NULL)
Link: http://lkml.kernel.org/p/20140822142822.GA32391@ada
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allowing function callbacks to declare their own trampolines requires
that each ftrace_ops that has a trampoline must have some sort of
accounting that keeps track of which ops has a trampoline attached
to a record.
The easy way to solve this was to add a "tramp_hash" that created a
hash entry for every function that a ops uses with a trampoline.
But since we can have literally tens of thousands of functions being
traced, that means we need tens of thousands of descriptors to map
the ops to the function in the hash. This is quite expensive and
can cause enabling and disabling the function graph tracer to take
some time to start and stop. It can take up to several seconds to
disable or enable all functions in the function graph tracer for this
reason.
The better approach albeit more complex, is to keep track of how ops
are being enabled and disabled, and use that along with the counting
of the number of ops attached to records, to determive what ops has
a trampoline attached to a record at enabling and disabling of
tracing.
To do this, the tramp_hash has been replaced with an old_filter_hash
and old_notrace_hash, which get the copy of the ops filter_hash and
notrace_hash respectively. The old hashes is kept until the ops has
been modified or removed and the old hashes are used with the logic
of the accounting to determine the ops that have the trampoline of
a record. The reason this has less of a footprint is due to the trick
that an "empty" hash in the filter_hash means "all functions" and
an empty hash in the notrace hash means "no functions" in the hash.
This is much more efficienct, doesn't have the delay, and takes up
much less memory, as we do not need to map all the functions but
just figure out which functions are mapped at the time it is
enabled or disabled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add three new flags for ftrace_ops:
FTRACE_OPS_FL_ADDING
FTRACE_OPS_FL_REMOVING
FTRACE_OPS_FL_MODIFYING
These will be set for the ftrace_ops when they are first added
to the function tracing, being removed from function tracing
or just having their functions changed from function tracing,
respectively.
This will be needed to remove the tramp_hash, which can grow quite
big. The tramp_hash is used to note what functions a ftrace_ops
is using a trampoline for. Denoting which ftrace_ops is being
modified, will allow us to use the ftrace_ops hashes themselves,
which are much smaller as they have a global flag to denote if
a ftrace_ops is tracing all functions, as well as a notrace hash
if the ftrace_ops is tracing all but a few. The tramp_hash just
creates a hash item for every function, which can go into the 10s
of thousands if all functions are using the ftrace_ops trampoline.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When dumping the enabled_functions, use the first op that is
found with a trampoline to the record, as there should only be
one, as only one ops can be registered to a function that has
a trampoline.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_hash_move() currently frees the old hash that is passed to it
after replacing the pointer with the new hash. Instead of having the
function do that chore, have the caller perform the free.
This lets the ftrace_hash_move() be used a bit more freely, which
is needed for changing the way the trampoline logic is done.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The clean up that adds the helper function ftrace_ops_get_func()
caused the default function to not change when DYNAMIC_FTRACE was not
set and no ftrace_ops were registered. Although static tracing is
not very useful (not having DYNAMIC_FTRACE set), it is still supported
and we don't want to break it.
Clean up the if statement even more to specifically have the default
function call ftrace_stub when no ftrace_ops are registered. This
fixes the small bug for static tracing as well as makes the code a
bit more understandable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the helper function to what the mcount trampoline is to call
for a ftrace_ops function. This helper will be used by arch code
in the future to set up dynamic trampolines. But as this does the
same tests that are performed in choosing what function to call for
the default mcount trampoline, might as well use it to clean up
the existing code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of using the generic list function for callbacks that
are not recursive, call a new helper function from the mcount
trampoline called ftrace_ops_recur_func() that will do the recursion
checking for the callback.
This eliminates an indirection as well as will help in future code
that will use dynamically allocated trampolines.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Epoll on trace_pipe can sometimes hang in a weird case. If the ring buffer is
empty when we set waiters_pending but an event shows up exactly at that moment
we can miss being woken up by the ring buffers irq work. Since
ring_buffer_empty() is inherently racey we will sometimes think that the buffer
is not empty. So we don't get woken up and we don't think there are any events
even though there were some ready when we added the watch, which makes us hang.
This patch fixes this by making sure that we are actually on the wait list
before we set waiters_pending, and add a memory barrier to make sure
ring_buffer_empty() is going to be correct.
Link: http://lkml.kernel.org/p/1408989581-23727-1-git-send-email-jbacik@fb.com
Cc: stable@vger.kernel.org # 3.10+
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In __ftrace_replace_code(), when converting the call to a nop in a function
it needs to compare against the "curr" (current) value of the ftrace ops, and
not the "new" one. It currently does not affect x86 which is the only arch
to do the trampolines with function graph tracer, but when other archs that do
depend on this code implement the function graph trampoline, it can crash.
Here's an example when ARM uses the trampolines (in the future):
------------[ cut here ]------------
WARNING: CPU: 0 PID: 9 at kernel/trace/ftrace.c:1716 ftrace_bug+0x17c/0x1f4()
Modules linked in: omap_rng rng_core ipv6
CPU: 0 PID: 9 Comm: migration/0 Not tainted 3.16.0-test-10959-gf0094b28f303-dirty #52
[<c02188f4>] (unwind_backtrace) from [<c021343c>] (show_stack+0x20/0x24)
[<c021343c>] (show_stack) from [<c095a674>] (dump_stack+0x78/0x94)
[<c095a674>] (dump_stack) from [<c02532a0>] (warn_slowpath_common+0x7c/0x9c)
[<c02532a0>] (warn_slowpath_common) from [<c02532ec>] (warn_slowpath_null+0x2c/0x34)
[<c02532ec>] (warn_slowpath_null) from [<c02cbac4>] (ftrace_bug+0x17c/0x1f4)
[<c02cbac4>] (ftrace_bug) from [<c02cc44c>] (ftrace_replace_code+0x80/0x9c)
[<c02cc44c>] (ftrace_replace_code) from [<c02cc658>] (ftrace_modify_all_code+0xb8/0x164)
[<c02cc658>] (ftrace_modify_all_code) from [<c02cc718>] (__ftrace_modify_code+0x14/0x1c)
[<c02cc718>] (__ftrace_modify_code) from [<c02c7244>] (multi_cpu_stop+0xf4/0x134)
[<c02c7244>] (multi_cpu_stop) from [<c02c6e90>] (cpu_stopper_thread+0x54/0x130)
[<c02c6e90>] (cpu_stopper_thread) from [<c0271cd4>] (smpboot_thread_fn+0x1ac/0x1bc)
[<c0271cd4>] (smpboot_thread_fn) from [<c026ddf0>] (kthread+0xe0/0xfc)
[<c026ddf0>] (kthread) from [<c020f318>] (ret_from_fork+0x14/0x20)
---[ end trace dc9ce72c5b617d8f ]---
[ 65.047264] ftrace failed to modify [<c0208580>] asm_do_IRQ+0x10/0x1c
[ 65.054070] actual: 85:1b:00:eb
Fixes: 7413af1fb7 "ftrace: Make get_ftrace_addr() and get_ftrace_addr_old() global"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The latest rewrite of ftrace removed the separate ftrace_ops of
the function tracer and the function graph tracer and had them
share the same ftrace_ops. This simplified the accounting by removing
the multiple layers of functions called, where the global_ops func
would call a special list that would iterate over the other ops that
were registered within it (like function and function graph), which
itself was registered to the ftrace ops list of all functions
currently active. If that sounds confusing, the code that implemented
it was also confusing and its removal is a good thing.
The problem with this change was that it assumed that the function
and function graph tracer can never be used at the same time.
This is mostly true, but there is an exception. That is when the
function profiler uses the function graph tracer to profile.
The function profiler can be activated the same time as the function
tracer, and this breaks the assumption and the result is that ftrace
will crash (it detects the error and shuts itself down, it does not
cause a kernel oops).
To solve this issue, a previous change allowed the hash tables
for the functions traced by a ftrace_ops to be a pointer and let
multiple ftrace_ops share the same hash. This allows the function
and function_graph tracer to have separate ftrace_ops, but still
share the hash, which is what is done.
Now the function and function graph tracers have separate ftrace_ops
again, and the function tracer can be run while the function_profile
is active.
Cc: stable@vger.kernel.org # 3.16 (apply after 3.17-rc4 is out)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that a ftrace_hash can be shared by multiple ftrace_ops, they can dec
the rec->flags by more than once (one per those that share the ftrace_hash).
This means that the tramp_hash may not have a hash item when it was added.
For example, if two ftrace_ops share a hash for a ftrace record, and the
first ops has a trampoline, when it adds itself it will set the rec->flags
TRAMP flag and increments its nr_trampolines counter. When the second ops
is added, it must clear that tramp flag but also decrement the other ops
that shares its hash. As the update to the function callbacks has not yet
been performed, the other ops will not have the tramp hash set yet and it
can not be used to know to decrement its nr_trampolines.
Luckily, the tramp_hash does not need to be used. As the ftrace_mutex is
held, a ops with a trampoline to a record during an update of another ops
that shares the record will have its func_hash pointing to it. Since a
trampoline can only be set for a record if only one ops is attached to it,
we can just check if the record has a trampoline (the FTRACE_FL_TRAMP flag
is set) and then find the ops that has this record in its hashes.
Also added some output to help debug when things go wrong.
Cc: stable@vger.kernel.org # 3.16+ (apply after 3.17-rc4 is out)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When updating what an ftrace_ops traces, if it is registered (that is,
actively tracing), and that ftrace_ops uses the shared global_ops
local_hash, then we need to update all tracers that are active and
also share the global_ops' ftrace_hash_ops.
Cc: stable@vger.kernel.org # 3.16 (apply after 3.17-rc4 is out)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the top level debug file system function tracer shares its
ftrace_ops with the function graph tracer. This was thought to be fine
because the tracers are not used together, as one can only enable
function or function_graph tracer in the current_tracer file.
But that assumption proved to be incorrect. The function profiler
can use the function graph tracer when function tracing is enabled.
Since all function graph users uses the function tracing ftrace_ops
this causes a conflict and when a user enables both function profiling
as well as the function tracer it will crash ftrace and disable it.
The quick solution so far is to move them as separate ftrace_ops like
it was earlier. The problem though is to synchronize the functions that
are traced because both function and function_graph tracer are limited
by the selections made in the set_ftrace_filter and set_ftrace_notrace
files.
To handle this, a new structure is made called ftrace_ops_hash. This
structure will now hold the filter_hash and notrace_hash, and the
ftrace_ops will point to this structure. That will allow two ftrace_ops
to share the same hashes.
Since most ftrace_ops do not share the hashes, and to keep allocation
simple, the ftrace_ops structure will include both a pointer to the
ftrace_ops_hash called func_hash, as well as the structure itself,
called local_hash. When the ops are registered, the func_hash pointer
will be initialized to point to the local_hash within the ftrace_ops
structure. Some of the ftrace internal ftrace_ops will be initialized
statically. This will allow for the function and function_graph tracer
to have separate ops but still share the same hash tables that determine
what functions they trace.
Cc: stable@vger.kernel.org # 3.16 (apply after 3.17-rc4 is out)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
rarely ever hit, and requires the user to do something that users rarely
do. It took a few special test cases to even trigger this bug,
and one of them was just one test in the process of finishing up as another
one started.
Both bugs have to do with the ring buffer iterator rb_iter_peek(), but one
is more indirect than the other.
The fist bug fix is simply an increase in the safety net loop counter.
The counter makes sure that the rb_iter_peek() only iterates the number
of times we expect it can, and no more. Well, there was one way it could
iterate one more than we expected, and that caused the ring buffer
to shutdown with a nasty warning. The fix was simply to up that counter by
one.
The other bug has to be with rb_iter_reset() (called by rb_iter_peek()).
This happens when a user reads both the trace_pipe and trace files.
The trace_pipe is a consuming read and does not use the ring buffer
iterator, but the trace file is not a consuming read and does use the
ring buffer iterator. When the trace file is being read, if it detects
that a consuming read occurred, it resets the iterator and starts over.
But the reset code that does this (rb_iter_reset()), checks if the
reader_page is linked to the ring buffer or not, and will look into
the ring buffer itself if it is not. This is wrong, as it should always
try to read the reader page first. Not to mention, the code that looked
into the ring buffer did it wrong, and used the header_page "read" offset
to start reading on that page. That offset is bogus for pages in the
writable ring buffer, and was corrupting the iterator, and it would start
returning bogus events.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT44tRAAoJEKQekfcNnQGuMVIH/3evbjKT+w6Kon4S0LfLSejl
YDsXYkeO/lGiO3MnUveqq1jfw2+yHtyBVUipvfG0A61urMUhyUvjveph8Ec2cQ4Q
qHl0J28vDfF5tpKiYzygRN01wHD6GXYh+XZSChkA4ItzzD8K51lsZT1Yd+I2pTy2
DgH01EEEYiwYJcih+T4vlbKqYju6pwgxqKNCTv0RdVXUPya/tG9X2Nf8VGQTbmKS
FIO73qObYR+P9iXGIuPfyOxk2EvBiWS15WownZmfhZZxOIKx9IrDYwTsiV1+AJw+
sJFER1ulobYGDtGDa9yyrNJQr6wgbrfCDELnNKmdLUTlSwgZjLXpE2HEmlelY/s=
=5mQl
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull trace file read iterator fixes from Steven Rostedt:
"This contains a fix for two long standing bugs. Both of which are
rarely ever hit, and requires the user to do something that users
rarely do. It took a few special test cases to even trigger this bug,
and one of them was just one test in the process of finishing up as
another one started.
Both bugs have to do with the ring buffer iterator rb_iter_peek(), but
one is more indirect than the other.
The fist bug fix is simply an increase in the safety net loop counter.
The counter makes sure that the rb_iter_peek() only iterates the
number of times we expect it can, and no more. Well, there was one
way it could iterate one more than we expected, and that caused the
ring buffer to shutdown with a nasty warning. The fix was simply to
up that counter by one.
The other bug has to be with rb_iter_reset() (called by
rb_iter_peek()). This happens when a user reads both the trace_pipe
and trace files. The trace_pipe is a consuming read and does not use
the ring buffer iterator, but the trace file is not a consuming read
and does use the ring buffer iterator. When the trace file is being
read, if it detects that a consuming read occurred, it resets the
iterator and starts over. But the reset code that does this
(rb_iter_reset()), checks if the reader_page is linked to the ring
buffer or not, and will look into the ring buffer itself if it is not.
This is wrong, as it should always try to read the reader page first.
Not to mention, the code that looked into the ring buffer did it
wrong, and used the header_page "read" offset to start reading on that
page. That offset is bogus for pages in the writable ring buffer, and
was corrupting the iterator, and it would start returning bogus
events"
* tag 'trace-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Always reset iterator to reader page
ring-buffer: Up rb_iter_peek() loop count to 3
When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with a empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).
When reading the "trace" file, it is a non consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.
Although the ring buffer disabled writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.
My tests would sometimes trigger this bug on my i386 box:
WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
Call Trace:
[<c18796b3>] dump_stack+0x4b/0x75
[<c103a0e3>] warn_slowpath_common+0x7e/0x95
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c103a185>] warn_slowpath_fmt+0x33/0x35
[<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
[<c10bed04>] trace_find_cmdline+0x40/0x64
[<c10c3c16>] trace_print_context+0x27/0xec
[<c10c4360>] ? trace_seq_printf+0x37/0x5b
[<c10c0b15>] print_trace_line+0x319/0x39b
[<c10ba3fb>] ? ring_buffer_read+0x47/0x50
[<c10c13b1>] s_show+0x192/0x1ab
[<c10bfd9a>] ? s_next+0x5a/0x7c
[<c112e76e>] seq_read+0x267/0x34c
[<c1115a25>] vfs_read+0x8c/0xef
[<c112e507>] ? seq_lseek+0x154/0x154
[<c1115ba2>] SyS_read+0x54/0x7f
[<c188488e>] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####
Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.
After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.
After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().
As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator, must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.
I originally wrote a patch to have it set the iter->head to 0 instead
of iter->head_page->read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid. The list_empty(reader_page->list) just means that it was
successful in swapping out. But the reader_page may still have data.
There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.
Anyway, the correct answer to this bug is to always use the reader page
an not reset the iterator to inside the writable ring buffer.
Cc: stable@vger.kernel.org # 2.6.28+
Fixes: d769041f86 "ring_buffer: implement new locking"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After writting a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:
WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
Modules linked in: ipt_MASQUERADE sunrpc [...]
CPU: 1 PID: 5281 Comm: grep Tainted: G W 3.16.0-rc3-test+ #143
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
Call Trace:
[<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
[<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
[<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
[<ffffffff810c74a3>] ? s_start+0xd7/0x17b
[<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
[<ffffffff8114cf94>] ? seq_read+0x148/0x361
[<ffffffff81132d98>] ? vfs_read+0x93/0xf1
[<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
[<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!
rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
only deals with the reader page (it's for consuming reads). The
rb_iter_peek() is for traversing the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.
That is, we have this:
1. iter->head > iter->head_page->page->commit
(rb_inc_iter() which moves the iter to the next page)
try again
2. event = rb_iter_head_event()
event->type_len == RINGBUF_TYPE_TIME_EXTEND
rb_advance_iter()
try again
3. read the event.
But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.
Up the counter to 3.
Cc: stable@vger.kernel.org # 2.6.37+
Fixes: 69d1b839f7 "ring-buffer: Bind time extend and data events together"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull timer and time updates from Thomas Gleixner:
"A rather large update of timers, timekeeping & co
- Core timekeeping code is year-2038 safe now for 32bit machines.
Now we just need to fix all in kernel users and the gazillion of
user space interfaces which rely on timespec/timeval :)
- Better cache layout for the timekeeping internal data structures.
- Proper nanosecond based interfaces for in kernel users.
- Tree wide cleanup of code which wants nanoseconds but does hoops
and loops to convert back and forth from timespecs. Some of it
definitely belongs into the ugly code museum.
- Consolidation of the timekeeping interface zoo.
- A fast NMI safe accessor to clock monotonic for tracing. This is a
long standing request to support correlated user/kernel space
traces. With proper NTP frequency correction it's also suitable
for correlation of traces accross separate machines.
- Checkpoint/restart support for timerfd.
- A few NOHZ[_FULL] improvements in the [hr]timer code.
- Code move from kernel to kernel/time of all time* related code.
- New clocksource/event drivers from the ARM universe. I'm really
impressed that despite an architected timer in the newer chips SoC
manufacturers insist on inventing new and differently broken SoC
specific timers.
[ Ed. "Impressed"? I don't think that word means what you think it means ]
- Another round of code move from arch to drivers. Looks like most
of the legacy mess in ARM regarding timers is sorted out except for
a few obnoxious strongholds.
- The usual updates and fixlets all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
timekeeping: Fixup typo in update_vsyscall_old definition
clocksource: document some basic timekeeping concepts
timekeeping: Use cached ntp_tick_length when accumulating error
timekeeping: Rework frequency adjustments to work better w/ nohz
timekeeping: Minor fixup for timespec64->timespec assignment
ftrace: Provide trace clocks monotonic
timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
seqcount: Add raw_write_seqcount_latch()
seqcount: Provide raw_read_seqcount()
timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
timekeeping: Create struct tk_read_base and use it in struct timekeeper
timekeeping: Restructure the timekeeper some more
clocksource: Get rid of cycle_last
clocksource: Move cycle_last validation to core code
clocksource: Make delta calculation a function
wireless: ath9k: Get rid of timespec conversions
drm: vmwgfx: Use nsec based interfaces
drm: i915: Use nsec based interfaces
timekeeping: Provide ktime_get_raw()
hangcheck-timer: Use ktime_get_ns()
...
Pull perf changes from Ingo Molnar:
"Kernel side changes:
- Consolidate the PMU interrupt-disabled code amongst architectures
(Vince Weaver)
- misc fixes
Tooling changes (new features, user visible changes):
- Add support for pagefault tracing in 'trace', please see multiple
examples in the changeset messages (Stanislav Fomichev).
- Add pagefault statistics in 'trace' (Stanislav Fomichev)
- Add header for columns in 'top' and 'report' TUI browsers (Jiri
Olsa)
- Add pagefault statistics in 'trace' (Stanislav Fomichev)
- Add IO mode into timechart command (Stanislav Fomichev)
- Fallback to syscalls:* when raw_syscalls:* is not available in the
perl and python perf scripts. (Daniel Bristot de Oliveira)
- Add --repeat global option to 'perf bench' to be used in benchmarks
such as the existing 'futex' one, that was modified to use it
instead of a local option. (Davidlohr Bueso)
- Fix fd -> pathname resolution in 'trace', be it using /proc or a
vfs_getname probe point. (Arnaldo Carvalho de Melo)
- Add suggestion of how to set perf_event_paranoid sysctl, to help
non-root users trying tools like 'trace' to get a working
environment. (Arnaldo Carvalho de Melo)
- Updates from trace-cmd for traceevent plugin_kvm plus args cleanup
(Steven Rostedt, Jan Kiszka)
- Support S/390 in 'perf kvm stat' (Alexander Yarygin)
Tooling infrastructure changes:
- Allow reserving a row for header purposes in the hists browser
(Arnaldo Carvalho de Melo)
- Various fixes and prep work related to supporting Intel PT (Adrian
Hunter)
- Introduce multiple debug variables control (Jiri Olsa)
- Add callchain and additional sample information for python scripts
(Joseph Schuchart)
- More prep work to support Intel PT: (Adrian Hunter)
- Polishing 'script' BTS output
- 'inject' can specify --kallsym
- VDSO is per machine, not a global var
- Expose data addr lookup functions previously private to 'script'
- Large mmap fixes in events processing
- Include standard stringify macros in power pc code (Sukadev
Bhattiprolu)
Tooling cleanups:
- Convert open coded equivalents to asprintf() (Andy Shevchenko)
- Remove needless reassignments in 'trace' (Arnaldo Carvalho de Melo)
- Cache the is_exit syscall test in 'trace) (Arnaldo Carvalho de
Melo)
- No need to reimplement err() in 'perf bench sched-messaging', drop
barf(). (Davidlohr Bueso).
- Remove ev_name argument from perf_evsel__hists_browse, can be
obtained from the other parameters. (Jiri Olsa)
Tooling fixes:
- Fix memory leak in the 'sched-messaging' perf bench test.
(Davidlohr Bueso)
- The -o and -n 'perf bench mem' options are mutually exclusive, emit
error when both are specified. (Davidlohr Bueso)
- Fix scrollbar refresh row index in the ui browser, problem exposed
now that headers will be added and will be allowed to be switched
on/off. (Jiri Olsa)
- Handle the num array type in python properly (Sebastian Andrzej
Siewior)
- Fix wrong condition for allocation failure (Jiri Olsa)
- Adjust callchain based on DWARF debug info on powerpc (Sukadev
Bhattiprolu)
- Fix a risk for doing free on uninitialized pointer in traceevent
lib (Rickard Strandqvist)
- Update attr test with PERF_FLAG_FD_CLOEXEC flag (Jiri Olsa)
- Enable close-on-exec flag on perf file descriptor (Yann Droneaud)
- Fix build on gcc 4.4.7 (Arnaldo Carvalho de Melo)
- Event ordering fixes (Jiri Olsa)"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (123 commits)
Revert "perf tools: Fix jump label always changing during tracing"
perf tools: Fix perf usage string leftover
perf: Check permission only for parent tracepoint event
perf record: Store PERF_RECORD_FINISHED_ROUND only for nonempty rounds
perf record: Always force PERF_RECORD_FINISHED_ROUND event
perf inject: Add --kallsyms parameter
perf tools: Expose 'addr' functions so they can be reused
perf session: Fix accounting of ordered samples queue
perf powerpc: Include util/util.h and remove stringify macros
perf tools: Fix build on gcc 4.4.7
perf tools: Add thread parameter to vdso__dso_findnew()
perf tools: Add dso__type()
perf tools: Separate the VDSO map name from the VDSO dso name
perf tools: Add vdso__new()
perf machine: Fix the lifetime of the VDSO temporary file
perf tools: Group VDSO global variables into a structure
perf session: Add ability to skip 4GiB or more
perf session: Add ability to 'skip' a non-piped event stream
perf tools: Pass machine to vdso__dso_findnew()
perf tools: Add dso__data_size()
...
As he found some small bugs that went into 3.16, and these changes were
based on that, I had to apply his changes to a separate branch than
my main development branch.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT3533AAoJEKQekfcNnQGuaNkIALM9uf6gvIXgvyAlhK1qfjJL
0pxAlgppukA7hZT9gcubGuStPxKnwtVSvd/GwGI/mc7Ls57vOvdA61TEHN48l2MW
SUev1Jeu1Nx3Wh4B2ivRq+dJadrwU9NwG507yvuXn2L8Gqzwal133dg3AHnsTYgz
yu7oe8baxMniFFYlAmSELFp2cA3kli/WrRuQB7weudbT12jZlpOMqz7OUN9o7dfo
E4XfwkkDUuo9EiKD1QwVMk2cIkCaHFI8lJDYJbrM3L5XJKfnKzZNRMWfwDG2L7hS
DGjMl6T0SjU+ZIoTidHdbyF/Q+I0o9kKTnUrErGu88+2Z1gfEKN5O8zovNMbPUM=
=dT/x
-----END PGP SIGNATURE-----
Merge tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing filter cleanups from Steven Rostedt:
"Oleg Nesterov did several clean ups with the tracing filter code. As
he found some small bugs that went into 3.16, and these changes were
based on that, I had to apply his changes to a separate branch than my
main development branch.
This was based on work that was already pulled into 3.16, and is a
separate pull request to keep from having local merges in my pull
request"
* tag 'trace-3.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Kill "filter_string" arg of replace_preds()
tracing: Change apply_subsystem_event_filter() paths to check file->system == dir
tracing: Kill ftrace_event_call->files
tracing/uprobes: Kill the dead TRACE_EVENT_FL_USE_CALL_FILTER logic
tracing: Kill call_filter_disable()
tracing: Kill destroy_call_preds()
tracing: Kill destroy_preds() and destroy_file_preds()
to the ftrace function callback infrastructure. It's introducing a
way to allow different functions to call directly different trampolines
instead of all calling the same "mcount" one.
The only user of this for now is the function graph tracer, which always
had a different trampoline, but the function tracer trampoline was called
and did basically nothing, and then the function graph tracer trampoline
was called. The difference now, is that the function graph tracer
trampoline can be called directly if a function is only being traced by
the function graph trampoline. If function tracing is also happening on
the same function, the old way is still done.
The accounting for this takes up more memory when function graph tracing
is activated, as it needs to keep track of which functions it uses.
I have a new way that wont take as much memory, but it's not ready yet
for this merge window, and will have to wait for the next one.
Another big change was the removal of the ftrace_start/stop() calls that
were used by the suspend/resume code that stopped function tracing when
entering into suspend and resume paths. The stop of ftrace was done
because there was some function that would crash the system if one called
smp_processor_id()! The stop/start was a big hammer to solve the issue
at the time, which was when ftrace was first introduced into Linux.
Now ftrace has better infrastructure to debug such issues, and I found
the problem function and labeled it with "notrace" and function tracing
can now safely be activated all the way down into the guts of suspend
and resume.
Other changes include clean ups of uprobe code.
Clean up of the trace_seq() code.
And other various small fixes and clean ups to ftrace and tracing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJT35zXAAoJEKQekfcNnQGuOz0H/38zqM0nLFhrgvz3EPk2UOjn
xqpX8qyb2V7TJZL+IqeXU2a5cQZl5ba0D4WtBGpxbTae3CJYiuQ87iKUNFoH0om5
FDpn80igb368k8V3qRdRsziKVCCf0XBd/NkHJXc0ZkfXGyzB2Ga4bBxALxp2gj9y
bnO+vKo6+tWYKG4hyQb4P3LRXUrK8/LWEsPr39cH2QH1Rdj69Lx9CgrCdUVJmwcb
Bj8hEiLXL/RYCFNn79A3wNTUvW0rG/AOIf4SLqXtasSRZ0ToaU0ZyDnrNv+0Ol47
rX8tSk+LfXchL9hpIvjCf1vlAYq3pO02favteR/jip3lx/dTjEDE4RJ9qtJzZ4Q=
=fwQY
-----END PGP SIGNATURE-----
Merge tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This pull request has a lot of work done. The main thing is the
changes to the ftrace function callback infrastructure. It's
introducing a way to allow different functions to call directly
different trampolines instead of all calling the same "mcount" one.
The only user of this for now is the function graph tracer, which
always had a different trampoline, but the function tracer trampoline
was called and did basically nothing, and then the function graph
tracer trampoline was called. The difference now, is that the
function graph tracer trampoline can be called directly if a function
is only being traced by the function graph trampoline. If function
tracing is also happening on the same function, the old way is still
done.
The accounting for this takes up more memory when function graph
tracing is activated, as it needs to keep track of which functions it
uses. I have a new way that wont take as much memory, but it's not
ready yet for this merge window, and will have to wait for the next
one.
Another big change was the removal of the ftrace_start/stop() calls
that were used by the suspend/resume code that stopped function
tracing when entering into suspend and resume paths. The stop of
ftrace was done because there was some function that would crash the
system if one called smp_processor_id()! The stop/start was a big
hammer to solve the issue at the time, which was when ftrace was first
introduced into Linux. Now ftrace has better infrastructure to debug
such issues, and I found the problem function and labeled it with
"notrace" and function tracing can now safely be activated all the way
down into the guts of suspend and resume
Other changes include clean ups of uprobe code, clean up of the
trace_seq() code, and other various small fixes and clean ups to
ftrace and tracing"
* tag 'trace-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (57 commits)
ftrace: Add warning if tramp hash does not match nr_trampolines
ftrace: Fix trampoline hash update check on rec->flags
ring-buffer: Use rb_page_size() instead of open coded head_page size
ftrace: Rename ftrace_ops field from trampolines to nr_trampolines
tracing: Convert local function_graph functions to static
ftrace: Do not copy old hash when resetting
tracing: let user specify tracing_thresh after selecting function_graph
ring-buffer: Always run per-cpu ring buffer resize with schedule_work_on()
tracing: Remove function_trace_stop and HAVE_FUNCTION_TRACE_MCOUNT_TEST
s390/ftrace: remove check of obsolete variable function_trace_stop
arm64, ftrace: Remove check of obsolete variable function_trace_stop
Blackfin: ftrace: Remove check of obsolete variable function_trace_stop
metag: ftrace: Remove check of obsolete variable function_trace_stop
microblaze: ftrace: Remove check of obsolete variable function_trace_stop
MIPS: ftrace: Remove check of obsolete variable function_trace_stop
parisc: ftrace: Remove check of obsolete variable function_trace_stop
sh: ftrace: Remove check of obsolete variable function_trace_stop
sparc64,ftrace: Remove check of obsolete variable function_trace_stop
tile: ftrace: Remove check of obsolete variable function_trace_stop
ftrace: x86: Remove check of obsolete variable function_trace_stop
...
There's no need to check cloned event's permission once the
parent was already checked.
Also the code is checking 'current' process permissions, which
is not owner process for cloned events, thus could end up with
wrong permission check result.
Reported-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Tested-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1405079782-8139-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After adding all the records to the tramp_hash, add a check that makes
sure that the number of records added matches the number of records
expected to match and do a WARN_ON and disable ftrace if they do
not match.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the loop of ftrace_save_ops_tramp_hash(), it adds all the recs
to the ops hash if the rec has only one callback attached and the
ops is connected to the rec. It gives a nasty warning and shuts down
ftrace if the rec doesn't have a trampoline set for it. But this
can happen with the following scenario:
# cd /sys/kernel/debug/tracing
# echo schedule do_IRQ > set_ftrace_filter
# mkdir instances/foo
# echo schedule > instances/foo/set_ftrace_filter
# echo function_graph > current_function
# echo function > instances/foo/current_function
# echo nop > instances/foo/current_function
The above would then trigger the following warning and disable
ftrace:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 3145 at kernel/trace/ftrace.c:2212 ftrace_run_update_code+0xe4/0x15b()
Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ip [...]
CPU: 1 PID: 3145 Comm: bash Not tainted 3.16.0-rc3-test+ #136
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000000 ffffffff81808a88 ffffffff81502130 0000000000000000
ffffffff81040ca1 ffff880077c08000 ffffffff810bd286 0000000000000001
ffffffff81a56830 ffff88007a041be0 ffff88007a872d60 00000000000001be
Call Trace:
[<ffffffff81502130>] ? dump_stack+0x4a/0x75
[<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
[<ffffffff810bd286>] ? ftrace_run_update_code+0xe4/0x15b
[<ffffffff810bd286>] ? ftrace_run_update_code+0xe4/0x15b
[<ffffffff810bda1a>] ? ftrace_shutdown+0x11c/0x16b
[<ffffffff810bda87>] ? unregister_ftrace_function+0x1e/0x38
[<ffffffff810cc7e1>] ? function_trace_reset+0x1a/0x28
[<ffffffff810c924f>] ? tracing_set_tracer+0xc1/0x276
[<ffffffff810c9477>] ? tracing_set_trace_write+0x73/0x91
[<ffffffff81132383>] ? __sb_start_write+0x9a/0xcc
[<ffffffff8120478f>] ? security_file_permission+0x1b/0x31
[<ffffffff81130e49>] ? vfs_write+0xac/0x11c
[<ffffffff8113115d>] ? SyS_write+0x60/0x8e
[<ffffffff81508112>] ? system_call_fastpath+0x16/0x1b
---[ end trace 938c4415cbc7dc96 ]---
------------[ cut here ]------------
Link: http://lkml.kernel.org/r/20140723120805.GB21376@redhat.com
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a helper function to get a ring buffer page size (the number
of bytes of data recorded on the page), called rb_page_size().
Use that instead of open coding it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Expose the new NMI safe accessor to clock monotonic to the tracer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Having two fields within the same struct that is off by one character
can be confusing and error prone. Rename the counter "trampolines"
to "nr_trampolines" to explicitly show it is a counter and not to
be confused by the "trampoline" field.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "uptime" trace clock added in:
commit 8aacf017b0
tracing: Add "uptime" trace clock that uses jiffies
has wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
(u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds. An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).
Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).
Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com
Cc: stable@vger.kernel.org # 3.10+
Fixes: 8aacf017b0 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Do not waste time copying the old hash if the hash is going to be
reset. Just allocate a new hash and free the old one, as that is
the same result as copying te old one and then resetting it.
Link: http://lkml.kernel.org/p/1405384820-48837-1-git-send-email-wangnan0@huawei.com
Signed-off-by: Wang Nan <wangnan0@huawei.com>
[ SDR: Removed unused ftrace_filter_reset() function ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, tracing_thresh works only if we specify it before selecting
function_graph tracer. If we do the opposite, tracing_thresh will change
it's value, but it will not be applied.
To fix it, we add update_thresh callback which is called whenever
tracing_thresh is updated and for function_graph tracer we register
handler which reinitializes tracer depending on tracing_thresh.
Link: http://lkml.kernel.org/p/20140718111727.GA3206@stfomichev-desktop.yandex.net
Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code for resizing the trace ring buffers has to run the per-cpu
resize on the CPU itself. The code was using preempt_off() and
running the code for the current CPU directly, otherwise calling
schedule_work_on().
At least on RT this could result in the following:
|BUG: sleeping function called from invalid context at kernel/rtmutex.c:673
|in_atomic(): 1, irqs_disabled(): 0, pid: 607, name: bash
|3 locks held by bash/607:
|CPU: 0 PID: 607 Comm: bash Not tainted 3.12.15-rt25+ #124
|(rt_spin_lock+0x28/0x68)
|(free_hot_cold_page+0x84/0x3b8)
|(free_buffer_page+0x14/0x20)
|(rb_update_pages+0x280/0x338)
|(ring_buffer_resize+0x32c/0x3dc)
|(free_snapshot+0x18/0x38)
|(tracing_set_tracer+0x27c/0x2ac)
probably via
|cd /sys/kernel/debug/tracing/
|echo 1 > events/enable ; sleep 2
|echo 1024 > buffer_size_kb
If we just always use schedule_work_on(), there's no need for the
preempt_off(). So do that.
Link: http://lkml.kernel.org/p/1405537633-31518-1-git-send-email-cminyard@mvista.com
Reported-by: Stanislav Meduna <stano@meduna.org>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All users of function_trace_stop and HAVE_FUNCTION_TRACE_MCOUNT_TEST have
been removed. We can safely remove them from the kernel.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
function_trace_stop is no longer used to stop function tracing.
Remove the check from __ftrace_ops_list_func().
Also, call FTRACE_WARN_ON() instead of setting function_trace_stop
if a ops has no func to call.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When function tracing is being updated function_trace_stop is set to
keep from tracing the updates. This was fine when function tracing
was done from stop machine. But it is no longer done that way and
this can cause real tracing to be missed.
Remove it.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All archs now use ftrace_graph_is_dead() to stop function graph
tracing. Remove the usage of ftrace_stop() as that is no longer
needed.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_stop() is going away as it disables parts of function tracing
that affects users that should not be affected. But ftrace_graph_stop()
is built on ftrace_stop(). Here's another example of killing all of
function tracing because something went wrong with function graph
tracing.
Instead of disabling all users of function tracing on function graph
error, disable only function graph tracing.
A new function is created called ftrace_graph_is_dead(). This is called
in strategic paths to prevent function graph from doing more harm and
allowing at least a warning to be printed before the system crashes.
NOTE: ftrace_stop() is still used until all the archs are converted over
to use ftrace_graph_is_dead(). After that, ftrace_stop() will be removed.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cosmetic, but replace_preds() doesn't need/use "char *filter_string".
Remove it to microsimplify the code.
Link: http://lkml.kernel.org/p/20140715184832.GA20519@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
filter_free_subsystem_preds(), filter_free_subsystem_filters() and
replace_system_preds() can simply check file->system->subsystem and
avoid strcmp(call->class->system).
Better yet, we can pass "struct ftrace_subsystem_dir *dir" instead of
event_subsystem and just check file->system == dir.
Thanks to Namhyung Kim who pointed out that replace_system_preds() can
be changed too.
Link: http://lkml.kernel.org/p/20140715184829.GA20516@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
alloc_trace_uprobe() sets TRACE_EVENT_FL_USE_CALL_FILTER for unknown
reason and this is simply wrong. Fortunately this has no effect because
register_uprobe_event() clears call->flags after that.
Kill both. This trace_uprobe was kzalloc'ed and we rely on this fact
anyway.
Link: http://lkml.kernel.org/p/20140715184824.GA20505@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems that the only purpose of call_filter_disable() is to
make filter_disable() less clear and symmetrical, remove it.
Link: http://lkml.kernel.org/p/20140715184821.GA20498@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove destroy_call_preds(). Its only caller, __trace_remove_event_call(),
can use free_event_filter() and nullify ->filter by hand.
Perhaps we could keep this trivial helper although imo it is pointless, but
then it should be static in trace_events.c.
Link: http://lkml.kernel.org/p/20140715184816.GA20495@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
destroy_preds() makes no sense.
The only caller, event_remove(), actually wants destroy_file_preds().
__trace_remove_event_call() does destroy_call_preds() which takes care
of call->filter.
And after the previous change we can simply remove destroy_preds() from
event_remove(), we are going to call remove_event_from_tracers() which
in turn calls remove_event_file_dir()->free_event_filter().
Link: http://lkml.kernel.org/p/20140715184813.GA20488@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently if an arch supports function graph tracing, the core code will
just assign the function graph trampoline to the function graph addr that
gets called.
But as the old method for function graph tracing always calls the function
trampoline first and that calls the function graph trampoline, some
archs may have the function graph trampoline dependent on operations that
were done in the function trampoline. This causes function graph tracer
to break on those archs.
Instead of having the default be to set the function graph ftrace_ops
to the function graph trampoline, have it instead just set it to zero
which will keep it from jumping to a trampoline that is not set up
to be jumped directly too.
Link: http://lkml.kernel.org/r/53BED155.9040607@nvidia.com
Reported-by: Tuomas Tynkkynen <ttynkkynen@nvidia.com>
Tested-by: Tuomas Tynkkynen <ttynkkynen@nvidia.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even there is immediate data available. Otherwise, the following epoll and
read sequence will eventually hang forever:
1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever
~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
ring_buffer_poll_wait() returns immediately without adding poll_table,
which has poll_table->_qproc pointing to ep_poll_callback(), to its
wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
it has to call ep_poll_callback() because it is not in its wait queue.
Hence, block forever.
Other poll implementation seems to call poll_wait() unconditionally as the very
first thing to do. For example, tcp_poll() in tcp.c.
Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
Cc: stable@vger.kernel.org # 2.6.27
Fixes: 2a2cc8f7c4 "ftrace: allow the event pipe to be polled"
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Martin Lau <kafai@fb.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing,
so add it, to be consistent with __trace_printk/__trace_bprintk.
Those functions are all called by the same function: trace_printk().
Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com
Cc: stable@vger.kernel.org # 3.11+
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Running my ftrace tests on PowerPC, it failed the test that checks
if function_graph tracer is affected by the stack tracer. It was.
Looking into this, I found that the update_function_graph_func()
must be called even if the trampoline function is not changed.
This is because archs like PowerPC do not support ftrace_ops being
passed by assembly and instead uses a helper function (what the
trampoline function points to). Since this function is not changed
even when multiple ftrace_ops are added to the code, the test that
falls out before calling update_function_graph_func() will miss that
the update must still be done.
Call update_function_graph_function() for all calls to
update_ftrace_function()
Cc: stable@vger.kernel.org # 3.3+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently trace option stacktrace is not applicable for
trace_printk with constant string argument, the reason is
in __trace_puts/__trace_bputs ftrace_trace_stack is missing.
In contrast, when using trace_printk with non constant string
argument(will call into __trace_printk/__trace_bprintk), then
trace option stacktrace is workable, this inconstant result
will confuses users a lot.
Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Disabling reading and writing to the trace file should not be able to
disable all function tracing callbacks. There's other users today
(like kprobes and perf). Reading a trace file should not stop those
from happening.
Cc: stable@vger.kernel.org # 3.0+
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When there's no entry in set_ftrace_notrace, it'll print nothing, but
it's better to print something like below like set_graph_notrace does:
#### no functions disabled ####
Link: http://lkml.kernel.org/p/1402644246-4649-1-git-send-email-namhyung@kernel.org
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When there's no entry in set_graph_notrace, it'll print below message
#### all functions enabled ####
While this is technically correct, it's better to print like below:
#### no functions disabled ####
Link: http://lkml.kernel.org/p/1402590233-22321-3-git-send-email-namhyung@kernel.org
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_graph_notrace option is for specifying notrace filter for
function graph tracer at boot time. It can be altered after boot
using set_graph_notrace file on the debugfs.
Link: http://lkml.kernel.org/p/1402590233-22321-2-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems like it's a leftover from commit 4104d326b6 ("ftrace:
Remove global function list and call function directly"). As it
isn't updated at all, checking its value is meaningless.
Let's get rid of it.
Link: http://lkml.kernel.org/p/1402584972-17824-1-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's several locations in the kernel that open code the calculation
of the next location in the trace_seq buffer. This is usually done with
p->buffer + p->len
Instead of having this open coded, supply a helper function in the
header to do it for them. This function is called trace_seq_buffer_ptr().
Link: http://lkml.kernel.org/p/20140626220129.452783019@goodmis.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_seq_reserve() has no users in the kernel, it just wastes space.
Remove it.
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently trace_seq_putmem_hex() can only take as a parameter a pointer
to something that is 8 bytes or less, otherwise it will overflow the
buffer. This is protected by a macro that encompasses the call to
trace_seq_putmem_hex() that has a BUILD_BUG_ON() for the variable before
it is passed in. This is not very robust and if trace_seq_putmem_hex() ever
gets used outside that macro it will cause issues.
Instead of only being able to produce a hex output of memory that is for
a single word, change it to be more robust and allow any size input.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For using trace_seq_*() functions in NMI context, I posted a patch to move
it to the lib/ directory. This caused Andrew Morton to take a look at the code.
He went through and gave a lot of comments about missing kernel doc,
inconsistent types for the save variable, mix match of EXPORT_SYMBOL_GPL()
and EXPORT_SYMBOL() as well as missing EXPORT_SYMBOL*()s. There were
a few comments about the way variables were being compared (int vs uint).
All these were good review comments and should be implemented regardless of
if trace_seq.c should be moved to lib/ or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_seq_*() functions are a nice utility that allows users to manipulate
buffers with printf() like formats. It has its own trace_seq.h header in
include/linux and should be in its own file. Being tied with trace_output.c
is rather awkward.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Simplify ftrace_hash_disable/enable path in ftrace_hash_move
for hardening the process if the memory allocation failed.
Link: http://lkml.kernel.org/p/20140617110442.15167.81076.stgit@kbuild-fedora.novalocal
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enabled_functions is used to help debug the dynamic function tracing.
Adding what trampolines are attached to files is useful for debugging.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Function graph tracing is a bit different than the function tracers, as
it is processed after either the ftrace_caller or ftrace_regs_caller
and we only have one place to modify the jump to ftrace_graph_caller,
the jump needs to happen after the restore of registeres.
The function graph tracer is dependent on the function tracer, where
even if the function graph tracing is going on by itself, the save and
restore of registers is still done for function tracing regardless of
if function tracing is happening, before it calls the function graph
code.
If there's no function tracing happening, it is possible to just call
the function graph tracer directly, and avoid the wasted effort to save
and restore regs for function tracing.
This requires adding new flags to the dyn_ftrace records:
FTRACE_FL_TRAMP
FTRACE_FL_TRAMP_EN
The first is set if the count for the record is one, and the ftrace_ops
associated to that record has its own trampoline. That way the mcount code
can call that trampoline directly.
In the future, trampolines can be added to arbitrary ftrace_ops, where you
can have two or more ftrace_ops registered to ftrace (like kprobes and perf)
and if they are not tracing the same functions, then instead of doing a
loop to check all registered ftrace_ops against their hashes, just call the
ftrace_ops trampoline directly, which would call the registered ftrace_ops
function directly.
Without this patch perf showed:
0.05% hackbench [kernel.kallsyms] [k] ftrace_caller
0.05% hackbench [kernel.kallsyms] [k] arch_local_irq_save
0.05% hackbench [kernel.kallsyms] [k] native_sched_clock
0.04% hackbench [kernel.kallsyms] [k] __buffer_unlock_commit
0.04% hackbench [kernel.kallsyms] [k] preempt_trace
0.04% hackbench [kernel.kallsyms] [k] prepare_ftrace_return
0.04% hackbench [kernel.kallsyms] [k] __this_cpu_preempt_check
0.04% hackbench [kernel.kallsyms] [k] ftrace_graph_caller
See that the ftrace_caller took up more time than the ftrace_graph_caller
did.
With this patch:
0.05% hackbench [kernel.kallsyms] [k] __buffer_unlock_commit
0.04% hackbench [kernel.kallsyms] [k] call_filter_check_discard
0.04% hackbench [kernel.kallsyms] [k] ftrace_graph_caller
0.04% hackbench [kernel.kallsyms] [k] sched_clock
The ftrace_caller is no where to be found and ftrace_graph_caller still
takes up the same percentage.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The usage of uprobe_buffer_enable() added by dcad1a20 is very wrong,
1. uprobe_buffer_enable() and uprobe_buffer_disable() are not balanced,
_enable() should be called only if !enabled.
2. If uprobe_buffer_enable() fails probe_event_enable() should clear
tp.flags and free event_file_link.
3. If uprobe_register() fails it should do uprobe_buffer_disable().
Link: http://lkml.kernel.org/p/20140627170146.GA18332@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Fixes: dcad1a204f "tracing/uprobes: Fetch args before reserving a ring buffer"
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I do not know why dd9fa555d7 "tracing/uprobes: Move argument fetching
to uprobe_dispatcher()" added the UPROBE_HANDLER_REMOVE, but it looks
wrong.
OK, perhaps it makes sense to avoid store_trace_args() if the tracee is
nacked by uprobe_perf_filter(). But then we should kill the same code
in uprobe_perf_func() and unify the TRACE/PROFILE filtering (we need to
do this anyway to mix perf/ftrace). Until then this code actually adds
the pessimization because uprobe_perf_filter() will be called twice and
return T in likely case.
Link: http://lkml.kernel.org/p/20140627170143.GA18329@redhat.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This reverts commit 43fe98913c.
This patch is very wrong. Firstly, this change leads to unbalanced
uprobe_unregister(). Just for example,
# perf probe -x /lib/libc.so.6 syscall
# echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
# perf record -e probe_libc:syscall whatever
after that uprobe is dead (unregistered) but the user of ftrace/perf
can't know this, and it looks as if nobody hits this probe.
This would be easy to fix, but there are other reasons why it is not
simple to mix ftrace and perf. If nothing else, they can't share the
same ->consumer.filter. This is fixable too, but probably we need to
fix the poorly designed uprobe_register() interface first. At least
"register" and "apply" should be clearly separated.
Link: http://lkml.kernel.org/p/20140627170136.GA18319@redhat.com
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org # v3.14
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace dynamic record has a flags element that also has a counter.
Instead of hard coding "rec->flags & ~FTRACE_FL_MASK" all over the
place. Use a macro instead.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When registering a function callback for the function tracer, the ops
can specify if it wants to save full regs (like an interrupt would)
for each function that it traces, or if it does not care about regs
and just wants to have the fastest return possible.
Once a ops has registered a function, if other ops register that
function they all will receive the regs too. That's because it does
the work once, it does it for everyone.
Now if the ops wanting regs unregisters the function so that there's
only ops left that do not care about regs, those ops will still
continue getting regs and going through the work for it on that
function. This is because the disabling of the rec counter only
sees the ops registered, and does not see the ops that are still
attached, and does not know if the current ops that are still attached
want regs or not. To play it safe, it just keeps regs being processed
until no function is registered anymore.
Instead of doing that, check the ops that are still registered for that
function and if none want regs for it anymore, then disable the
processing of regs.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
if "possible cpus" is greater than actual CPUs (including offline CPUs).
Namhyung Kim did some reviews of the patches I sent this merge window and
found a memory leak and had a few clean ups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTmcYfAAoJEKQekfcNnQGusVIH/2wvfh1D4Mu+qCUsG7HqMLWM
wHhTHcJsiE5rBpfcvc+XoLLBMGn9IKCeClGG59KYJGzznbJHHLmk1dy4qdSqNAel
POTEtGh+AUX0wFBZtVLl2AesFZRsbtaxoNqD0ZBLhkHCV9FBEKbsm8uDCtJIf8wR
vDDz27nPXkiFWX+wM9Z+tFSL7rohxvwREKlQjfk5Z9plyUcURE8GDbrWl870ZSJX
uYzSYzioCyYdy/cpasDuTX22/I+nNUxA4dWTeyciigYC+6IqLJRIwqEXJ4RjYmfm
v9+hf6KkqF8CY6mDjObLkKmy0gubPBQ8Op1G/3ChluUrjkwdffgb6oiLb352yHY=
=XD1A
-----END PGP SIGNATURE-----
Merge tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing cleanups and bugfixes from Steven Rostedt:
"One bug fix that goes back to 3.10. Accessing a non existent buffer
if "possible cpus" is greater than actual CPUs (including offline
CPUs).
Namhyung Kim did some reviews of the patches I sent this merge window
and found a memory leak and had a few clean ups"
* tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix check of ftrace_trace_arrays list_empty() check
tracing: Fix leak of per cpu max data in instances
tracing: Cleanup saved_cmdlines_size changes
ring-buffer: Check if buffer exists before polling
Pull more perf updates from Ingo Molnar:
"A second round of perf updates:
- wide reaching kprobes sanitization and robustization, with the hope
of fixing all 'probe this function crashes the kernel' bugs, by
Masami Hiramatsu.
- uprobes updates from Oleg Nesterov: tmpfs support, corner case
fixes and robustization work.
- perf tooling updates and fixes from Jiri Olsa, Namhyung Ki, Arnaldo
et al:
* Add support to accumulate hist periods (Namhyung Kim)
* various fixes, refactorings and enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
perf: Differentiate exec() and non-exec() comm events
perf: Fix perf_event_comm() vs. exec() assumption
uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
perf/documentation: Add description for conditional branch filter
perf/x86: Add conditional branch filtering support
perf/tool: Add conditional branch filter 'cond' to perf record
perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
uprobes: Teach copy_insn() to support tmpfs
uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
perf/x86: Use common PMU interrupt disabled code
perf/ARM: Use common PMU interrupt disabled code
perf: Disable sampled events if no PMU interrupt
perf: Fix use after free in perf_remove_from_context()
perf tools: Fix 'make help' message error
perf record: Fix poll return value propagation
perf tools: Move elide bool into perf_hpp_fmt struct
perf tools: Remove elide setup for SORT_MODE__MEMORY mode
perf tools: Fix "==" into "=" in ui_browser__warning assignment
perf tools: Allow overriding sysfs and proc finding with env var
perf tools: Consider header files outside perf directory in tags target
...
The check that tests if ftrace_trace_arrays is empty in
top_trace_array(), uses the .prev pointer:
if (list_empty(ftrace_trace_arrays.prev))
instead of testing the variable itself:
if (list_empty(&ftrace_trace_arrays))
Although it is technically correct, it is awkward and confusing.
Use the proper method.
Link: http://lkml.kernel.org/r/87oay1bas8.fsf@sejong.aot.lge.com
Reported-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The freeing of an instance, if max data is configured, there will be
per cpu data structures created. But these are not freed when the instance
is deleted, which causes a memory leak.
A new helper function is added that frees the individual buffers within a
trace array, instead of duplicating the code. This way changes made for one
are applied to the other (normal buffer vs max buffer).
Link: http://lkml.kernel.org/r/87k38pbake.fsf@sejong.aot.lge.com
Reported-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The recent addition of saved_cmdlines_size file had some remaining
(minor - mostly coding style) issues. Fix them by passing pointer
name to sizeof() and using scnprintf().
Link: http://lkml.kernel.org/p/1402384295-23680-1-git-send-email-namhyung@kernel.org
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The per_cpu buffers are created one per possible CPU. But these do
not mean that those CPUs are online, nor do they even exist.
With the addition of the ring buffer polling, it assumes that the
caller polls on an existing buffer. But this is not the case if
the user reads trace_pipe from a CPU that does not exist, and this
causes the kernel to crash.
Simple fix is to check the cpu against buffer bitmask against to see
if the buffer was allocated or not and return -ENODEV if it is
not.
More updates were done to pass the -ENODEV back up to userspace.
Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
to help out the rest of the kernel to ease their use of trace events.
The big change for this release is the allowing of other tracers,
such as the latency tracers, to be used in the trace instances and allow
for function or function graph tracing to be in the top level
simultaneously.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTlbUqAAoJEKQekfcNnQGuP+8H+wTBG06beHsqe6XcaeXcKNkt
Mimm0O04oQdw89CBWeJvXyOwRTtiN4M/4hxHXBTDtChxM9oUyWw1o0IpSMMuQ16O
w9r3DfC8e1air+ufEuYWM0QNtyzHi8EfDSNia55ON5jvtkCZTXOEKZD+n8M9w28p
I7PVgr0PDztsCpethCpg0M8beK9zuQPWMzsHAQCsKI06Xl5z33kPIJR15Exh+Kr1
uVVTZW7JFVAPuSnteLSIx9pN6OjsVGzOZCljg+O+9/v/02u5nkMiS2nURxae86kg
RTSiRYT6Hvl/MCBhdss/w5kgSk6BYiZ0hXbLtwetvre+vQrOR5CnDw2DxZ7e+gU=
=oudH
-----END PGP SIGNATURE-----
Merge tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Lots of tweaks, small fixes, optimizations, and some helper functions
to help out the rest of the kernel to ease their use of trace events.
The big change for this release is the allowing of other tracers, such
as the latency tracers, to be used in the trace instances and allow
for function or function graph tracing to be in the top level
simultaneously"
* tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
tracing: Fix memory leak on instance deletion
tracing: Fix leak of ring buffer data when new instances creation fails
tracing/kprobes: Avoid self tests if tracing is disabled on boot up
tracing: Return error if ftrace_trace_arrays list is empty
tracing: Only calculate stats of tracepoint benchmarks for 2^32 times
tracing: Convert stddev into u64 in tracepoint benchmark
tracing: Introduce saved_cmdlines_size file
tracing: Add __get_dynamic_array_len() macro for trace events
tracing: Remove unused variable in trace_benchmark
tracing: Eliminate double free on failure of allocation on boot up
ftrace/x86: Call text_ip_addr() instead of the duplicated code
tracing: Print max callstack on stacktrace bug
tracing: Move locking of trace_cmdline_lock into start/stop seq calls
tracing: Try again for saved cmdline if failed due to locking
tracing: Have saved_cmdlines use the seq_read infrastructure
tracing: Add tracepoint benchmark tracepoint
tracing: Print nasty banner when trace_printk() is in use
tracing: Add funcgraph_tail option to print function name after closing braces
tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines
tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
...
When an instance is created, it also gets a snapshot ring buffer
allocated (with minimum of pages). But when it is deleted the snapshot
buffer is not. There was a helper function added to match the allocation
of these ring buffers to a way to free them, but it wasn't used by
the deletion of an instance. Using that helper function solves this
memory leak.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Yoshihiro Yunomae reported that the ring buffer data for a trace
instance does not get properly cleaned up when it fails. He proposed
a patch that manually cleaned the data up and addad a bunch of labels.
The labels are not needed because all trace array is allocated with
a kzalloc which initializes it to 0 and all kfree()s can take a NULL
pointer and will ignore it.
Adding a new helper function free_trace_buffers() that can also take
null buffers to free the buffers that were allocated by
allocate_trace_buffers().
Link: http://lkml.kernel.org/r/20140605223522.32311.31664.stgit@yunodevel
Reported-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If tracing is disabled on boot up, the kernel should not execute tracing
self tests. The kernel should check whether tracing is disabled or not
before executing any of the tracing self tests.
Link: http://lkml.kernel.org/p/20140605223520.32311.56097.stgit@yunodevel
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_trace_arrays links global_trace.list. However, global_trace
is not added to ftrace_trace_arrays if trace_alloc_buffers() failed.
As the result, ftrace_trace_arrays becomes an empty list. If
ftrace_trace_arrays is an empty list, current top_trace_array() returns
an invalid pointer. As the result, the kernel can induce memory corruption
or panic.
Current implementation does not check whether ftrace_trace_arrays is empty
list or not. So, in this patch, if ftrace_trace_arrays is empty list,
top_trace_array() returns NULL. Moreover, this patch makes all functions
calling top_trace_array() handle it appropriately.
Link: http://lkml.kernel.org/p/20140605223517.32311.99233.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When calculating the average and standard deviation, it is required that
the count be less than UINT_MAX, otherwise the do_div() will get
undefined results. After 2^32 counts of data, the average and standard
deviation should pretty much be set anyway.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I've been told that do_div() expects an unsigned 64 bit number, and
is undefined if a signed is used. This gave a warning on the MIPS
build. I'm not sure if a signed 64 bit dividend is really an issue
or not, but the calculation this is used for is standard deviation,
and that isn't going to be negative. We can just convert it to
unsigned and be safe.
Reported-by: David Daney <ddaney.cavm@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Introduce saved_cmdlines_size file for changing the number of saved pid-comms.
saved_cmdlines currently stores 128 command names using SAVED_CMDLINES, but
'no-existing processes' names are often lost in saved_cmdlines when we
read the trace data. So, by introducing saved_cmdlines_size file, we can
now change the 128 command names saved to something much larger if needed.
When we write a value to saved_cmdlines_size, the number of the value will
be stored in pid-comm list:
# echo 1024 > /sys/kernel/debug/tracing/saved_cmdlines_size
Here, 1024 command names can be stored. The default number is 128 and the maximum
number is PID_MAX_DEFAULT (=32768 if CONFIG_BASE_SMALL is not set). So, if we
want to avoid losing any command names, we need to set 32768 to
saved_cmdlines_size.
We can read the maximum number of the list:
# cat /sys/kernel/debug/tracing/saved_cmdlines_size
128
Link: http://lkml.kernel.org/p/20140605012427.22115.16173.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Somehow this unused variable warning sneaked past my warnings check
(probably due to it depending on a new config).
kernel/trace/trace_benchmark.c: In function 'trace_do_benchmark':
kernel/trace/trace_benchmark.c:38:6: warning: unused variable 'seedsq' [-Wunused-variable]
u64 seedsq;
^
Link: http://lkml.kernel.org/r/20140604160921.4f4e69c4@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If allocation of the max_buffer fails on boot up, the error path will
free both per_cpu data structures from the buffers. With the new redesign
of the code, those structures are freed if allocations failed. That is,
the helper function that allocates the buffers will free the per cpu data
on failure. No need to do it again. In fact, the second free will cause
a bug as the code can not handle a double free.
Link: http://lkml.kernel.org/p/20140603042803.27308.30956.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the conversion of the saved_cmdlines output to use seq_read, there
is now a race between accessing the values of the saved_cmdlines and
the writing to them. The trace_cmdline_lock needs to be taken at
the start and stop of the seq calls.
A new __trace_find_cmdline() call is created to allow for the look up
to happen without taking the lock.
Fixes: 42584c81c5 tracing: Have saved_cmdlines use the seq_read infrastructure
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to prevent the saved cmdline cache from being filled when
tracing is not active, the comms are only recorded after a trace event
is recorded.
The problem is, a comm can fail to be recorded if the trace_cmdline_lock
is held. That lock is taken via a trylock to allow it to happen from
any context (including NMI). If the lock fails to be taken, the comm
is skipped. No big deal, as we will try again later.
But! Because of the code that was added to only record after an event,
we may not try again later as the recording is made as a oneshot per
event per CPU.
Only disable the recording of the comm if the comm is actually recorded.
Fixes: 7ffbd48d5c "tracing: Cache comms only after an event occurred"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Current tracing_saved_cmdlines_read() implementation is naive; It allocates
a large buffer, constructs output data to that buffer for each read
operation, and then copies a portion of the buffer to the user space
buffer. This has several issues such as slow memory allocation, high
CPU usage, and even corruption of the output data.
The seq_read infrastructure is made to handle this type of work.
By converting it to use seq_read() the code becomes smaller, simplified,
as well as correct.
Link: http://lkml.kernel.org/p/20140220084431.3839.51793.stgit@yunodevel
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to help benchmark the time tracepoints take, a new config
option is added called CONFIG_TRACEPOINT_BENCHMARK. When this option
is set a tracepoint is created called "benchmark:benchmark_event".
When the tracepoint is enabled, it kicks off a kernel thread that
goes into an infinite loop (calling cond_sched() to let other tasks
run), and calls the tracepoint. Each iteration will record the time
it took to write to the tracepoint and the next iteration that
data will be passed to the tracepoint itself. That is, the tracepoint
will report the time it took to do the previous tracepoint.
The string written to the tracepoint is a static string of 128 bytes
to keep the time the same. The initial string is simply a write of
"START". The second string records the cold cache time of the first
write which is not added to the rest of the calculations.
As it is a tight loop, it benchmarks as hot cache. That's fine because
we care most about hot paths that are probably in cache already.
An example of the output:
START
first=3672 [COLD CACHED]
last=632 first=3672 max=632 min=632 avg=316 std=446 std^2=199712
last=278 first=3672 max=632 min=278 avg=303 std=316 std^2=100337
last=277 first=3672 max=632 min=277 avg=296 std=258 std^2=67064
last=273 first=3672 max=632 min=273 avg=292 std=224 std^2=50411
last=273 first=3672 max=632 min=273 avg=288 std=200 std^2=40389
last=281 first=3672 max=632 min=273 avg=287 std=183 std^2=33666
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_printk() is used to debug fast paths within the kernel. Places
that gets called in any context (interrupt or NMI) or thousands of
times a second. Something you do not want to do with a printk().
In order to make it completely lockless as it needs a temporary buffer
to handle some of the string formatting, a page is created per cpu for
every context (four per cpu; normal, softirq, irq, NMI).
Since trace_printk() should only be used for debugging purposes,
there's no reason to waste memory on these buffers on a production
system. That means, trace_printk() should never be used unless a
developer is debugging their kernel. There's macro magic to allocate
the buffers if trace_printk() is used anywhere in the kernel.
To help enforce that trace_printk() isn't used outside of development,
when it is used, a nasty banner is displayed on bootup (or when a module
is loaded that uses trace_printk() and the kernel core does not).
Here's the banner:
**********************************************************
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
** **
** trace_printk() being used. Allocating extra memory. **
** **
** This means that this is a DEBUG kernel and it is **
** unsafe for produciton use. **
** **
** If you see this message and you are not debugging **
** the kernel, report this immediately to your vendor! **
** **
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
**********************************************************
That should hopefully keep developers from trying to sneak in a
trace_printk() or two.
Link: http://lkml.kernel.org/p/20140528131440.2283213c@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the function-graph tracer, add a funcgraph_tail option
to print the function name on all } lines, not just
functions whose first line is no longer in the trace
buffer.
If a function calls other traced functions, its total
time appears on its } line. This change allows grep
to be used to determine the function for which the
line corresponds.
Update Documentation/trace/ftrace.txt to describe
this new option.
Link: http://lkml.kernel.org/p/20140520221041.8359.6782.stgit@beardog.cce.hp.com
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Eliminate duplicate TRACE_GRAPH_PRINT_xx defines
in trace_functions_graph.c that are already in
trace.h.
Add TRACE_GRAPH_PRINT_IRQS to trace.h, which is
the only one that is missing.
Link: http://lkml.kernel.org/p/20140520221031.8359.24733.stgit@beardog.cce.hp.com
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Being able to show a cpumask of events can be useful as some events
may affect only some CPUs. There is no standard way to record the
cpumask and converting it to a string is rather expensive during
the trace as traces happen in hotpaths. It would be better to record
the raw event mask and be able to parse it at print time.
The following macros were added for use with the TRACE_EVENT() macro:
__bitmask()
__assign_bitmask()
__get_bitmask()
To test this, I added this to the sched_migrate_task event, which
looked like this:
TRACE_EVENT(sched_migrate_task,
TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),
TP_ARGS(p, dest_cpu, cpus),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( int, prio )
__field( int, orig_cpu )
__field( int, dest_cpu )
__bitmask( cpumask, num_possible_cpus() )
),
TP_fast_assign(
memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
__entry->pid = p->pid;
__entry->prio = p->prio;
__entry->orig_cpu = task_cpu(p);
__entry->dest_cpu = dest_cpu;
__assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
),
TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
__entry->comm, __entry->pid, __entry->prio,
__entry->orig_cpu, __entry->dest_cpu,
__get_bitmask(cpumask))
);
With the output of:
ksmtuned-3613 [003] d..2 485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
migration/1-13 [001] d..5 485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
awk-3615 [002] d.H5 485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
migration/2-18 [002] d..5 485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f
Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home
Suggested-by: Javi Merino <javi.merino@arm.com>
Tested-by: Javi Merino <javi.merino@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the decision to what needs to be done (converting a call to the
ftrace_caller to ftrace_caller_regs or to convert from ftrace_caller_regs
to ftrace_caller) can easily be determined from the rec->flags of
FTRACE_FL_REGS and FTRACE_FL_REGS_EN, there's no need to have the
ftrace_check_record() return either a UPDATE_MODIFY_CALL_REGS or a
UPDATE_MODIFY_CALL. Just he latter is enough. This added flag causes
more complexity than is required. Remove it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the moving of the functions that determine what the mcount call site
should be replaced with into the generic code, there is a few places
in the generic code that can use them instead of hard coding it as it
does.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move and rename get_ftrace_addr() and get_ftrace_addr_old() to
ftrace_get_addr_new() and ftrace_get_addr_curr() respectively.
This moves these two helper functions in the generic code out from
the arch specific code, and renames them to have a better generic
name. This will allow other archs to use them as well as makes it
a bit easier to work on getting separate trampolines for different
functions.
ftrace_get_addr_new() returns the trampoline address that the mcount
call address will be converted to.
ftrace_get_addr_curr() returns the trampoline address of what the
mcount call address currently jumps to.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_hash_empty() function is a simple test:
return !hash || !hash->count;
But gcc seems to want to make it a call. As this is in an extreme
hot path of the function tracer, there's no reason it needs to be
a call. I only wrote it to be a helper function anyway, otherwise
it would have been inlined manually.
Force gcc to inline it, as it could have also been a macro.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Back in 2011 Commit ed926f9b35 "ftrace: Use counters to enable
functions to trace" changed the way ftrace accounts for enabled
and disabled traced functions. There was a comment started as:
/*
*
*/
But never finished. Well, that's rather useless. I probably forgot
to save the file before committing it. And it passed review from all
this time.
Anyway, better late than never. I updated the comment to express what
is happening in that somewhat complex code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 4104d326b6 "ftrace: Remove global function list and call
function directly" cleaned up the global_ops filtering and made
the code simpler, but it left a variable "hash_enable" that was used
to know if the hash functions should be updated or not. It was
updated if the global_ops did not override them. As the global_ops
are now no different than any other ftrace_ops, the hash always
gets updated and there's no reason to use the hash_enable boolean.
The same goes for hash_disable used in ftrace_shutdown().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Replace uses of &__get_cpu_var for address calculation with this_cpu_ptr.
Link: http://lkml.kernel.org/p/alpine.DEB.2.10.1404291415560.18364@gentwo.org
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As trace event triggers are now part of the mainline kernel, I added
my trace event trigger tests to my test suite I run on all my kernels.
Now these tests get run under different config options, and one of
those options is CONFIG_PROVE_RCU, which checks under lockdep that
the rcu locking primitives are being used correctly. This triggered
the following splat:
===============================
[ INFO: suspicious RCU usage. ]
3.15.0-rc2-test+ #11 Not tainted
-------------------------------
kernel/trace/trace_events_trigger.c:80 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
4 locks held by swapper/1/0:
#0: ((&(&j_cdbs->work)->timer)){..-...}, at: [<ffffffff8104d2cc>] call_timer_fn+0x5/0x1be
#1: (&(&pool->lock)->rlock){-.-...}, at: [<ffffffff81059856>] __queue_work+0x140/0x283
#2: (&p->pi_lock){-.-.-.}, at: [<ffffffff8106e961>] try_to_wake_up+0x2e/0x1e8
#3: (&rq->lock){-.-.-.}, at: [<ffffffff8106ead3>] try_to_wake_up+0x1a0/0x1e8
stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc2-test+ #11
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
0000000000000001 ffff88007e083b98 ffffffff819f53a5 0000000000000006
ffff88007b0942c0 ffff88007e083bc8 ffffffff81081307 ffff88007ad96d20
0000000000000000 ffff88007af2d840 ffff88007b2e701c ffff88007e083c18
Call Trace:
<IRQ> [<ffffffff819f53a5>] dump_stack+0x4f/0x7c
[<ffffffff81081307>] lockdep_rcu_suspicious+0x107/0x110
[<ffffffff810ee51c>] event_triggers_call+0x99/0x108
[<ffffffff810e8174>] ftrace_event_buffer_commit+0x42/0xa4
[<ffffffff8106aadc>] ftrace_raw_event_sched_wakeup_template+0x71/0x7c
[<ffffffff8106bcbf>] ttwu_do_wakeup+0x7f/0xff
[<ffffffff8106bd9b>] ttwu_do_activate.constprop.126+0x5c/0x61
[<ffffffff8106eadf>] try_to_wake_up+0x1ac/0x1e8
[<ffffffff8106eb77>] wake_up_process+0x36/0x3b
[<ffffffff810575cc>] wake_up_worker+0x24/0x26
[<ffffffff810578bc>] insert_work+0x5c/0x65
[<ffffffff81059982>] __queue_work+0x26c/0x283
[<ffffffff81059999>] ? __queue_work+0x283/0x283
[<ffffffff810599b7>] delayed_work_timer_fn+0x1e/0x20
[<ffffffff8104d3a6>] call_timer_fn+0xdf/0x1be^M
[<ffffffff8104d2cc>] ? call_timer_fn+0x5/0x1be
[<ffffffff81059999>] ? __queue_work+0x283/0x283
[<ffffffff8104d823>] run_timer_softirq+0x1a4/0x22f^M
[<ffffffff8104696d>] __do_softirq+0x17b/0x31b^M
[<ffffffff81046d03>] irq_exit+0x42/0x97
[<ffffffff81a08db6>] smp_apic_timer_interrupt+0x37/0x44
[<ffffffff81a07a2f>] apic_timer_interrupt+0x6f/0x80
<EOI> [<ffffffff8100a5d8>] ? default_idle+0x21/0x32
[<ffffffff8100a5d6>] ? default_idle+0x1f/0x32
[<ffffffff8100ac10>] arch_cpu_idle+0xf/0x11
[<ffffffff8107b3a4>] cpu_startup_entry+0x1a3/0x213
[<ffffffff8102a23c>] start_secondary+0x212/0x219
The cause is that the triggers are protected by rcu_read_lock_sched() but
the data is dereferenced with rcu_dereference() which expects it to
be protected with rcu_read_lock(). The proper reference should be
rcu_dereference_sched().
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 4104d326b6 "ftrace: Remove global function list and call
function directly" cleaned up the global_ops filtering and made
the code simpler. But it left out function graph filtering which
also depended on that code. The function graph filtering still
needs to use global_ops as the filter otherwise it wont filter
at all.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
uprobe_perf_open()->uprobe_apply() can fail, but this error is wrongly
ignored. Change uprobe_perf_open() to do uprobe_perf_close() and return
the error code in this case.
Change uprobe_perf_close() to propogate the error from uprobe_apply()
as well, although it should not fail.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Preparation. Move uprobe_perf_close() up before uprobe_perf_open() to
avoid the forward declaration in the next patch and make it readable.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Now that the ring buffer has a built in way to wake up readers
when there's data, using irq_work such that it is safe to do it
in any context. But it was still using the old "poor man's"
wait polling that checks every 1/10 of a second to see if it
should wake up a waiter. This makes the latency for a wake up
excruciatingly long. No need to do that anymore.
Completely remove the different wait_poll types from the tracers
and have them all use the default one now.
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When reading from trace_pipe, if tracing is off but nothing was read
it should block. If something is read and tracing is off, then EOF
is returned. If tracing is on and there's nothing to read, it will block.
But because the check of whether tracing is off and something was read
is done after the block on the pipe, it is hit or miss if the EOF is
returned or not leading to inconsistent behavior.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A race exists between module loading and enabling of function tracer.
CPU 1 CPU 2
----- -----
load_module()
module->state = MODULE_STATE_COMING
register_ftrace_function()
mutex_lock(&ftrace_lock);
ftrace_startup()
update_ftrace_function();
ftrace_arch_code_modify_prepare()
set_all_module_text_rw();
<enables-ftrace>
ftrace_arch_code_modify_post_process()
set_all_module_text_ro();
[ here all module text is set to RO,
including the module that is
loading!! ]
blocking_notifier_call_chain(MODULE_STATE_COMING);
ftrace_init_module()
[ tries to modify code, but it's RO, and fails!
ftrace_bug() is called]
When this race happens, ftrace_bug() will produces a nasty warning and
all of the function tracing features will be disabled until reboot.
The simple solution is to treate module load the same way the core
kernel is treated at boot. To hardcode the ftrace function modification
of converting calls to mcount into nops. This is done in init/main.c
there's no reason it could not be done in load_module(). This gives
a better control of the changes and doesn't tie the state of the
module to its notifiers as much. Ftrace is special, it needs to be
treated as such.
The reason this would work, is that the ftrace_module_init() would be
called while the module is in MODULE_STATE_UNFORMED, which is ignored
by the set_all_module_text_ro() call.
Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com
Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions ftrace_set_global_filter() and ftrace_set_global_notrace()
still have their old names in the kernel doc (ftrace_set_filter and
ftrace_set_notrace respectively). Replace these with the real names.
Link: http://lkml.kernel.org/p/1398006644-5935-3-git-send-email-wangjiaxing@insigma.com.cn
Signed-off-by: Jiaxing Wang <wangjiaxing@insigma.com.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When using ftrace_ops_list_func, we should skip 4 instead of 3,
to avoid ftrace_call+0x5/0xb appearing in the stack trace:
Depth Size Location (110 entries)
----- ---- --------
0) 2956 0 update_curr+0xe/0x1e0
1) 2956 68 ftrace_call+0x5/0xb
2) 2888 92 enqueue_entity+0x53/0xe80
3) 2796 80 enqueue_task_fair+0x47/0x7e0
4) 2716 28 enqueue_task+0x45/0x70
5) 2688 12 activate_task+0x22/0x30
Add a function using_ftrace_ops_list_func() to test for this while keeping
ftrace_ops_list_func to remain static.
Link: http://lkml.kernel.org/p/1398006644-5935-2-git-send-email-wangjiaxing@insigma.com.cn
Signed-off-by: Jiaxing Wang <wangjiaxing@insigma.com.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use NOKPROBE_SYMBOL macro to protect functions from
kprobes instead of __kprobes annotation in ftrace.
This applies nokprobe_inline annotation for some cases,
because NOKPROBE_SYMBOL() will inhibit inlining by
referring the symbol address.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140417081828.26341.55152.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need to prohibit probing on the functions
used for preparation and uprobe only fetch functions.
Those are safely probed because those are not invoked
from kprobe's breakpoint/fault/debug handlers. So there
is no chance to cause recursive exceptions.
Following functions are now removed from the kprobes blacklist:
update_bitfield_fetch_param
free_bitfield_fetch_param
kprobe_register
FETCH_FUNC_NAME(stack, type) in trace_uprobe.c
FETCH_FUNC_NAME(memory, type) in trace_uprobe.c
FETCH_FUNC_NAME(memory, string) in trace_uprobe.c
FETCH_FUNC_NAME(memory, string_size) in trace_uprobe.c
FETCH_FUNC_NAME(file_offset, type) in trace_uprobe.c
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140417081800.26341.56504.stgit@ltc230.yrl.intra.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds static to the following functions:
-cycle_t buffer_ftrace_now
-void free_snapshot
-int trace_selftest_startup_dynamic_tracing
Link: http://lkml.kernel.org/p/20140417214442.d7abc7c0b0e4b90e7fedecc9@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of initializing the pm notifier block in register_ftrace_graph(),
initialize it statically. This safes us some code.
Found in the PaX patch, written by the PaX Team.
Link: http://lkml.kernel.org/p/1396186310-3156-1-git-send-email-minipli@googlemail.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: PaX Team <pageexec@freemail.hu>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The irqsoff, preemptoff and preemptirqsoff tracers can now be used by
instances. But they may only be used by one instance at a time (including
the top level directory). This allows multiple tracers to run while the
irqsoff (and friends) tracer is running simultaneously.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The wakeup and wakeup_rt tracers can now be used by instances.
But they may only be used by one instance at a time (including the
top level directory). This allows multiple tracers to run while
the wakeup tracer is running simultaneously.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for having tracers enabled in instances, the max_lock
should be unique as updating the max for one tracer is a separate
operation than updating it for another tracer using a different max.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for letting the latency tracers be used by instances,
remove the global tracing_max_latency variable and add a max_latency
field to the trace_array that the latency tracers will now use.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of having a list of global functions that are called,
as only one global function is allow to be enabled at a time, there's
no reason to have a list.
Instead, simply have all the users of the global ops, use the global ops
directly, instead of registering their own ftrace_ops. Just switch what
function is used before enabling the function tracer.
This removes a lot of code as well as the complexity involved with it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The first is to remove a duplication of creating debugfs files that
already exist and causes an error report to be printed due to the
failure of the second creation.
The second is a memory leak fix that was introduced in 3.14.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTUGZwAAoJEKQekfcNnQGu7W8IAIAMBVfrWdP6cmGle4tGfhVE
sHcwqTH+07oANQJ3eFwFs5wBMb08s3hXwUHUxXcpjyq2Bs+AHr0vSL/nqCG4k8Ap
2T4ntL7esC1BWKw2lVVVYD12FiL7grUXVlx/q0WE2NuhCzWzNRTyb8sKrPoCRUEB
3o5rAt9+45PKUb2k/eqGBGhK8b4XDz2Wtk5Gj6YB3xttse/yjjcuw0gWMHN1JWfm
eRuQUUBDDGUGkfF98k1aLrjPZooT3LIAV8L8md5C3ebEcXSC/h86hTYCGXv3oBDO
8sxcT0zoQcLuFhjkYLL1J1lBW6gxaVh052jYmQwMppQMos+WID2un2E92Ccg49E=
=BwLF
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This contains two fixes.
The first is to remove a duplication of creating debugfs files that
already exist and causes an error report to be printed due to the
failure of the second creation.
The second is a memory leak fix that was introduced in 3.14"
* tag 'trace-fixes-v3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/uprobes: Fix uprobe_cpu_buffer memory leak
tracing: Do not try to recreated toplevel set_ftrace_* files
With the restructing of the function tracer working with instances, the
"top level" buffer is a bit special, as the function tracing is mapped
to the same set of filters. This is done by using a "global_ops" descriptor
and having the "set_ftrace_filter" and "set_ftrace_notrace" map to it.
When an instance is created, it creates the same files but its for the
local instance and not the global_ops.
The issues is that the local instance creation shares some code with
the global instance one and we end up trying to create th top level
"set_ftrace_*" files twice, and on boot up, we get an error like this:
Could not create debugfs 'set_ftrace_filter' entry
Could not create debugfs 'set_ftrace_notrace' entry
The reason they failed to be created was because they were created
twice, and the second time gives this error as you can not create the
same file twice.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.
Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.
This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
design of tracepoints and how a user could register a tracepoint
and have that tracepoint not be activated but no error was shown.
The design was for an out of tree module but broke in tree users.
The clean up was to remove the saving of the hash table of tracepoint
names such that they can be enabled before they exist (enabling
a module tracepoint before that module is loaded). This added more
complexity than needed. The clean up was to remove that code and
just enable tracepoints that exist or fail if they do not.
This removed a lot of code as well as the complexity that it brought.
As a side effect, instead of registering a tracepoint by its name,
the tracepoint needs to be registered with the tracepoint descriptor.
This removes having to duplicate the tracepoint names that are
enabled.
The second patch was added that simplified the way modules were
searched for.
This cleanup required changes that were in the 3.15 queue as well as
some changes that were added late in the 3.14-rc cycle. This final
change waited till the two were merged in upstream and then the
change was added and full tests were run. Unfortunately, the
test found some errors, but after it was already submitted to the
for-next branch and not to be rebased. Sparse errors were detected
by Fengguang Wu's bot tests, and my internal tests discovered that
the anonymous union initialization triggered a bug in older gcc compilers.
Luckily, there was a bugzilla for the gcc bug which gave a work around
to the problem. The third and fourth patch handled the sparse error
and the gcc bug respectively.
A final patch was tagged along to fix a missing documentation for
the README file.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTR+pwAAoJEKQekfcNnQGuvfoH/A4XZu4/1h2ZuKhzGi6lrrWr
+zHUQ+JmGiAYRziQFwr2t/gqJ2vmDfHJnbDjKi6Emx8JcxesHas6CQOWps4zEic0
dwYSQjvuGNGFIFt+7I0K1OxfVVdt2PQ2lVrB5WgYdbash5J4Bi+09QBv0RbUKheo
37dKSeN3pbsuQsR70OTVP8laG3dA9IbHW7PsKnxIEB5zeIUHUBME/QdPPj/CuJwk
wxZjXC2dbc3rdRlQjTVtWV3ZkGgZJB0k+JxjvZTA0N6u8Hj8LiFPuNawzf7ceBHx
gc++57+WuMW0f0X/ar5/+3UPGFQKMSvKmdxIQCnWXQz5seTYYKDEx7mTH22fxgg=
=OgeQ
-----END PGP SIGNATURE-----
Merge tag 'trace-3.15-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
"This includes the final patch to clean up and fix the issue with the
design of tracepoints and how a user could register a tracepoint and
have that tracepoint not be activated but no error was shown.
The design was for an out of tree module but broke in tree users. The
clean up was to remove the saving of the hash table of tracepoint
names such that they can be enabled before they exist (enabling a
module tracepoint before that module is loaded). This added more
complexity than needed. The clean up was to remove that code and just
enable tracepoints that exist or fail if they do not.
This removed a lot of code as well as the complexity that it brought.
As a side effect, instead of registering a tracepoint by its name, the
tracepoint needs to be registered with the tracepoint descriptor.
This removes having to duplicate the tracepoint names that are
enabled.
The second patch was added that simplified the way modules were
searched for.
This cleanup required changes that were in the 3.15 queue as well as
some changes that were added late in the 3.14-rc cycle. This final
change waited till the two were merged in upstream and then the change
was added and full tests were run. Unfortunately, the test found some
errors, but after it was already submitted to the for-next branch and
not to be rebased. Sparse errors were detected by Fengguang Wu's bot
tests, and my internal tests discovered that the anonymous union
initialization triggered a bug in older gcc compilers. Luckily, there
was a bugzilla for the gcc bug which gave a work around to the
problem. The third and fourth patch handled the sparse error and the
gcc bug respectively.
A final patch was tagged along to fix a missing documentation for the
README file"
* tag 'trace-3.15-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add missing function triggers dump and cpudump to README
tracing: Fix anonymous unions in struct ftrace_event_call
tracepoint: Fix sparse warnings in tracepoint.c
tracepoint: Simplify tracepoint module search
tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints
that commit has fixed only the parts of that mess in fs/splice.c itself;
there had been more in several other ->splice_read() instances...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The debugfs tracing README file lists all the function triggers except for
dump and cpudump. These should be added too.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Register/unregister tracepoint probes with struct tracepoint pointer
rather than tracepoint name.
This change, which vastly simplifies tracepoint.c, has been proposed by
Steven Rostedt. It also removes 8.8kB (mostly of text) to the vmlinux
size.
From this point on, the tracers need to pass a struct tracepoint pointer
to probe register/unregister. A probe can now only be connected to a
tracepoint that exists. Moreover, tracers are responsible for
unregistering the probe before the module containing its associated
tracepoint is unloaded.
text data bss dec hex filename
10443444 4282528 10391552 25117524 17f4354 vmlinux.orig
10434930 4282848 10391552 25109330 17f2352 vmlinux
Link: http://lkml.kernel.org/r/1396992381-23785-2-git-send-email-mathieu.desnoyers@efficios.com
CC: Ingo Molnar <mingo@kernel.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Frank Ch. Eigler <fche@redhat.com>
CC: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ SDR - fixed return val in void func in tracepoint_module_going() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge second patch-bomb from Andrew Morton:
- the rest of MM
- zram updates
- zswap updates
- exit
- procfs
- exec
- wait
- crash dump
- lib/idr
- rapidio
- adfs, affs, bfs, ufs
- cris
- Kconfig things
- initramfs
- small amount of IPC material
- percpu enhancements
- early ioremap support
- various other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (156 commits)
MAINTAINERS: update Intel C600 SAS driver maintainers
fs/ufs: remove unused ufs_super_block_third pointer
fs/ufs: remove unused ufs_super_block_second pointer
fs/ufs: remove unused ufs_super_block_first pointer
fs/ufs/super.c: add __init to init_inodecache()
doc/kernel-parameters.txt: add early_ioremap_debug
arm64: add early_ioremap support
arm64: initialize pgprot info earlier in boot
x86: use generic early_ioremap
mm: create generic early_ioremap() support
x86/mm: sparse warning fix for early_memremap
lglock: map to spinlock when !CONFIG_SMP
percpu: add preemption checks to __this_cpu ops
vmstat: use raw_cpu_ops to avoid false positives on preemption checks
slub: use raw_cpu_inc for incrementing statistics
net: replace __this_cpu_inc in route.c with raw_cpu_inc
modules: use raw_cpu_write for initialization of per cpu refcount.
mm: use raw_cpu ops for determining current NUMA node
percpu: add raw_cpu_ops
slub: fix leak of 'name' in sysfs_slab_add
...
To increase compiler portability there is <linux/compiler.h> which
provides convenience macros for various gcc constructs. Eg: __weak for
__attribute__((weak)). I've replaced all instances of gcc attributes
with the right macro in the kernel subsystem.
Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The purpose of this single series of commits from Srivatsa S Bhat (with
a small piece from Gautham R Shenoy) touching multiple subsystems that use
CPU hotplug notifiers is to provide a way to register them that will not
lead to deadlocks with CPU online/offline operations as described in the
changelog of commit 93ae4f978c (CPU hotplug: Provide lockless versions
of callback registration functions).
The first three commits in the series introduce the API and document it
and the rest simply goes through the users of CPU hotplug notifiers and
converts them to using the new method.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTQow2AAoJEILEb/54YlRxW4QQAJlYRDUzwFJzJzYhltQYuVR+
4D74XMtvXgoJfg3cwdSWvMKKpJZnA9BVN0f7Hcx9wYmgdexYUuHeZJmMNyc3S2+g
KjKBIsugvgmZhHbbLd6TJ6GBbhGT5JLt9VmSfL9zIkveInU1YHFUUqL/mxdHm4J0
BSGKjk2rN3waRJgmY+xfliFLtQjDKFwJpMuvrgtoUyfas3f4sIV43UNbqdvA/weJ
rzedxXOlKH/id4b56lj/4iIzcoL3mwvJJ7r6n0CEMsKv87z09kqR0O+69Tsq/cgs
j17CsvoJOmZGk3QTeKVMQWBsvk6aPoDu3zK83gLbQMt+qjOpSTbJLz/3HZw4/TrW
ss4nuZne1DLMGS+6hoxYbTP+6Ni//Kn+l/LrHc5jb7m1X3lMO4W2aV3IROtIE1rv
lEP1IG01NU4u9YwkVj1dyhrkSp8tLPul4SrUK8W+oNweOC5crjJV7vJbIPJgmYiM
IZN55wln0yVRtR4TX+rmvN0PixsInE8MeaVCmReApyF9pdzul/StxlBze5BKLSJD
cqo1kNPpsmdxoDucqUpQ/gSvy+IOl2qnlisB5PpV93sk7De6TFDYrGHxjYIW7jMf
StXwdCDDQhzd2Q8Kfpp895A1dbIl8rKtwA6bTU2eX+BfMVFzuMdT44cvosx1+UdQ
sWl//rg76nb13dFjvF+q
=SW7Q
-----END PGP SIGNATURE-----
Merge tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull CPU hotplug notifiers registration fixes from Rafael Wysocki:
"The purpose of this single series of commits from Srivatsa S Bhat
(with a small piece from Gautham R Shenoy) touching multiple
subsystems that use CPU hotplug notifiers is to provide a way to
register them that will not lead to deadlocks with CPU online/offline
operations as described in the changelog of commit 93ae4f978c ("CPU
hotplug: Provide lockless versions of callback registration
functions").
The first three commits in the series introduce the API and document
it and the rest simply goes through the users of CPU hotplug notifiers
and converts them to using the new method"
* tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
net/iucv/iucv.c: Fix CPU hotplug callback registration
net/core/flow.c: Fix CPU hotplug callback registration
mm, zswap: Fix CPU hotplug callback registration
mm, vmstat: Fix CPU hotplug callback registration
profile: Fix CPU hotplug callback registration
trace, ring-buffer: Fix CPU hotplug callback registration
xen, balloon: Fix CPU hotplug callback registration
hwmon, via-cputemp: Fix CPU hotplug callback registration
hwmon, coretemp: Fix CPU hotplug callback registration
thermal, x86-pkg-temp: Fix CPU hotplug callback registration
octeon, watchdog: Fix CPU hotplug callback registration
oprofile, nmi-timer: Fix CPU hotplug callback registration
intel-idle: Fix CPU hotplug callback registration
clocksource, dummy-timer: Fix CPU hotplug callback registration
drivers/base/topology.c: Fix CPU hotplug callback registration
acpi-cpufreq: Fix CPU hotplug callback registration
zsmalloc: Fix CPU hotplug callback registration
scsi, fcoe: Fix CPU hotplug callback registration
scsi, bnx2fc: Fix CPU hotplug callback registration
scsi, bnx2i: Fix CPU hotplug callback registration
...
Pull ARM changes from Russell King:
- Perf updates from Will Deacon:
- Support for Qualcomm Krait processors (run perf on your phone!)
- Support for Cortex-A12 (run perf stat on your FPGA!)
- Support for perf_sample_event_took, allowing us to automatically decrease
the sample rate if we can't handle the PMU interrupts quickly enough
(run perf record on your FPGA!).
- Basic uprobes support from David Long:
This patch series adds basic uprobes support to ARM. It is based on
patches developed earlier by Rabin Vincent. That approach of adding
hooks into the kprobes instruction parsing code was not well received.
This approach separates the ARM instruction parsing code in kprobes out
into a separate set of functions which can be used by both kprobes and
uprobes. Both kprobes and uprobes then provide their own semantic action
tables to process the results of the parsing.
- ARMv7M (microcontroller) updates from Uwe Kleine-König
- OMAP DMA updates (recently added Vinod's Ack even though they've been
sitting in linux-next for a few months) to reduce the reliance of
omap-dma on the code in arch/arm.
- SA11x0 changes from Dmitry Eremin-Solenikov and Alexander Shiyan
- Support for Cortex-A12 CPU
- Align support for ARMv6 with ARMv7 so they can cooperate better in a
single zImage.
- Addition of first AT_HWCAP2 feature bits for ARMv8 crypto support.
- Removal of IRQ_DISABLED from various ARM files
- Improved efficiency of virt_to_page() for single zImage
- Patch from Ulf Hansson to permit runtime PM callbacks to be available for
AMBA devices for suspend/resume as well.
- Finally kill asm/system.h on ARM.
* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (89 commits)
dmaengine: omap-dma: more consolidation of CCR register setup
dmaengine: omap-dma: move IRQ handling to omap-dma
dmaengine: omap-dma: move register read/writes into omap-dma.c
ARM: omap: dma: get rid of 'p' allocation and clean up
ARM: omap: move dma channel allocation into plat-omap code
ARM: omap: dma: get rid of errata global
ARM: omap: clean up DMA register accesses
ARM: omap: remove almost-const variables
ARM: omap: remove references to disable_irq_lch
dmaengine: omap-dma: cleanup errata 3.3 handling
dmaengine: omap-dma: provide register read/write functions
dmaengine: omap-dma: use cached CCR value when enabling DMA
dmaengine: omap-dma: move barrier to omap_dma_start_desc()
dmaengine: omap-dma: move clnk_ctrl setting to preparation functions
dmaengine: omap-dma: improve efficiency loading C.SA/C.EI/C.FI registers
dmaengine: omap-dma: consolidate clearing channel status register
dmaengine: omap-dma: move CCR buffering disable errata out of the fast path
dmaengine: omap-dma: provide register definitions
dmaengine: omap-dma: consolidate setup of CCR
dmaengine: omap-dma: consolidate setup of CSDP
...
But there were a few features that were added.
Uprobes now work with event triggers and multi buffers.
Uprobes have support under ftrace and perf.
The big feature is that the function tracer can now be used within the
multi buffer instances. That is, you can now trace some functions
in one buffer, others in another buffer, all functions in a third buffer
and so on. They are basically agnostic from each other. This only
works for the function tracer and not for the function graph trace,
although you can have the function graph tracer running in the top level
buffer (or any tracer for that matter) and have different function tracing
going on in the sub buffers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTOthtAAoJEKQekfcNnQGu5c8H/Ana/U+0tmksp1dbHkRHsKSH
+Fsv4Jeu8gf1NaFKHEhkUTcFtnzE6qAPV2VCrcJwXbhAhhwZm+LjrnWdoy3215S3
cQW4LftLEonh2cM36Cos74TulMEYN6XmL6dQZV+CILKQkDrWU4qJjQ64okXEkqrd
9iG3p/mSXyvJcmnyg61ALnMOhZDLsXY3djBhWBPhiTPGS6BRb9zh4Pmw6Zv0n2rJ
U93Gt/3AQrv1ybu73dUxqP0abp60oXOiWoF/R2jcbKqIM+K9RPJX79unCV3jq3u9
f+6jMlB9PgAMqQj6ihJdwxKDDuzwyrVdEPnsgvl4jarCBCtVVwhKedBaKN/KS8k=
=HdXY
-----END PGP SIGNATURE-----
Merge tag 'trace-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Most of the changes were largely clean ups, and some documentation.
But there were a few features that were added:
Uprobes now work with event triggers and multi buffers and have
support under ftrace and perf.
The big feature is that the function tracer can now be used within the
multi buffer instances. That is, you can now trace some functions in
one buffer, others in another buffer, all functions in a third buffer
and so on. They are basically agnostic from each other. This only
works for the function tracer and not for the function graph trace,
although you can have the function graph tracer running in the top
level buffer (or any tracer for that matter) and have different
function tracing going on in the sub buffers"
* tag 'trace-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (45 commits)
tracing: Add BUG_ON when stack end location is over written
tracepoint: Remove unused API functions
Revert "tracing: Move event storage for array from macro to standalone function"
ftrace: Constify ftrace_text_reserved
tracepoints: API doc update to tracepoint_probe_register() return value
tracepoints: API doc update to data argument
ftrace: Fix compilation warning about control_ops_free
ftrace/x86: BUG when ftrace recovery fails
ftrace: Warn on error when modifying ftrace function
ftrace: Remove freelist from struct dyn_ftrace
ftrace: Do not pass data to ftrace_dyn_arch_init
ftrace: Pass retval through return in ftrace_dyn_arch_init()
ftrace: Inline the code from ftrace_dyn_table_alloc()
ftrace: Cleanup of global variables ftrace_new_pgs and ftrace_update_cnt
tracing: Evaluate len expression only once in __dynamic_array macro
tracing: Correctly expand len expressions from __dynamic_array macro
tracing/module: Replace include of tracepoint.h with jump_label.h in module.h
tracing: Fix event header migrate.h to include tracepoint.h
tracing: Fix event header writeback.h to include tracepoint.h
tracing: Warn if a tracepoint is not set via debugfs
...
Pull core block layer updates from Jens Axboe:
"This is the pull request for the core block IO bits for the 3.15
kernel. It's a smaller round this time, it contains:
- Various little blk-mq fixes and additions from Christoph and
myself.
- Cleanup of the IPI usage from the block layer, and associated
helper code. From Frederic Weisbecker and Jan Kara.
- Duplicate code cleanup in bio-integrity from Gu Zheng. This will
give you a merge conflict, but that should be easy to resolve.
- blk-mq notify spinlock fix for RT from Mike Galbraith.
- A blktrace partial accounting bug fix from Roman Pen.
- Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"
* 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
blk-mq: add REQ_SYNC early
rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
blk-mq: support partial I/O completions
blk-mq: merge blk_mq_insert_request and blk_mq_run_request
blk-mq: remove blk_mq_alloc_rq
blk-mq: don't dump CPU -> hw queue map on driver load
blk-mq: fix wrong usage of hctx->state vs hctx->flags
blk-mq: allow blk_mq_init_commands() to return failure
block: remove old blk_iopoll_enabled variable
blktrace: fix accounting of partially completed requests
smp: Rename __smp_call_function_single() to smp_call_function_single_async()
smp: Remove wait argument from __smp_call_function_single()
watchdog: Simplify a little the IPI call
smp: Move __smp_call_function_single() below its safe version
smp: Consolidate the various smp_call_function_single() declensions
smp: Teach __smp_call_function_single() to check for offline cpus
smp: Remove unused list_head from csd
smp: Iterate functions through llist_for_each_entry_safe()
block: Stop abusing rq->csd.list in blk-softirq
block: Remove useless IPI struct initialization
...
Pull x86 LTO changes from Peter Anvin:
"More infrastructure work in preparation for link-time optimization
(LTO). Most of these changes is to make sure symbols accessed from
assembly code are properly marked as visible so the linker doesn't
remove them.
My understanding is that the changes to support LTO are still not
upstream in binutils, but are on the way there. This patchset should
conclude the x86-specific changes, and remaining patches to actually
enable LTO will be fed through the Kbuild tree (other than keeping up
with changes to the x86 code base, of course), although not
necessarily in this merge window"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
Kbuild, lto: Handle basic LTO in modpost
Kbuild, lto: Disable LTO for asm-offsets.c
Kbuild, lto: Add a gcc-ld script to let run gcc as ld
Kbuild, lto: add ld-version and ld-ifversion macros
Kbuild, lto: Drop .number postfixes in modpost
Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
lto: Disable LTO for sys_ni
lto: Handle LTO common symbols in module loader
lto, workaround: Add workaround for initcall reordering
lto: Make asmlinkage __visible
x86, lto: Disable LTO for the x86 VDSO
initconst, x86: Fix initconst mistake in ts5500 code
initconst: Fix initconst mistake in dcdbas
asmlinkage: Make trace_hardirqs_on/off_caller visible
asmlinkage, x86: Fix 32bit memcpy for LTO
asmlinkage Make __stack_chk_failed and memcmp visible
asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
asmlinkage: Make main_extable_sort_needed visible
asmlinkage, mutex: Mark __visible
asmlinkage: Make trace_hardirq visible
...
Pull scheduler changes from Ingo Molnar:
"Bigger changes:
- sched/idle restructuring: they are WIP preparation for deeper
integration between the scheduler and idle state selection, by
Nicolas Pitre.
- add NUMA scheduling pseudo-interleaving, by Rik van Riel.
- optimize cgroup context switches, by Peter Zijlstra.
- RT scheduling enhancements, by Thomas Gleixner.
The rest is smaller changes, non-urgnt fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
sched: Clean up the task_hot() function
sched: Remove double calculation in fix_small_imbalance()
sched: Fix broken setscheduler()
sparc64, sched: Remove unused sparc64_multi_core
sched: Remove unused mc_capable() and smt_capable()
sched/numa: Move task_numa_free() to __put_task_struct()
sched/fair: Fix endless loop in idle_balance()
sched/core: Fix endless loop in pick_next_task()
sched/fair: Push down check for high priority class task into idle_balance()
sched/rt: Fix picking RT and DL tasks from empty queue
trace: Replace hardcoding of 19 with MAX_NICE
sched: Guarantee task priority in pick_next_task()
sched/idle: Remove stale old file
sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
cpuidle/arm64: Remove redundant cpuidle_idle_call()
cpuidle/powernv: Remove redundant cpuidle_idle_call()
sched, nohz: Exclude isolated cores from load balancing
sched: Fix select_task_rq_fair() description comments
workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
...
Pull perf changes from Ingo Molnar:
"Main changes:
Kernel side changes:
- Add SNB/IVB/HSW client uncore memory controller support (Stephane
Eranian)
- Fix various x86/P4 PMU driver bugs (Don Zickus)
Tooling, user visible changes:
- Add several futex 'perf bench' microbenchmarks (Davidlohr Bueso)
- Speed up thread map generation (Don Zickus)
- Introduce 'perf kvm --list-cmds' command line option for use by
scripts (Ramkumar Ramachandra)
- Print the evsel name in the annotate stdio output, prep to fix
support outputting annotation for multiple events, not just for the
first one (Arnaldo Carvalho de Melo)
- Allow setting preferred callchain method in .perfconfig (Jiri Olsa)
- Show in what binaries/modules 'perf probe's are set (Masami
Hiramatsu)
- Support distro-style debuginfo for uprobe in 'perf probe' (Masami
Hiramatsu)
Tooling, internal changes and fixes:
- Use tid in mmap/mmap2 events to find maps (Don Zickus)
- Record the reason for filtering an address_location (Namhyung Kim)
- Apply all filters to an addr_location (Namhyung Kim)
- Merge al->filtered with hist_entry->filtered in report/hists
(Namhyung Kim)
- Fix memory leak when synthesizing thread records (Namhyung Kim)
- Use ui__has_annotation() in 'report' (Namhyung Kim)
- hists browser refactorings to reuse code accross UIs (Namhyung Kim)
- Add support for the new DWARF unwinder library in elfutils (Jiri
Olsa)
- Fix build race in the generation of bison files (Jiri Olsa)
- Further streamline the feature detection display, trimming it a bit
to show just the libraries detected, using VF=1 gets a more verbose
output, showing the less interesting feature checks as well (Jiri
Olsa).
- Check compatible symtab type before loading dso (Namhyung Kim)
- Check return value of filename__read_debuglink() (Stephane Eranian)
- Move some hashing and fs related code from tools/perf/util/ to
tools/lib/ so that it can be used by more tools/ living utilities
(Borislav Petkov)
- Prepare DWARF unwinding code for using an elfutils alternative
unwinding library (Jiri Olsa)
- Fix DWARF unwind max_stack processing (Jiri Olsa)
- Add dwarf unwind 'perf test' entry (Jiri Olsa)
- 'perf probe' improvements including memory leak fixes, sharing the
intlist class with other tools, uprobes/kprobes code sharing and
use of ref_reloc_sym (Masami Hiramatsu)
- Shorten sample symbol resolving by adding cpumode to struct
addr_location (Arnaldo Carvalho de Melo)
- Fix synthesizing mmaps for threads (Don Zickus)
- Fix invalid output on event group stdio report (Namhyung Kim)
- Fixup header alignment in 'perf sched latency' output (Ramkumar
Ramachandra)
- Fix off-by-one error in 'perf timechart record' argv handling
(Ramkumar Ramachandra)
Tooling, cleanups:
- Remove unused thread__find_map function (Jiri Olsa)
- Remove unused simple_strtoul() function (Ramkumar Ramachandra)
Tooling, documentation updates:
- Update function names in debug messages (Ramkumar Ramachandra)
- Update some code references in design.txt (Ramkumar Ramachandra)
- Clarify load-latency information in the 'perf mem' docs (Andi
Kleen)
- Clarify x86 register naming in 'perf probe' docs (Andi Kleen)"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (96 commits)
perf tools: Remove unused simple_strtoul() function
perf tools: Update some code references in design.txt
perf evsel: Update function names in debug messages
perf tools: Remove thread__find_map function
perf annotate: Print the evsel name in the stdio output
perf report: Use ui__has_annotation()
perf tools: Fix memory leak when synthesizing thread records
perf tools: Use tid in mmap/mmap2 events to find maps
perf report: Merge al->filtered with hist_entry->filtered
perf symbols: Apply all filters to an addr_location
perf symbols: Record the reason for filtering an address_location
perf sched: Fixup header alignment in 'latency' output
perf timechart: Fix off-by-one error in 'record' argv handling
perf machine: Factor machine__find_thread to take tid argument
perf tools: Speed up thread map generation
perf kvm: introduce --list-cmds for use by scripts
perf ui hists: Pass evsel to hpp->header/width functions explicitly
perf symbols: Introduce thread__find_cpumode_addr_location
perf session: Change header.misc dump from decimal to hex
perf ui/tui: Reuse generic __hpp__fmt() code
...
While working on my tutorial for 2014 Linux Collaboration Summit
I found that the traceon trigger did not work when conditions were
used. The other triggers worked fine though. Looking into it, it
is because of the way the triggers use the ring buffer to store
the fields it will use for the condition. But if tracing is off, nothing
is stored in the buffer, and the tracepoint exits before calling the
trigger to test the condition. This is fine for all the triggers that
only work when tracing is on, but for traceon trigger that is to
work when tracing is off, nothing happens.
The fix is simple, just use a temp ring buffer to record the event
if tracing is off and the event has a trace event conditional trigger
enabled. The rest of the tracepoint code will work just fine, but
the tracepoint wont be recorded in the other buffers.
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It is difficult to detect a stack overrun when it
actually occurs.
We have observed that this type of corruption is often
silent and can go unnoticed. Once the corrupted region
is examined, the outcome is undefined and often
results in sporadic system crashes.
When the stack tracing feature is enabled, let's check
for this condition and take appropriate action.
Note: init_task doesn't get its stack end location
set to STACK_END_MAGIC.
Link: http://lkml.kernel.org/r/1395669837-30209-1-git-send-email-atomlin@redhat.com
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I originally wrote commit 35bb4399bd to shrink the size of the overhead of
tracepoints by several kilobytes. Later, I received a patch from Vaibhav
Nagarnaik that fixed a bug in the same code that this commit touches. Not
only did it fix a bug, it also removed code and shrunk the size of the
overhead of trace events even more than this commit did.
Since this commit is scheduled for 3.15 and Vaibhav's patch is already in
mainline, I need to revert this patch in order to keep it from conflicting
with Vaibhav's patch. Not to mention, Vaibhav's patch makes this patch
obsolete.
Link: http://lkml.kernel.org/r/20140320225637.0226041b@gandalf.local.home
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In event format strings, the array size is reported in two locations.
One in array subscript and then via the "size:" attribute. The values
reported there have a mismatch.
For e.g., in sched:sched_switch the prev_comm and next_comm character
arrays have subscript values as [32] where as the actual field size is
16.
name: sched_switch
ID: 301
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[32]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[32]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
After bisection, the following commit was blamed:
92edca0 tracing: Use direct field, type and system names
This commit removes the duplication of strings for field->name and
field->type assuming that all the strings passed in
__trace_define_field() are immutable. This is not true for arrays, where
the type string is created in event_storage variable and field->type for
all array fields points to event_storage.
Use __stringify() to create a string constant for the type string.
Also, get rid of event_storage and event_storage_mutex that are not
needed anymore.
also, an added benefit is that this reduces the overhead of events a bit more:
text data bss dec hex filename
8424787 2036472 1302528 11763787 b3804b vmlinux
8420814 2036408 1302528 11759750 b37086 vmlinux.patched
Link: http://lkml.kernel.org/r/1392349908-29685-1-git-send-email-vnagarnaik@google.com
Cc: Laurent Chavey <chavey@google.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Instead, the correct and race-free way of performing the callback
registration is:
cpu_notifier_register_begin();
for_each_online_cpu(cpu)
init_cpu(cpu);
/* Note the use of the double underscored version of the API */
__register_cpu_notifier(&foobar_cpu_notifier);
cpu_notifier_register_done();
Fix the tracing ring-buffer code by using this latter form of callback
registration.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Suggested change from Oleg Nesterov. Fixes incomplete dependencies
for uprobes feature.
Signed-off-by: David A. Long <dave.long@linaro.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
With CONFIG_DYNAMIC_FTRACE=n, I see a warning:
kernel/trace/ftrace.c:240:13: warning: 'control_ops_free' defined but not used
static void control_ops_free(struct ftrace_ops *ops)
^
Move that function around to an already existing #ifdef
CONFIG_DYNAMIC_FTRACE block as the function is used solely from the
dynamic function tracing functions.
Link: http://lkml.kernel.org/r/1394484131-5107-1-git-send-email-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Recent issues with user space callchains processing within
page fault handler tracing showed as Peter said 'there's
just too much fail surface'.
The user space stack dump is just another source of the this issue.
Related list discussions:
http://marc.info/?t=139302086500001&r=1&w=2http://marc.info/?t=139301437300003&r=1&w=2
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393775800-13524-3-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent issues with user space callchains processing within
page fault handler tracing showed as Peter said 'there's
just too much fail surface'.
Related list discussions:
http://marc.info/?t=139302086500001&r=1&w=2http://marc.info/?t=139301437300003&r=1&w=2
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393775800-13524-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should print some warning and kill ftrace functionality when the ftrace
function is not set correctly. Otherwise, ftrace might do crazy things without
an explanation. The error value has been ignored so far.
Note that an error that happens during updating all the traced calls is handled
in ftrace_replace_code(). We print more details about the particular
failing address via ftrace_bug() there.
Link: http://lkml.kernel.org/r/1393258342-29978-3-git-send-email-pmladek@suse.cz
Signed-off-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the data parameter is not really used by any ftrace_dyn_arch_init,
remove that from ftrace_dyn_arch_init. This also removes the addr
local variable from ftrace_init which is now unused.
Note the documentation was imprecise as it did not suggest to set
(*data) to 0.
Link: http://lkml.kernel.org/r/1393268401-24379-4-git-send-email-jslaby@suse.cz
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
No architecture uses the "data" parameter in ftrace_dyn_arch_init() in any
way, it just sets the value to 0. And this is used as a return value
in the caller -- ftrace_init, which just checks the retval against
zero.
Note there is also "return 0" in every ftrace_dyn_arch_init. So it is
enough to check the retval and remove all the indirect sets of data on
all archs.
Link: http://lkml.kernel.org/r/1393268401-24379-3-git-send-email-jslaby@suse.cz
Cc: linux-arch@vger.kernel.org
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function used to do allocations some time ago. This no longer
happens and it only checks the count and prints some info. This patch
inlines the body to the only caller. There are two reasons:
* the name of the function was misleading
* it's clear what is going on in ftrace_init now
Link: http://lkml.kernel.org/r/1393268401-24379-2-git-send-email-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some of them can be local to functions, so make them local and pass
them as parameters where needed:
* __start_mcount_loc+__stop_mcount_loc are local to ftrace_init
* ftrace_new_pgs -> new_pgs/start_pg
* ftrace_update_cnt -> local update_cnt in ftrace_update_code
Link: http://lkml.kernel.org/r/1393268401-24379-1-git-send-email-jslaby@suse.cz
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions that assign the contents for the ftrace events are
defined by the TRACE_EVENT() macros. Each event has its own unique
way to assign data to its buffer. When you have over 500 events,
that means there's 500 functions assigning data uniquely for each
event (not really that many, as DECLARE_EVENT_CLASS() and multiple
DEFINE_EVENT()s will only need a single function).
By making helper functions in the core kernel to do some of the work
instead, we can shrink the size of the kernel down a bit.
With a kernel configured with 502 events, the change in size was:
text data bss dec hex filename
12987390 1913504 9785344 24686238 178ae9e /tmp/vmlinux
12959102 1913504 9785344 24657950 178401e /tmp/vmlinux.patched
That's a total of 28288 bytes, which comes down to 56 bytes per event.
Link: http://lkml.kernel.org/r/20120810034708.370808175@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code that shows array fields for events is defined for all events.
This can add up quite a bit when you have over 500 events.
By making helper functions in the core kernel to do the work
instead, we can shrink the size of the kernel down a bit.
With a kernel configured with 502 events, the change in size was:
text data bss dec hex filename
12990946 1913568 9785344 24689858 178bcc2 /tmp/vmlinux
12987390 1913504 9785344 24686238 178ae9e /tmp/vmlinux.patched
That's a total of 3556 bytes, which comes down to 7 bytes per event.
Although it's not much, this code is just called at initialization of
the events.
Link: http://lkml.kernel.org/r/20120810034708.084036335@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code for trace events to format the raw recorded event data
into human readable format in the 'trace' file is repeated for every
event in the system. When you have over 500 events, this can add up
quite a bit.
By making helper functions in the core kernel to do the work
instead, we can shrink the size of the kernel down a bit.
With a kernel configured with 502 events, the change in size was:
text data bss dec hex filename
12991007 1913568 9785344 24689919 178bcff /tmp/vmlinux.orig
12990946 1913568 9785344 24689858 178bcc2 /tmp/vmlinux.patched
Note, this version does not save as much as the version of this patch
I had a few years ago. That is because in the mean time, commit
f71130de5c ("tracing: Add a helper function for event print functions")
did a lot of the work my original patch did. But this change helps
slightly, and is part of a larger clean up to reduce the size much further.
Link: http://lkml.kernel.org/r/20120810034707.378538034@goodmis.org
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:
C R 232 + 240 [0]
C R 240 + 232 [0]
C R 248 + 224 [0]
C R 256 + 216 [0]
but should be:
C R 232 + 8 [0]
C R 240 + 8 [0]
C R 248 + 8 [0]
C R 256 + 8 [0]
Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.
This patch takes into account real completion size of the request and
fixes wrong completion accounting.
Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
If a module fails to add its tracepoints due to module tainting, do not
create the module event infrastructure in the debugfs directory. As the events
will not work and worse yet, they will silently fail, making the user wonder
why the events they enable do not display anything.
Having a warning on module load and the events not visible to the users
will make the cause of the problem much clearer.
Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org
Fixes: 6d723736e4 "tracing/events: add support for modules to TRACE_EVENT"
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org # 2.6.31+
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use MAX_NICE instead of the value 19 for ring_buffer_benchmark.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1393251121-25534-1-git-send-email-yangds.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ENABLED flag needs to be cleared when a ftrace_ops is unregistered
otherwise it wont be able to be registered again.
This is only for static tracing and does not affect DYNAMIC_FTRACE at
all.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Being able to change the trace clock at boot can be advantageous if
you need a better source of when things happen across CPUs. The default
trace clock is the fastest, but it uses local clocks which may not be
synced across CPUs and it does not let you know when events took place
with respect to events on other CPUs.
The global trace clock can help in this case, and if you do not care
about timings, the counter "clock" is the best, as that is just a simple
atomic counter that is incremented for every event.
Usage is to add "trace_clock=counter" on the kernel command line. You
can replace counter with "global" or any of the clocks listed in
/sys/kernel/debug/tracing/trace_clock
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Appreciated-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems there's no reason to prevent mixed used of ftrace and perf
for a single uprobe event. At least the kprobes already support it.
Link: http://lkml.kernel.org/r/1389946120-19610-6-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add support for event triggering to uprobes. This is same as kprobes
support added by Tom (plus cleanup by Steven).
Link: http://lkml.kernel.org/r/1389946120-19610-5-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Support multi-buffer on uprobe-based dynamic events by
using ftrace_event_file.
This patch is based kprobe-based dynamic events multibuffer
support work initially, commited by Masami(commit 41a7dd420c),
but revised as below:
Oleg changed the kprobe-based multibuffer design from
array-pointers of ftrace_event_file into simple list,
so this patch also change to the list design.
rcu_read_lock/unlock added into uprobe_trace_func/uretprobe_trace_func,
to synchronize with ftrace_event_file list add and delete.
Even though we allow multi-uprobes instances now,
but TP_FLAG_PROFILE/TP_FLAG_TRACE are still mutually exclusive
in probe_event_enable currently, this means we cannot allow
one user is using uprobe-tracer, and another user is using
perf-probe on same uprobe concurrently.
(Perhaps this will be fix in future, kprobe don't have this
limitation now)
Link: http://lkml.kernel.org/r/1389946120-19610-4-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A single uprobe event might serve different users like ftrace and
perf. And this is especially important for upcoming multi buffer
support. But in this case it'll fetch (same) data from userspace
multiple times. So move it to the beginning of the dispatcher
function and reuse it for each users.
Link: http://lkml.kernel.org/r/1389946120-19610-3-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The uprobe_{trace,perf}_print functions are misnomers since what they
do is not printing. There's also a real print function named
print_uprobe_event() so they'll only increase confusion IMHO.
Rename them with double underscores to follow convention of kprobe.
Link: http://lkml.kernel.org/r/1389946120-19610-2-git-send-email-namhyung@kernel.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Create a "set_ftrace_filter" and "set_ftrace_notrace" files in the instance
directories to let users filter of functions to trace for the given instance.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In preparation for having the function tracing instances be able to
filter on functions, the generic filter functions must first be
converted to take in the global_ops as a parameter.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow instances (sub-buffers) to enable function tracing.
Each instance will have its own function tracing capability.
For now, instances will not have function stack tracing, or will
they be able to pick and choose what functions they can trace.
Picking and choosing their own functions will come later.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As tracers will soon be used by instances, the tracer enabled field
needs to be converted to a counter instead of a boolean.
This counter is protected by the trace_types_lock mutex.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When an instance is about to be deleted, make sure the tracer
is set to nop. If it isn't reset the tracer and set it to the nop
tracer, otherwise memory leaks and bad pointers may result.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If global_ops function is being called directly, instead of the global_ops
list function, set the global_ops private to be the same as the ops private
that's being called directly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the tracers (function, function_graph, irqsoff, etc) can only
be used by the top level tracing directory (not for instances).
This sets up the infrastructure to allow instances to be able to
run a separate tracer apart from the what the top level tracing is
doing.
As tracers need to adapt for being used by instances, the tracers
must flag if they can be used by instances or not. Currently only the
'nop' tracer can be used by all instances.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As options (flags) may affect instances instead of being global
the flag_changed() callbacks need to receive the trace_array descriptor
of the instance they will be modifying.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As options (flags) may affect instances instead of being global
the set_flag() callbacks need to receive the trace_array descriptor
of the instance they will be modifying.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Each sub-buffer (buffer page) has a full 64 bit timestamp. The events on
that page use a 27 bit delta against that timestamp in order to save on
bits written to the ring buffer. If the time between events is larger than
what the 27 bits can hold, a "time extend" event is added to hold the
entire 64 bit timestamp again and the events after that hold a delta from
that timestamp.
As a "time extend" is always paired with an event, it is logical to just
allocate the event with the time extend, to make things a bit more efficient.
Unfortunately, when the pairing code was written, it removed the "delta = 0"
from the first commit on a page, causing the events on the page to be
slightly skewed.
Fixes: 69d1b839f7 "ring-buffer: Bind time extend and data events together"
Cc: stable@vger.kernel.org # 2.6.37+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
the new features added to 3.14.
The third patch is a minor bugfix to the trace_puts() functions that
will crash the system if a developer adds one before the tracing system
is setup. It also affects trace_printk() if it has no arguments, as
the code will convert it to a trace_puts() as well. Note, this bug
will not affect unmodified kernels, as trace_printk() and trace_puts()
should only be used by developers for testing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJS5oAkAAoJEKQekfcNnQGuOfkH/1ScwoRslERCCjhPEZu958bG
VugNvliB/uDwq3qkmBcncMHwqkFBgJay9ieah+JFeK6x3G6mO/uf+UhN5cmLzVBh
vA5dKR2zW6WTxKYSvXYapD7OzC7R+V6CLvpMl5WMZ1t3ESQhR6gUrpPdigecBWs9
015rMVA2xXjNnHNDM1nem1PhnMba78A/N98lFErYGpvVBjkCpB8mSt8adds9bYp8
a83P6tGUfXvsYO3sOqqBnOOqcfzCjjJbmr94v/F+SmLyuxuV0tVzBkIwXkKTNhEK
bQZj9Pfe+1oIkldXxstn5jWaOvI6RlAGW4b4qXt30AgrGRSyEo5xpvfLfyNUif4=
=N3kq
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"The first two patches fix the debugfs README file to reflect better
the new features added to 3.14.
The third patch is a minor bugfix to the trace_puts() functions that
will crash the system if a developer adds one before the tracing
system is setup. It also affects trace_printk() if it has no
arguments, as the code will convert it to a trace_puts() as well.
Note, this bug will not affect unmodified kernels, as trace_printk()
and trace_puts() should only be used by developers for testing"
* tag 'trace-fixes-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Check if tracing is enabled in trace_puts()
tracing: Fix formatting of trace README file
tracing/README: Add event file usage to tracing mini-HOWTO
If trace_puts() is used very early in boot up, it can crash the machine
if it is called before the ring buffer is allocated. If a trace_printk()
is used with no arguments, then it will be converted into a trace_puts()
and suffer the same fate.
Cc: stable@vger.kernel.org # 3.10+
Fixes: 09ae72348e "tracing: Add trace_puts() for even faster trace_printk() tracing"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix the formatting of the README file in the trace debugfs to fit in
an 80 character window.
Also add a comment about the event trigger counter with regards to
traceon and traceoff.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It would be useful to have a cheat-sheet for everything under
tracing/events/ alongside the existing text describing the other files
in the tracing/ dir.
Add short descriptions of the directories and files under events/
along with examples, similar to the existing text for the other files
in tracing/.
Also clean up a few minor alignment problems noticed when adding the
new text.
Link: http://lkml.kernel.org/r/1389993104.3040.445.camel@empanada
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
triggers by Tom Zanussi. A trigger is a way to enable an action when an
event is hit. The actions are:
o trace on/off - enable or disable tracing
o snapshot - save the current trace buffer in the snapshot
o stacktrace - dump the current stack trace to the ringbuffer
o enable/disable events - enable or disable another event
Namhyung Kim added updates to the tracing uprobes code. Having the
uprobes add support for fetch methods.
The rest are various bug fixes with the new code, and minor ones for
the old code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJS3Z9fAAoJEKQekfcNnQGuFf0H/0CteaN+BJjpif6Tnxia15Sp
pcftzU0lgqfNzsfitmbjiVTgXWqCghoZo8UI9tQZvBZ9wmDIxeXQR73uoBgVlSCQ
ovyBO/R8r+lq+7EsDCwntZvrLbcdn6s/jzoruRvt7r35ghK5pH81DNR1BOzTQBhW
x+361Xtc13aok7N7JN8KR96VDUP9f8KU6PWqJ5lgS2Zl+wbVw6b0p8OV8IMCHczP
MdYrx8y4Jv4QWW7rMShAAVBe9qJQ56JWiWA17ysa4kY8BkKQ7QtlEFr+r1YY0nX5
67brXiL8u0NFzRx5y2VRpGc25BbImnVBFpoLQ5Itluq9OdZE3aOQubzXlY70R6g=
=Hkho
-----END PGP SIGNATURE-----
Merge tag 'trace-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This pull request has a new feature to ftrace, namely the trace event
triggers by Tom Zanussi. A trigger is a way to enable an action when
an event is hit. The actions are:
o trace on/off - enable or disable tracing
o snapshot - save the current trace buffer in the snapshot
o stacktrace - dump the current stack trace to the ringbuffer
o enable/disable events - enable or disable another event
Namhyung Kim added updates to the tracing uprobes code. Having the
uprobes add support for fetch methods.
The rest are various bug fixes with the new code, and minor ones for
the old code"
* tag 'trace-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (38 commits)
tracing: Fix buggered tee(2) on tracing_pipe
tracing: Have trace buffer point back to trace_array
ftrace: Fix synchronization location disabling and freeing ftrace_ops
ftrace: Have function graph only trace based on global_ops filters
ftrace: Synchronize setting function_trace_op with ftrace_trace_function
tracing: Show available event triggers when no trigger is set
tracing: Consolidate event trigger code
tracing: Fix counter for traceon/off event triggers
tracing: Remove double-underscore naming in syscall trigger invocations
tracing/kprobes: Add trace event trigger invocations
tracing/probes: Fix build break on !CONFIG_KPROBE_EVENT
tracing/uprobes: Add @+file_offset fetch method
uprobes: Allocate ->utask before handler_chain() for tracing handlers
tracing/uprobes: Add support for full argument access methods
tracing/uprobes: Fetch args before reserving a ring buffer
tracing/uprobes: Pass 'is_return' to traceprobe_parse_probe_arg()
tracing/probes: Implement 'memory' fetch method for uprobes
tracing/probes: Add fetch{,_size} member into deref fetch method
tracing/probes: Move 'symbol' fetch method to kprobes
tracing/probes: Implement 'stack' fetch method for uprobes
...
In kernel/trace/trace.c we have this:
static void tracing_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
__free_page(buf->page);
}
static const struct pipe_buf_operations tracing_pipe_buf_ops = {
.can_merge = 0,
.map = generic_pipe_buf_map,
.unmap = generic_pipe_buf_unmap,
.confirm = generic_pipe_buf_confirm,
.release = tracing_pipe_buf_release,
.steal = generic_pipe_buf_steal,
.get = generic_pipe_buf_get,
};
with
void generic_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf)
{
page_cache_get(buf->page);
}
and I don't see anything that would've prevented tee(2) called on the pipe
that got stuff spliced into it from that sucker. ->ops->get() will be
called, then buf gets copied into target pipe's ->bufs[] and eventually
readers get to both copies of the buffer. With
get_page(page)
look at that page
__free_page(page)
look at that page
__free_page(page)
which is not a good thing, to put it mildly. AFAICS, that ought to use
the normal generic_pipe_buf_release() (aka page_cache_release(buf->page)),
shouldn't it?
[
SDR - As trace_pipe just allocates the page with alloc_page(GFP_KERNEL),
and doesn't do anything special with it (no LRU logic). The __free_page()
should be fine, as it wont actually free a page with reference count.
Maybe there's a chance to leak memory? Anyway, This change is at a minimum
good for being symmetric with generic_pipe_buf_get, it is fine to add.
]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[ SDR - Removed no longer used tracing_pipe_buf_release ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace buffer has a descriptor pointer that goes back to the trace
array. But it was never assigned. Luckily, nothing uses it (yet), but
it will in the future.
Although nothing currently uses this, if any of the new features get
backported to older kernels, and because this is such a simple change,
I'm marking it for stable too.
Cc: stable@vger.kernel.org # v3.10+
Fixes: 12883efb67 "tracing: Consolidate max_tr into main trace_array structure"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The synchronization needed after ftrace_ops are unregistered must happen
after the callback is disabled from becing called by functions.
The current location happens after the function is being removed from the
internal lists, but not after the function callbacks were disabled, leaving
the functions susceptible of being called after their callbacks are freed.
This affects perf and any externel users of function tracing (LTTng and
SystemTap).
Cc: stable@vger.kernel.org # 3.0+
Fixes: cdbe61bfe7 "ftrace: Allow dynamically allocated function tracers"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Doing some different tests, I discovered that function graph tracing, when
filtered via the set_ftrace_filter and set_ftrace_notrace files, does
not always keep with them if another function ftrace_ops is registered
to trace functions.
The reason is that function graph just happens to trace all functions
that the function tracer enables. When there was only one user of
function tracing, the function graph tracer did not need to worry about
being called by functions that it did not want to trace. But now that there
are other users, this becomes a problem.
For example, one just needs to do the following:
# cd /sys/kernel/debug/tracing
# echo schedule > set_ftrace_filter
# echo function_graph > current_tracer
# cat trace
[..]
0) | schedule() {
------------------------------------------
0) <idle>-0 => rcu_pre-7
------------------------------------------
0) ! 2980.314 us | }
0) | schedule() {
------------------------------------------
0) rcu_pre-7 => <idle>-0
------------------------------------------
0) + 20.701 us | }
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
# cat trace
[..]
1) + 20.825 us | }
1) + 21.651 us | }
1) + 30.924 us | } /* SyS_ioctl */
1) | do_page_fault() {
1) | __do_page_fault() {
1) 0.274 us | down_read_trylock();
1) 0.098 us | find_vma();
1) | handle_mm_fault() {
1) | _raw_spin_lock() {
1) 0.102 us | preempt_count_add();
1) 0.097 us | do_raw_spin_lock();
1) 2.173 us | }
1) | do_wp_page() {
1) 0.079 us | vm_normal_page();
1) 0.086 us | reuse_swap_page();
1) 0.076 us | page_move_anon_rmap();
1) | unlock_page() {
1) 0.082 us | page_waitqueue();
1) 0.086 us | __wake_up_bit();
1) 1.801 us | }
1) 0.075 us | ptep_set_access_flags();
1) | _raw_spin_unlock() {
1) 0.098 us | do_raw_spin_unlock();
1) 0.105 us | preempt_count_sub();
1) 1.884 us | }
1) 9.149 us | }
1) + 13.083 us | }
1) 0.146 us | up_read();
When the stack tracer was enabled, it enabled all functions to be traced, which
now the function graph tracer also traces. This is a side effect that should
not occur.
To fix this a test is added when the function tracing is changed, as well as when
the graph tracer is enabled, to see if anything other than the ftrace global_ops
function tracer is enabled. If so, then the graph tracer calls a test trampoline
that will look at the function that is being traced and compare it with the
filters defined by the global_ops.
As an optimization, if there's no other function tracers registered, or if
the only registered function tracers also use the global ops, the function
graph infrastructure will call the registered function graph callback directly
and not go through the test trampoline.
Cc: stable@vger.kernel.org # 3.3+
Fixes: d2d45c7a03 "tracing: Have stack_tracer use a separate list of functions"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some method to deal with rt-mutexes and make sched_dl interact with
the current PI-coded is needed, raising all but trivial issues, that
needs (according to us) to be solved with some restructuring of
the pi-code (i.e., going toward a proxy execution-ish implementation).
This is under development, in the meanwhile, as a temporary solution,
what this commits does is:
- ensure a pi-lock owner with waiters is never throttled down. Instead,
when it runs out of runtime, it immediately gets replenished and it's
deadline is postponed;
- the scheduling parameters (relative deadline and default runtime)
used for that replenishments --during the whole period it holds the
pi-lock-- are the ones of the waiting task with earliest deadline.
Acting this way, we provide some kind of boosting to the lock-owner,
still by using the existing (actually, slightly modified by the previous
commit) pi-architecture.
We would stress the fact that this is only a surely needed, all but
clean solution to the problem. In the end it's only a way to re-start
discussion within the community. So, as always, comments, ideas, rants,
etc.. are welcome! :-)
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Added !RT_MUTEXES build fix. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is very likely that systems that wants/needs to use the new
SCHED_DEADLINE policy also want to have the scheduling latency of
the -deadline tasks under control.
For this reason a new version of the scheduling wakeup latency,
called "wakeup_dl", is introduced.
As a consequence of applying this patch there will be three wakeup
latency tracer:
* "wakeup", that deals with all tasks in the system;
* "wakeup_rt", that deals with -rt and -deadline tasks only;
* "wakeup_dl", that deals with -deadline tasks only.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-9-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ftrace_trace_function is a variable that holds what function will be called
directly by the assembly code (mcount). If just a single function is
registered and it handles recursion itself, then the assembly will call that
function directly without any helper function. It also passes in the
ftrace_op that was registered with the callback. The ftrace_op to send is
stored in the function_trace_op variable.
The ftrace_trace_function and function_trace_op needs to be coordinated such
that the called callback wont be called with the wrong ftrace_op, otherwise
bad things can happen if it expected a different op. Luckily, there's no
callback that doesn't use the helper functions that requires this. But
there soon will be and this needs to be fixed.
Use a set_function_trace_op to store the ftrace_op to set the
function_trace_op to when it is safe to do so (during the update function
within the breakpoint or stop machine calls). Or if dynamic ftrace is not
being used (static tracing) then we have to do a bit more synchronization
when the ftrace_trace_function is set as that takes affect immediately
(as oppose to dynamic ftrace doing it with the modification of the trampoline).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently there's no way to know what triggers exist on a kernel without
looking at the source of the kernel or randomly trying out triggers.
Instead of creating another file in the debugfs system, simply show
what available triggers are there when cat'ing the trigger file when
it has no events:
[root /sys/kernel/debug/tracing]# cat events/sched/sched_switch/trigger
# Available triggers:
# traceon traceoff snapshot stacktrace enable_event disable_event
This stays consistent with other debugfs files where meta data like
this is always proceeded with a '#' at the start of the line so that
tools can strip these out.
Link: http://lkml.kernel.org/r/20140107103548.0a84536d@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The event trigger code that checks for callback triggers before and
after recording of an event has lots of flags checks. This code is
duplicated throughout the ftrace events, kprobes and system calls.
They all do the exact same checks against the event flags.
Added helper functions ftrace_trigger_soft_disabled(),
event_trigger_unlock_commit() and event_trigger_unlock_commit_regs()
that consolidated the code and these are used instead.
Link: http://lkml.kernel.org/r/20140106222703.5e7dbba2@gandalf.local.home
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The counters for the traceon and traceoff are only suppose to decrement
when the trigger enables or disables tracing. It is not suppose to decrement
every time the event is hit.
Only decrement the counter if the trigger actually did something.
Link: http://lkml.kernel.org/r/20140106223124.0e5fd0b4@gandalf.local.home
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's no reason to use double-underscores for any variable name in
ftrace_syscall_enter()/exit(), since those functions aren't generated
and there's no need to avoid namespace collisions as with the event
macros, which is where the original invocation code came from.
Link: http://lkml.kernel.org/r/0b489c9d1f7ee315cff60fa0e4c2b433ade8ae0d.1389036657.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add code to the kprobe/kretprobe event functions that will invoke any
event triggers associated with a probe's ftrace_event_file.
The code to do this is very similar to the invocation code already
used to invoke the triggers associated with static events and
essentially replaces the existing soft-disable checks with a superset
that preserves the original behavior but adds the bits needed to
support event triggers.
Link: http://lkml.kernel.org/r/f2d49f157b608070045fdb26c9564d5a05a5a7d0.1389036657.git.tom.zanussi@linux.intel.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When kprobe-based dynamic event tracer is not enabled, it caused
following build error:
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c8dd): undefined reference to `fetch_symbol_u8'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c8e9): undefined reference to `fetch_symbol_u16'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c8f5): undefined reference to `fetch_symbol_u32'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c901): undefined reference to `fetch_symbol_u64'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c909): undefined reference to `fetch_symbol_string'
kernel/built-in.o: In function `traceprobe_update_arg':
(.text+0x10c913): undefined reference to `fetch_symbol_string_size'
...
It was due to the fetch methods are referred from CHECK_FETCH_FUNCS
macro and since it was only defined in trace_kprobe.c. Move NULL
definition of such fetch functions to the header file.
Note, it also requires CONFIG_BRANCH_PROFILING enabled to trigger
this failure as well. This is because the "fetch_symbol_*" variables
are referenced in a "else if" statement that will only call
update_symbol_cache(), which is a static inline stub function
when CONFIG_KPROBE_EVENT is not enabled. gcc is smart enough
to optimize this "else if" out and that also removes the code that
references the undefined variables.
But when BRANCH_PROFILING is enabled, it fools gcc into keeping
the if statement around and thus references the undefined symbols
and fails to build.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Enable to fetch data from a file offset. Currently it only supports
fetching from same binary uprobe set. It'll translate the file offset
to a proper virtual address in the process.
The syntax is "@+OFFSET" as it does similar to normal memory fetching
(@ADDR) which does no address translation.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Enable to fetch other types of argument for the uprobes. IOW, we can
access stack, memory, deref, bitfield and retval from uprobes now.
The format for the argument types are same as kprobes (but @SYMBOL
type is not supported for uprobes), i.e:
@ADDR : Fetch memory at ADDR
$stackN : Fetch Nth entry of stack (N >= 0)
$stack : Fetch stack address
$retval : Fetch return value
+|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address
Note that the retval only can be used with uretprobes.
Original-patch-by: Hyeoncheol Lee <cheol.lee@lge.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Hyeoncheol Lee <cheol.lee@lge.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Fetching from user space should be done in a non-atomic context. So
use a per-cpu buffer and copy its content to the ring buffer
atomically. Note that we can migrate during accessing user memory
thus use a per-cpu mutex to protect concurrent accesses.
This is needed since we'll be able to fetch args from an user memory
which can be swapped out. Before that uprobes could fetch args from
registers only which saved in a kernel space.
While at it, use __get_data_size() and store_trace_args() to reduce
code duplication. And add struct uprobe_cpu_buffer and its helpers as
suggested by Oleg.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Currently uprobes don't pass is_return to the argument parser so that
it cannot make use of "$retval" fetch method since it only works for
return probes.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Use separate method to fetch from memory. Move existing functions to
trace_kprobe.c and make them static. Also add new memory fetch
implementation for uprobes.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The deref fetch methods access a memory region but it assumes that
it's a kernel memory since uprobes does not support them.
Add ->fetch and ->fetch_size member in order to provide a proper
access methods for supporting uprobes.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Hyeoncheol Lee <cheol.lee@lge.com>
[namhyung@kernel.org: Split original patch into pieces as requested]
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Move existing functions to trace_kprobe.c and add NULL entries to the
uprobes fetch type table. I don't make them static since some generic
routines like update/free_XXX_fetch_param() require pointers to the
functions.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Use separate method to fetch from stack. Move existing functions to
trace_kprobe.c and make them static. Also add new stack fetch
implementation for uprobes.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Use separate fetch_type_table for kprobes and uprobes. It currently
shares all fetch methods but some of them will be implemented
differently later.
This is not to break build if [ku]probes is configured alone (like
!CONFIG_KPROBE_EVENT and CONFIG_UPROBE_EVENT). So I added '__weak'
to the table declaration so that it can be safely omitted when it
configured out.
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Move fetch function helper macros/functions to the header file and
make them external. This is preparation of supporting uprobe fetch
table in next patch.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The set_print_fmt() functions are implemented almost same for
[ku]probes. Move it to a common place and get rid of the duplication.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The __get_data_size() and store_trace_args() will be used by uprobes
too. Move them to a common location.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Convert struct trace_uprobe to make use of the common trace_probe
structure.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
There are functions that can be shared to both of kprobes and uprobes.
Separate common data structure to struct trace_probe and use it from
the shared functions.
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The print format of s32 type was "ld" and it's casted to "long". So
it turned out to print 4294967295 for "-1" on 64-bit systems. Not
sure whether it worked well on 32-bit systems.
Anyway, it doesn't need to have cast argument at all since it already
casted using type pointer - just get rid of it. Thanks to Oleg for
pointing that out.
And print 0x prefix for unsigned type as it shows hex numbers.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The uprobe syntax requires an offset after a file path not a symbol.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The filter field of the event_trigger_data structure is protected under
RCU sched locks. It was not annotated as such, and after doing so,
sparse pointed out several locations that required fix ups.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Trace event triggers added a lseek that uses the ftrace_filter_lseek()
function. Unfortunately, when function tracing is not configured in
that function is not defined and the kernel fails to build.
This is the second time that function was added to a file ops and
it broke the build due to requiring special config dependencies.
Make a generic tracing_lseek() that all the tracing utilities may
use.
Also, modify the old ftrace_filter_lseek() to return 0 instead of
1 on WRONLY. Not sure why it was a 1 as that does not make sense.
This also changes the old tracing_seek() to modify the file pos
pointer on WRONLY as well.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA
189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m
LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7
k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an
eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ
YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU=
=/4/R
-----END PGP SIGNATURE-----
Merge tag 'v3.13-rc6' into for-3.14/core
Needed to bring blk-mq uptodate, since changes have been going in
since for-3.14/core was established.
Fixup merge issues related to the immutable biovec changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conflicts:
block/blk-flush.c
fs/btrfs/check-integrity.c
fs/btrfs/extent_io.c
fs/btrfs/scrub.c
fs/logfs/dev_bdev.c
Add a generic event_command.set_trigger_filter() op implementation and
have the current set of trigger commands use it - this essentially
gives them all support for filters.
Syntactically, filters are supported by adding 'if <filter>' just
after the command, in which case only events matching the filter will
invoke the trigger. For example, to add a filter to an
enable/disable_event command:
echo 'enable_event:system:event if common_pid == 999' > \
.../othersys/otherevent/trigger
The above command will only enable the system:event event if the
common_pid field in the othersys:otherevent event is 999.
As another example, to add a filter to a stacktrace command:
echo 'stacktrace if common_pid == 999' > \
.../somesys/someevent/trigger
The above command will only trigger a stacktrace if the common_pid
field in the event is 999.
The filter syntax is the same as that described in the 'Event
filtering' section of Documentation/trace/events.txt.
Because triggers can now use filters, the trigger-invoking logic needs
to be moved in those cases - e.g. for ftrace_raw_event_calls, if a
trigger has a filter associated with it, the trigger invocation now
needs to happen after the { assign; } part of the call, in order for
the trigger condition to be tested.
There's still a SOFT_DISABLED-only check at the top of e.g. the
ftrace_raw_events function, so when an event is soft disabled but not
because of the presence of a trigger, the original SOFT_DISABLED
behavior remains unchanged.
There's also a bit of trickiness in that some triggers need to avoid
being invoked while an event is currently in the process of being
logged, since the trigger may itself log data into the trace buffer.
Thus we make sure the current event is committed before invoking those
triggers. To do that, we split the trigger invocation in two - the
first part (event_triggers_call()) checks the filter using the current
trace record; if a command has the post_trigger flag set, it sets a
bit for itself in the return value, otherwise it directly invoks the
trigger. Once all commands have been either invoked or set their
return flag, event_triggers_call() returns. The current record is
then either committed or discarded; if any commands have deferred
their triggers, those commands are finally invoked following the close
of the current event by event_triggers_post_call().
To simplify the above and make it more efficient, the TRIGGER_COND bit
is introduced, which is set only if a soft-disabled trigger needs to
use the log record for filter testing or needs to wait until the
current log record is closed.
The syscall event invocation code is also changed in analogous ways.
Because event triggers need to be able to create and free filters,
this also adds a couple external wrappers for the existing
create_filter and free_filter functions, which are too generic to be
made extern functions themselves.
Link: http://lkml.kernel.org/r/7164930759d8719ef460357f143d995406e4eead.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that event triggers use ftrace_event_file(), it needs to be outside
the #ifdef CONFIG_DYNAMIC_FTRACE, as it can now be used when that is
not defined.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add 'enable_event' and 'disable_event' event_command commands.
enable_event and disable_event event triggers are added by the user
via these commands in a similar way and using practically the same
syntax as the analagous 'enable_event' and 'disable_event' ftrace
function commands, but instead of writing to the set_ftrace_filter
file, the enable_event and disable_event triggers are written to the
per-event 'trigger' files:
echo 'enable_event:system:event' > .../othersys/otherevent/trigger
echo 'disable_event:system:event' > .../othersys/otherevent/trigger
The above commands will enable or disable the 'system:event' trace
events whenever the othersys:otherevent events are hit.
This also adds a 'count' version that limits the number of times the
command will be invoked:
echo 'enable_event:system:event:N' > .../othersys/otherevent/trigger
echo 'disable_event:system:event:N' > .../othersys/otherevent/trigger
Where N is the number of times the command will be invoked.
The above commands will will enable or disable the 'system:event'
trace events whenever the othersys:otherevent events are hit, but only
N times.
This also makes the find_event_file() helper function extern, since
it's useful to use from other places, such as the event triggers code,
so make it accessible.
Link: http://lkml.kernel.org/r/f825f3048c3f6b026ee37ae5825f9fc373451828.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add 'stacktrace' event_command. stacktrace event triggers are added
by the user via this command in a similar way and using practically
the same syntax as the analogous 'stacktrace' ftrace function command,
but instead of writing to the set_ftrace_filter file, the stacktrace
event trigger is written to the per-event 'trigger' files:
echo 'stacktrace' > .../tracing/events/somesys/someevent/trigger
The above command will turn on stacktraces for someevent i.e. whenever
someevent is hit, a stacktrace will be logged.
This also adds a 'count' version that limits the number of times the
command will be invoked:
echo 'stacktrace:N' > .../tracing/events/somesys/someevent/trigger
Where N is the number of times the command will be invoked.
The above command will log N stacktraces for someevent i.e. whenever
someevent is hit N times, a stacktrace will be logged.
Link: http://lkml.kernel.org/r/0c30c008a0828c660aa0e1bbd3255cf179ed5c30.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add 'snapshot' event_command. snapshot event triggers are added by
the user via this command in a similar way and using practically the
same syntax as the analogous 'snapshot' ftrace function command, but
instead of writing to the set_ftrace_filter file, the snapshot event
trigger is written to the per-event 'trigger' files:
echo 'snapshot' > .../somesys/someevent/trigger
The above command will turn on snapshots for someevent i.e. whenever
someevent is hit, a snapshot will be done.
This also adds a 'count' version that limits the number of times the
command will be invoked:
echo 'snapshot:N' > .../somesys/someevent/trigger
Where N is the number of times the command will be invoked.
The above command will snapshot N times for someevent i.e. whenever
someevent is hit N times, a snapshot will be done.
Also adds a new tracing_alloc_snapshot() function - the existing
tracing_snapshot_alloc() function is a special version of
tracing_snapshot() that also does the snapshot allocation - the
snapshot triggers would like to be able to do just the allocation but
not take a snapshot; the existing tracing_snapshot_alloc() in turn now
also calls tracing_alloc_snapshot() underneath to do that allocation.
Link: http://lkml.kernel.org/r/c9524dd07ce01f9dcbd59011290e0a8d5b47d7ad.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
[ fix up from kbuild test robot <fengguang.wu@intel.com report ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add 'traceon' and 'traceoff' event_command commands. traceon and
traceoff event triggers are added by the user via these commands in a
similar way and using practically the same syntax as the analagous
'traceon' and 'traceoff' ftrace function commands, but instead of
writing to the set_ftrace_filter file, the traceon and traceoff
triggers are written to the per-event 'trigger' files:
echo 'traceon' > .../tracing/events/somesys/someevent/trigger
echo 'traceoff' > .../tracing/events/somesys/someevent/trigger
The above command will turn tracing on or off whenever someevent is
hit.
This also adds a 'count' version that limits the number of times the
command will be invoked:
echo 'traceon:N' > .../tracing/events/somesys/someevent/trigger
echo 'traceoff:N' > .../tracing/events/somesys/someevent/trigger
Where N is the number of times the command will be invoked.
The above commands will will turn tracing on or off whenever someevent
is hit, but only N times.
Some common register/unregister_trigger() implementations of the
event_command reg()/unreg() callbacks are also provided, which add and
remove trigger instances to the per-event list of triggers, and
arm/disarm them as appropriate. event_trigger_callback() is a
general-purpose event_command func() implementation that orchestrates
command parsing and registration for most normal commands.
Most event commands will use these, but some will override and
possibly reuse them.
The event_trigger_init(), event_trigger_free(), and
event_trigger_print() functions are meant to be common implementations
of the event_trigger_ops init(), free(), and print() ops,
respectively.
Most trigger_ops implementations will use these, but some will
override and possibly reuse them.
Link: http://lkml.kernel.org/r/00a52816703b98d2072947478dd6e2d70cde5197.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a 'trigger' file for each trace event, enabling 'trace event
triggers' to be set for trace events.
'trace event triggers' are patterned after the existing 'ftrace
function triggers' implementation except that triggers are written to
per-event 'trigger' files instead of to a single file such as the
'set_ftrace_filter' used for ftrace function triggers.
The implementation is meant to be entirely separate from ftrace
function triggers, in order to keep the respective implementations
relatively simple and to allow them to diverge.
The event trigger functionality is built on top of SOFT_DISABLE
functionality. It adds a TRIGGER_MODE bit to the ftrace_event_file
flags which is checked when any trace event fires. Triggers set for a
particular event need to be checked regardless of whether that event
is actually enabled or not - getting an event to fire even if it's not
enabled is what's already implemented by SOFT_DISABLE mode, so trigger
mode directly reuses that. Event trigger essentially inherit the soft
disable logic in __ftrace_event_enable_disable() while adding a bit of
logic and trigger reference counting via tm_ref on top of that in a
new trace_event_trigger_enable_disable() function. Because the base
__ftrace_event_enable_disable() code now needs to be invoked from
outside trace_events.c, a wrapper is also added for those usages.
The triggers for an event are actually invoked via a new function,
event_triggers_call(), and code is also added to invoke them for
ftrace_raw_event calls as well as syscall events.
The main part of the patch creates a new trace_events_trigger.c file
to contain the trace event triggers implementation.
The standard open, read, and release file operations are implemented
here.
The open() implementation sets up for the various open modes of the
'trigger' file. It creates and attaches the trigger iterator and sets
up the command parser. If opened for reading set up the trigger
seq_ops.
The read() implementation parses the event trigger written to the
'trigger' file, looks up the trigger command, and passes it along to
that event_command's func() implementation for command-specific
processing.
The release() implementation does whatever cleanup is needed to
release the 'trigger' file, like releasing the parser and trigger
iterator, etc.
A couple of functions for event command registration and
unregistration are added, along with a list to add them to and a mutex
to protect them, as well as an (initially empty) registration function
to add the set of commands that will be added by future commits, and
call to it from the trace event initialization code.
also added are a couple trigger-specific data structures needed for
these implementations such as a trigger iterator and a struct for
trigger-specific data.
A couple structs consisting mostly of function meant to be implemented
in command-specific ways, event_command and event_trigger_ops, are
used by the generic event trigger command implementations. They're
being put into trace.h alongside the other trace_event data structures
and functions, in the expectation that they'll be needed in several
trace_event-related files such as trace_events_trigger.c and
trace_events.c.
The event_command.func() function is meant to be called by the trigger
parsing code in order to add a trigger instance to the corresponding
event. It essentially coordinates adding a live trigger instance to
the event, and arming the triggering the event.
Every event_command func() implementation essentially does the
same thing for any command:
- choose ops - use the value of param to choose either a number or
count version of event_trigger_ops specific to the command
- do the register or unregister of those ops
- associate a filter, if specified, with the triggering event
The reg() and unreg() ops allow command-specific implementations for
event_trigger_op registration and unregistration, and the
get_trigger_ops() op allows command-specific event_trigger_ops
selection to be parameterized. When a trigger instance is added, the
reg() op essentially adds that trigger to the triggering event and
arms it, while unreg() does the opposite. The set_filter() function
is used to associate a filter with the trigger - if the command
doesn't specify a set_filter() implementation, the command will ignore
filters.
Each command has an associated trigger_type, which serves double duty,
both as a unique identifier for the command as well as a value that
can be used for setting a trigger mode bit during trigger invocation.
The signature of func() adds a pointer to the event_command struct,
used to invoke those functions, along with a command_data param that
can be passed to the reg/unreg functions. This allows func()
implementations to use command-specific blobs and supports code
re-use.
The event_trigger_ops.func() command corrsponds to the trigger 'probe'
function that gets called when the triggering event is actually
invoked. The other functions are used to list the trigger when
needed, along with a couple mundane book-keeping functions.
This also moves event_file_data() into trace.h so it can be used
outside of trace_events.c.
Link: http://lkml.kernel.org/r/316d95061accdee070aac8e5750afba0192fa5b9.1382622043.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Idea-by: Steve Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The problem is that the profiler only initializes the online
CPUs, and not possible CPUs. This causes issues if the user takes
CPUs online or offline while the profiler is running.
If we online a CPU after starting the profiler, we lose all the
trace information on the CPU going online.
If we offline a CPU after running a test and start a new test, it
will not clear the old data from that CPU.
This bug causes incorrect data to be reported to the user if they
online or offline CPUs during the profiling.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJSsNBHAAoJEKQekfcNnQGuKP8H/2mol/d7z2vANh7/FeNjTKIN
VkRzDEwUIwoaJBsL75EDDXBFx7w8jjAsXyoTrqrvMRV4UNcsfm46mohQTPAmK39y
muqodL1VnVXdKrUmtw/1nL7yDi2KltQH1UwOgvwXGuUFIq5cuCXNQxNK9/1fVVVn
tIMNz5kEAG3XCwnqP0PgQxWCuA7s+aQR0ijTf4vPf1G3IJujPyG9VhJWcGS3dJTR
t8TPyatd9D/S+7/r7iZ9hS8nWpaka3qJfhiWqk16SC9LiUXVA8oFOVMoN7n6Co5E
6r2dNo01WOABlojCxi1t3afUtcV1bUjBnVkiDva5cSc84pQSxe1qRrIpjTmHk00=
=MSZs
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"This fixes a long standing bug in the ftrace profiler. The problem is
that the profiler only initializes the online CPUs, and not possible
CPUs. This causes issues if the user takes CPUs online or offline
while the profiler is running.
If we online a CPU after starting the profiler, we lose all the trace
information on the CPU going online.
If we offline a CPU after running a test and start a new test, it will
not clear the old data from that CPU.
This bug causes incorrect data to be reported to the user if they
online or offline CPUs during the profiling"
* tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Initialize the ftrace profiler for each possible cpu
Ftrace currently initializes only the online CPUs. This implementation has
two problems:
- If we online a CPU after we enable the function profile, and then run the
test, we will lose the trace information on that CPU.
Steps to reproduce:
# echo 0 > /sys/devices/system/cpu/cpu1/online
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# echo 1 > /sys/devices/system/cpu/cpu1/online
# run test
- If we offline a CPU before we enable the function profile, we will not clear
the trace information when we enable the function profile. It will trouble
the users.
Steps to reproduce:
# cd <debugfs>/tracing/
# echo <some function name> >> set_ftrace_filter
# echo 1 > function_profile_enabled
# run test
# cat trace_stat/function*
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > function_profile_enabled
# echo 1 > function_profile_enabled
# cat trace_stat/function*
# run test
# cat trace_stat/function*
So it is better that we initialize the ftrace profiler for each possible cpu
every time we enable the function profile instead of just the online ones.
Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com
Cc: stable@vger.kernel.org # 2.6.31+
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
all events. This was prevalent when FTRACE_SELFTEST was enabled which
enables all events several times, and caused the system bootup to
pause for over a minute.
This was tracked down to an addition of a synchronize_sched() performed
when system call tracepoints are unregistered.
The synchronize_sched() is needed between the unregistering of the
system call tracepoint and a deletion of a tracing instance buffer.
But placing the synchronize_sched() in the unreg of *every* system call
tracepoint is a bit overboard. A single synchronize_sched() before
the deletion of the instance is sufficient.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
iQEcBAABAgAGBQJSofYNAAoJEKQekfcNnQGuTUcIAJOx8745KlkFjN4VX+nCWNfP
xrrpnLBymbGMA9lQ4fXk+kdiuhH8DjRYdKq9fU4T481MFYKkToUIZH6NeaLI5fr1
0nBPPjVyAlJ+yt9JbOAYa1jEYnAr27ORDHEtdQnqb6OJSky3oh9jCQi+toxmh2qX
Sv1tIYeAf3K2V/h5xt6uSl9oiZ6KBtwE3f+xkHWNizaU9i2rq2gxd77fSbPNTIps
wLdsESYziA2UeAm13eh8xXo1uqRbfvx7bPr59cu0+3AqdOoaXFG+JoE/MqmljIO5
HkyCKJVqP8HD+QieuEB+hw4zpSsl6A6iKSlaiHm8NMnRRocfM6qMKMQXQ28KhxA=
=sYm/
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"A regression showed up that there's a large delay when enabling all
events. This was prevalent when FTRACE_SELFTEST was enabled which
enables all events several times, and caused the system bootup to
pause for over a minute.
This was tracked down to an addition of a synchronize_sched()
performed when system call tracepoints are unregistered.
The synchronize_sched() is needed between the unregistering of the
system call tracepoint and a deletion of a tracing instance buffer.
But placing the synchronize_sched() in the unreg of *every* system
call tracepoint is a bit overboard. A single synchronize_sched()
before the deletion of the instance is sufficient"
* tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Only run synchronize_sched() at instance deletion time
It has been reported that boot up with FTRACE_SELFTEST enabled can take a
very long time. There can be stalls of over a minute.
This was tracked down to the synchronize_sched() called when a system call
event is disabled. As the self tests enable and disable thousands of events,
this makes the synchronize_sched() get called thousands of times.
The synchornize_sched() was added with d562aff93b "tracing: Add support
for SOFT_DISABLE to syscall events" which caused this regression (added
in 3.13-rc1).
The synchronize_sched() is to protect against the events being accessed
when a tracer instance is being deleted. When an instance is being deleted
all the events associated to it are unregistered. The synchronize_sched()
makes sure that no more users are running when it finishes.
Instead of calling synchronize_sched() for all syscall events, we only
need to call it once, after the events are unregistered and before the
instance is deleted. The event_mutex is held during this action to
prevent new users from enabling events.
Link: http://lkml.kernel.org/r/20131203124120.427b9661@gandalf.local.home
Reported-by: Petr Mladek <pmladek@suse.cz>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Petr Mladek <pmladek@suse.cz>
Tested-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"Misc kernel and tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools lib traceevent: Fix conversion of pointer to integer of different size
perf/trace: Properly use u64 to hold event_id
perf: Remove fragile swevent hlist optimization
ftrace, perf: Avoid infinite event generation loop
tools lib traceevent: Fix use of multiple options in processing field
perf header: Fix possible memory leaks in process_group_desc()
perf header: Fix bogus group name
perf tools: Tag thread comm as overriden
Commit 8c4f3c3fa9 "ftrace: Check module functions being traced on reload"
fixed module loading and unloading with respect to function tracing, but
it missed the function graph tracer. If you perform the following
# cd /sys/kernel/debug/tracing
# echo function_graph > current_tracer
# modprobe nfsd
# echo nop > current_tracer
You'll get the following oops message:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
Call Trace:
[<ffffffff814fe193>] dump_stack+0x4f/0x7c
[<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b
[<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
[<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9
[<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364
[<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b
[<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78
[<ffffffff810c4b30>] graph_trace_reset+0xe/0x10
[<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a
[<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd
[<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde
[<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e
[<ffffffff81122aed>] vfs_write+0xab/0xf6
[<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
[<ffffffff81122dbd>] SyS_write+0x59/0x82
[<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
[<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b
---[ end trace 940358030751eafb ]---
The above mentioned commit didn't go far enough. Well, it covered the
function tracer by adding checks in __register_ftrace_function(). The
problem is that the function graph tracer circumvents that (for a slight
efficiency gain when function graph trace is running with a function
tracer. The gain was not worth this).
The problem came with ftrace_startup() which should always be called after
__register_ftrace_function(), if you want this bug to be completely fixed.
Anyway, this solution moves __register_ftrace_function() inside of
ftrace_startup() and removes the need to call them both.
Reported-by: Dave Wysochanski <dwysocha@redhat.com>
Fixes: ed926f9b35 ("ftrace: Use counters to enable functions to trace")
Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The 64-bit attr.config value for perf trace events was being copied into
an "int" before doing a comparison, meaning the top 32 bits were
being truncated.
As far as I can tell this didn't cause any errors, but it did mean
it was possible to create valid aliases for all the tracepoint ids
which I don't think was intended. (For example, 0xffffffff00000018
and 0x18 both enable the same tracepoint).
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1311151236100.11932@vincent-weaver-1.um.maine.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince's perf-trinity fuzzer found yet another 'interesting' problem.
When we sample the irq_work_exit tracepoint with period==1 (or
PERF_SAMPLE_PERIOD) and we add an fasync SIGNAL handler we create an
infinite event generation loop:
,-> <IPI>
| irq_work_exit() ->
| trace_irq_work_exit() ->
| ...
| __perf_event_overflow() -> (due to fasync)
| irq_work_queue() -> (irq_work_list must be empty)
'--------- arch_irq_work_raise()
Similar things can happen due to regular poll() wakeups if we exceed
the ring-buffer wakeup watermark, or have an event_limit.
To avoid this, dis-allow sampling this particular tracepoint.
In order to achieve this, create a special perf_perm function pointer
for each event and call this (when set) on trying to create a
tracepoint perf event.
[ roasted: use expr... to allow for ',' in your expression ]
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20131114152304.GC5364@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The only real feature that was added this release is from Namhyung Kim,
who introduced "set_graph_notrace" filter that lets you run the function
graph tracer and not trace particular functions and their call chain.
Tom Zanussi added some updates to the ftrace multibuffer tracing that
made it more consistent with the top level tracing.
One of the fixes for perf function tracing required an API change in
RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
that change in this release too, he gave me a branch that included
all the changes to get that working, and I pulled that into my tree
in order to complete the perf function tracing fix.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJSgX5SAAoJEKQekfcNnQGulUAH/jORqJrKaNAulmZ314VsAqfa
zMtF5UAAPf7kqc3AN/jtFrhJUNEfxWOo7A4r0FsM/rKdWJF+98GA6aqYVD+XoWFt
+36fg1enxbXUjixQ96Uh+o1+BJUgYDqljuWzqSu/oiXWfWwl8+WL4kcbhb+V9WcF
SpdzLCWVZRfhyDiN3+0zvyQ8RSG2Pd7CWn9zroI0e4sxGo0Ki6JUnIcXtZGOBDOQ
IIZdjXvGSfpJ+3u3XvRPXJcltRCtOsVWxYzrmvRlmHDW5QMe1+WmmrlojTePrLaJ
xn8+3WINqetAR+ZQnazbpt1XzJzKa8QtFgpiN0kT6qL7cg3N1Owc4vLGohl7wok=
=Nesf
-----END PGP SIGNATURE-----
Merge tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing update from Steven Rostedt:
"This batch of changes is mostly clean ups and small bug fixes. The
only real feature that was added this release is from Namhyung Kim,
who introduced "set_graph_notrace" filter that lets you run the
function graph tracer and not trace particular functions and their
call chain.
Tom Zanussi added some updates to the ftrace multibuffer tracing that
made it more consistent with the top level tracing.
One of the fixes for perf function tracing required an API change in
RCU; the addition of "rcu_is_watching()". As Paul McKenney is pushing
that change in this release too, he gave me a branch that included all
the changes to get that working, and I pulled that into my tree in
order to complete the perf function tracing fix"
* tag 'trace-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add rcu annotation for syscall trace descriptors
tracing: Do not use signed enums with unsigned long long in fgragh output
tracing: Remove unused function ftrace_off_permanent()
tracing: Do not assign filp->private_data to freed memory
tracing: Add helper function tracing_is_disabled()
tracing: Open tracer when ftrace_dump_on_oops is used
tracing: Add support for SOFT_DISABLE to syscall events
tracing: Make register/unregister_ftrace_command __init
tracing: Update event filters for multibuffer
recordmcount.pl: Add support for __fentry__
ftrace: Have control op function callback only trace when RCU is watching
rcu: Do not trace rcu_is_watching() functions
ftrace/x86: skip over the breakpoint for ftrace caller
trace/trace_stat: use rbtree postorder iteration helper instead of opencoding
ftrace: Add set_graph_notrace filter
ftrace: Narrow down the protected area of graph_lock
ftrace: Introduce struct ftrace_graph_data
ftrace: Get rid of ftrace_graph_filter_enabled
tracing: Fix potential out-of-bounds in trace_get_user()
tracing: Show more exact help information about snapshot
Pull block IO core updates from Jens Axboe:
"This is the pull request for the core changes in the block layer for
3.13. It contains:
- The new blk-mq request interface.
This is a new and more scalable queueing model that marries the
best part of the request based interface we currently have (which
is fully featured, but scales poorly) and the bio based "interface"
which the new drivers for high IOPS devices end up using because
it's much faster than the request based one.
The bio interface has no block layer support, since it taps into
the stack much earlier. This means that drivers end up having to
implement a lot of functionality on their own, like tagging,
timeout handling, requeue, etc. The blk-mq interface provides all
these. Some drivers even provide a switch to select bio or rq and
has code to handle both, since things like merging only works in
the rq model and hence is faster for some workloads. This is a
huge mess. Conversion of these drivers nets us a substantial code
reduction. Initial results on converting SCSI to this model even
shows an 8x improvement on single queue devices. So while the
model was intended to work on the newer multiqueue devices, it has
substantial improvements for "classic" hardware as well. This code
has gone through extensive testing and development, it's now ready
to go. A pull request is coming to convert virtio-blk to this
model will be will be coming as well, with more drivers scheduled
for 3.14 conversion.
- Two blktrace fixes from Jan and Chen Gang.
- A plug merge fix from Alireza Haghdoost.
- Conversion of __get_cpu_var() from Christoph Lameter.
- Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.
- A fix for a race between request completion and the timeout
handling from Jeff Moyer. This is what caused the merge conflict
with blk-mq/core, in case you are looking at that.
- A dm stacking fix from Mike Snitzer.
- A code consolidation fix and duplicated code removal from Kent
Overstreet.
- A handful of block bug fixes from Mikulas Patocka, fixing a loop
crash and memory corruption on blk cg.
- Elevator switch bug fix from Tomoki Sekiyama.
A heads-up that I had to rebase this branch. Initially the immutable
bio_vecs had been queued up for inclusion, but a week later, it became
clear that it wasn't fully cooked yet. So the decision was made to
pull this out and postpone it until 3.14. It was a straight forward
rebase, just pruning out the immutable series and the later fixes of
problems with it. The rest of the patches applied directly and no
further changes were made"
* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: Do not call sector_div() with a 64-bit divisor
kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
block: Consolidate duplicated bio_trim() implementations
block: Use rw_copy_check_uvector()
block: Enable sysfs nomerge control for I/O requests in the plug list
block: properly stack underlying max_segment_size to DM device
elevator: acquire q->sysfs_lock in elevator_change()
elevator: Fix a race in elevator switching and md device initialization
block: Replace __get_cpu_var uses
bdi: test bdi_init failure
block: fix a probe argument to blk_register_region
loop: fix crash if blk_alloc_queue fails
blk-core: Fix memory corruption if blkcg_init_queue fails
block: fix race between request completion and timeout handling
blktrace: Send BLK_TN_PROCESS events to all running traces
blk-mq: don't disallow request merges for req->special being set
blk-mq: mq plug list breakage
blk-mq: fix for flush deadlock
...
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle are:
- (much) improved CONFIG_NUMA_BALANCING support from Mel Gorman, Rik
van Riel, Peter Zijlstra et al. Yay!
- optimize preemption counter handling: merge the NEED_RESCHED flag
into the preempt_count variable, by Peter Zijlstra.
- wait.h fixes and code reorganization from Peter Zijlstra
- cfs_bandwidth fixes from Ben Segall
- SMP load-balancer cleanups from Peter Zijstra
- idle balancer improvements from Jason Low
- other fixes and cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
ftrace, sched: Add TRACE_FLAG_PREEMPT_RESCHED
stop_machine: Fix race between stop_two_cpus() and stop_cpus()
sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
sched: Fix asymmetric scheduling for POWER7
sched: Move completion code from core.c to completion.c
sched: Move wait code from core.c to wait.c
sched: Move wait.c into kernel/sched/
sched/wait: Fix __wait_event_interruptible_lock_irq_timeout()
sched: Avoid throttle_cfs_rq() racing with period_timer stopping
sched: Guarantee new group-entities always have weight
sched: Fix hrtimer_cancel()/rq->lock deadlock
sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
sched: Fix race on toggling cfs_bandwidth_used
sched: Remove extra put_online_cpus() inside sched_setaffinity()
sched/rt: Fix task_tick_rt() comment
sched/wait: Fix build breakage
sched/wait: Introduce prepare_to_wait_event()
sched/wait: Add ___wait_cond_timeout() to wait_event*_timeout() too
sched: Remove get_online_cpus() usage
sched: Fix race in migrate_swap_stop()
...
sparse complains about the enter/exit_sysycall_files[] variables being
dereferenced with rcu_dereference_sched(). The fields need to be
annotated with __rcu.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the introduction of PREEMPT_NEED_RESCHED in:
f27dde8dee ("sched: Add NEED_RESCHED to the preempt_count")
we need to be able to look at both TIF_NEED_RESCHED and
PREEMPT_NEED_RESCHED to understand the full preemption behaviour.
Add it to the trace output.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/20131004152826.GP3081@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
do_blk_trace_setup() will fully initialize 'buts.name', so can remove
the related memcpy(). And also use BLKTRACE_BDEV_SIZE and ARRAY_SIZE
instead of hard code number '32'.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently each task sends BLK_TN_PROCESS event to the first traced
device it interacts with after a new trace is started. When there are
several traced devices and the task accesses more devices, this logic
can result in BLK_TN_PROCESS being sent several times to some devices
while it is never sent to other devices. Thus blkparse doesn't display
command name when parsing some blktrace files.
Fix the problem by sending BLK_TN_PROCESS event to all traced devices
when a task interacts with any of them.
Signed-off-by: Jan Kara <jack@suse.cz>
Review-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The duration field of print_graph_duration() can also be used
to do the space filling by passing an enum in it:
DURATION_FILL_FULL
DURATION_FILL_START
DURATION_FILL_END
The problem is that these are enums and defined as negative,
but the duration field is unsigned long long. Most archs are
fine with this but blackfin fails to compile because of it:
kernel/built-in.o: In function `print_graph_duration':
kernel/trace/trace_functions_graph.c:782: undefined reference to `__ucmpdi2'
Overloading a unsigned long long with an signed enum is just
bad in principle. We can accomplish the same thing by using
part of the flags field instead.
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the past, ftrace_off_permanent() was called if something
strange was detected. But the ftrace_bug() now handles all the
anomolies that can happen with ftrace (function tracing), and there
are no uses of ftrace_off_permanent(). Get rid of it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In system_tr_open(), the filp->private_data can be assigned the 'dir'
variable even if it was freed. This is on the error path, and is
harmless because the error return code will prevent filp->private_data
from being used. But for correctness, we should not assign it to
a recently freed variable, as that can cause static tools to give
false warnings.
Also have both subsystem_open() and system_tr_open() return -ENODEV
if tracing has been disabled.
Link: http://lkml.kernel.org/r/1383764571-7318-1-git-send-email-geyslan@gmail.com
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current default perf paranoid level is "1" which has
"perf_paranoid_kernel()" return false, and giving any operations that
use it, access to normal users. Unfortunately, this includes function
tracing and normal users should not be allowed to enable function
tracing by default.
The proper level is defined at "-1" (full perf access), which
"perf_paranoid_tracepoint_raw()" will only give access to. Use that
check instead for enabling function tracing.
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org # 3.4+
CVE: CVE-2013-2930
Fixes: ced39002f5 ("ftrace, perf: Add support to use function tracepoint in perf")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With ftrace_dump_on_oops, we previously did not open the tracer in
question, sometimes causing the trace output to be useless.
For example, the function_graph tracer with tracing_thresh set dumped via
ftrace_dump_on_oops would show a series of '}' indented at different levels,
but no function names.
call trace->open() (and do a few other fixups copied from the normal dump
path) to make the output more intelligible.
Link: http://lkml.kernel.org/r/1382554197-16961-1-git-send-email-cody@linux.vnet.ibm.com
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The original SOFT_DISABLE patches didn't add support for soft disable
of syscall events; this adds it.
Add an array of ftrace_event_file pointers indexed by syscall number
to the trace array and remove the existing enabled bitmaps, which as a
result are now redundant. The ftrace_event_file structs in turn
contain the soft disable flags we need for per-syscall soft disable
accounting.
Adding ftrace_event_files also means we can remove the USE_CALL_FILTER
bit, thus enabling multibuffer filter support for syscall events.
Link: http://lkml.kernel.org/r/6e72b566e85d8df8042f133efbc6c30e21fb017e.1382620672.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace event filters are still tied to event calls rather than
event files, which means you don't get what you'd expect when using
filters in the multibuffer case:
Before:
# echo 'bytes_alloc > 8192' > /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# mkdir /sys/kernel/debug/tracing/instances/test1
# echo 'bytes_alloc > 2048' > /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 2048
# cat /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
bytes_alloc > 2048
Setting the filter in tracing/instances/test1/events shouldn't affect
the same event in tracing/events as it does above.
After:
# echo 'bytes_alloc > 8192' > /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# mkdir /sys/kernel/debug/tracing/instances/test1
# echo 'bytes_alloc > 2048' > /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
# cat /sys/kernel/debug/tracing/events/kmem/kmalloc/filter
bytes_alloc > 8192
# cat /sys/kernel/debug/tracing/instances/test1/events/kmem/kmalloc/filter
bytes_alloc > 2048
We'd like to just move the filter directly from ftrace_event_call to
ftrace_event_file, but there are a couple cases that don't yet have
multibuffer support and therefore have to continue using the current
event_call-based filters. For those cases, a new USE_CALL_FILTER bit
is added to the event_call flags, whose main purpose is to keep the
old behavior for those cases until they can be updated with
multibuffer support; at that point, the USE_CALL_FILTER flag (and the
new associated call_filter_check_discard() function) can go away.
The multibuffer support also made filter_current_check_discard()
redundant, so this change removes that function as well and replaces
it with filter_check_discard() (or call_filter_check_discard() as
appropriate).
Link: http://lkml.kernel.org/r/f16e9ce4270c62f46b2e966119225e1c3cca7e60.1382620672.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Dave Jones reported that trinity would be able to trigger the following
back trace:
===============================
[ INFO: suspicious RCU usage. ]
3.10.0-rc2+ #38 Not tainted
-------------------------------
include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!
other info that might help us debug this:
RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
1 lock held by trinity-child1/18786:
#0: (rcu_read_lock){.+.+..}, at: [<ffffffff8113dd48>] __perf_event_overflow+0x108/0x310
stack backtrace:
CPU: 3 PID: 18786 Comm: trinity-child1 Not tainted 3.10.0-rc2+ #38
0000000000000000 ffff88020767bac8 ffffffff816e2f6b ffff88020767baf8
ffffffff810b5897 ffff88021de92520 0000000000000000 ffff88020767bbf8
0000000000000000 ffff88020767bb78 ffffffff8113ded4 ffffffff8113dd48
Call Trace:
[<ffffffff816e2f6b>] dump_stack+0x19/0x1b
[<ffffffff810b5897>] lockdep_rcu_suspicious+0xe7/0x120
[<ffffffff8113ded4>] __perf_event_overflow+0x294/0x310
[<ffffffff8113dd48>] ? __perf_event_overflow+0x108/0x310
[<ffffffff81309289>] ? __const_udelay+0x29/0x30
[<ffffffff81076054>] ? __rcu_read_unlock+0x54/0xa0
[<ffffffff816f4000>] ? ftrace_call+0x5/0x2f
[<ffffffff8113dfa1>] perf_swevent_overflow+0x51/0xe0
[<ffffffff8113e08f>] perf_swevent_event+0x5f/0x90
[<ffffffff8113e1c9>] perf_tp_event+0x109/0x4f0
[<ffffffff8113e36f>] ? perf_tp_event+0x2af/0x4f0
[<ffffffff81074630>] ? __rcu_read_lock+0x20/0x20
[<ffffffff8112d79f>] perf_ftrace_function_call+0xbf/0xd0
[<ffffffff8110e1e1>] ? ftrace_ops_control_func+0x181/0x210
[<ffffffff81074630>] ? __rcu_read_lock+0x20/0x20
[<ffffffff81100cae>] ? rcu_eqs_enter_common+0x5e/0x470
[<ffffffff8110e1e1>] ftrace_ops_control_func+0x181/0x210
[<ffffffff816f4000>] ftrace_call+0x5/0x2f
[<ffffffff8110e229>] ? ftrace_ops_control_func+0x1c9/0x210
[<ffffffff816f4000>] ? ftrace_call+0x5/0x2f
[<ffffffff81074635>] ? debug_lockdep_rcu_enabled+0x5/0x40
[<ffffffff81074635>] ? debug_lockdep_rcu_enabled+0x5/0x40
[<ffffffff81100cae>] ? rcu_eqs_enter_common+0x5e/0x470
[<ffffffff8110112a>] rcu_eqs_enter+0x6a/0xb0
[<ffffffff81103673>] rcu_user_enter+0x13/0x20
[<ffffffff8114541a>] user_enter+0x6a/0xd0
[<ffffffff8100f6d8>] syscall_trace_leave+0x78/0x140
[<ffffffff816f46af>] int_check_syscall_exit_work+0x34/0x3d
------------[ cut here ]------------
Perf uses rcu_read_lock() but as the function tracer can trace functions
even when RCU is not currently active, this makes the rcu_read_lock()
used by perf ineffective.
As perf is currently the only user of the ftrace_ops_control_func() and
perf is also the only function callback that actively uses rcu_read_lock(),
the quick fix is to prevent the ftrace_ops_control_func() from calling
its callbacks if RCU is not active.
With Paul's new "rcu_is_watching()" we can tell if RCU is active or not.
Reported-by: Dave Jones <davej@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead
of opencoding an alternate postorder iteration that modifies the tree
Link: http://lkml.kernel.org/r/1383345566-25087-2-git-send-email-cody@linux.vnet.ibm.com
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The parser set up is just a generic utility that uses local variables
allocated by the function. There's no need to hold the graph_lock for
this set up.
This also makes the code simpler.
Link: http://lkml.kernel.org/r/1381739066-7531-4-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The struct ftrace_graph_data is for generalizing the access to
set_graph_function file. This is a preparation for adding support to
set_graph_notrace.
Link: http://lkml.kernel.org/r/1381739066-7531-3-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_graph_filter_enabled means that user sets function filter
and it always has same meaning of ftrace_graph_count > 0.
Link: http://lkml.kernel.org/r/1381739066-7531-2-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Andrey reported the following report:
ERROR: AddressSanitizer: heap-buffer-overflow on address ffff8800359c99f3
ffff8800359c99f3 is located 0 bytes to the right of 243-byte region [ffff8800359c9900, ffff8800359c99f3)
Accessed by thread T13003:
#0 ffffffff810dd2da (asan_report_error+0x32a/0x440)
#1 ffffffff810dc6b0 (asan_check_region+0x30/0x40)
#2 ffffffff810dd4d3 (__tsan_write1+0x13/0x20)
#3 ffffffff811cd19e (ftrace_regex_release+0x1be/0x260)
#4 ffffffff812a1065 (__fput+0x155/0x360)
#5 ffffffff812a12de (____fput+0x1e/0x30)
#6 ffffffff8111708d (task_work_run+0x10d/0x140)
#7 ffffffff810ea043 (do_exit+0x433/0x11f0)
#8 ffffffff810eaee4 (do_group_exit+0x84/0x130)
#9 ffffffff810eafb1 (SyS_exit_group+0x21/0x30)
#10 ffffffff81928782 (system_call_fastpath+0x16/0x1b)
Allocated by thread T5167:
#0 ffffffff810dc778 (asan_slab_alloc+0x48/0xc0)
#1 ffffffff8128337c (__kmalloc+0xbc/0x500)
#2 ffffffff811d9d54 (trace_parser_get_init+0x34/0x90)
#3 ffffffff811cd7b3 (ftrace_regex_open+0x83/0x2e0)
#4 ffffffff811cda7d (ftrace_filter_open+0x2d/0x40)
#5 ffffffff8129b4ff (do_dentry_open+0x32f/0x430)
#6 ffffffff8129b668 (finish_open+0x68/0xa0)
#7 ffffffff812b66ac (do_last+0xb8c/0x1710)
#8 ffffffff812b7350 (path_openat+0x120/0xb50)
#9 ffffffff812b8884 (do_filp_open+0x54/0xb0)
#10 ffffffff8129d36c (do_sys_open+0x1ac/0x2c0)
#11 ffffffff8129d4b7 (SyS_open+0x37/0x50)
#12 ffffffff81928782 (system_call_fastpath+0x16/0x1b)
Shadow bytes around the buggy address:
ffff8800359c9700: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
ffff8800359c9780: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
ffff8800359c9800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>ffff8800359c9980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[03]fb
ffff8800359c9a00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
ffff8800359c9b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
ffff8800359c9b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8800359c9c00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap redzone: fa
Heap kmalloc redzone: fb
Freed heap region: fd
Shadow gap: fe
The out-of-bounds access happens on 'parser->buffer[parser->idx] = 0;'
Although the crash happened in ftrace_regex_open() the real bug
occurred in trace_get_user() where there's an incrementation to
parser->idx without a check against the size. The way it is triggered
is if userspace sends in 128 characters (EVENT_BUF_SIZE + 1), the loop
that reads the last character stores it and then breaks out because
there is no more characters. Then the last character is read to determine
what to do next, and the index is incremented without checking size.
Then the caller of trace_get_user() usually nulls out the last character
with a zero, but since the index is equal to the size, it writes a nul
character after the allocated space, which can corrupt memory.
Luckily, only root user has write access to this file.
Link: http://lkml.kernel.org/r/20131009222323.04fd1a0d@gandalf.local.home
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current "help" that comes out of the snapshot file when it is
not allocated looks like this:
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
Echo 2 says that it does not allocate the buffer, which is correct,
but to be more consistent with "echo 0" it should also state
that it does not free.
Link: http://lkml.kernel.org/r/20130914045916.GA4243@udknight
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
are still in flux, and will have to wait for 3.13.
The changes for 3.12 are mostly clean ups and minor fixes.
H. Peter Anvin added a check to x86_32 static function tracing that
helps a small segment of the kernel community.
Oleg Nesterov had a few changes from 3.11, but were mostly clean ups
and not worth pushing in the -rc time frame.
Li Zefan had small clean up with annotating a raw_init with __init.
I fixed a slight race in updating function callbacks, but the race
is so small and the bug that happens when it occurs is so minor it's
not even worth pushing to stable.
The only real enhancement is from Alexander Z Lam that made the
tracing_cpumask work for trace buffer instances, instead of them all
sharing a global cpumask.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJSLJm1AAoJEOdOSU1xswtMSu0H/0/Uuh0D5VhANZRcTATY4gUO
n3WH6sm3atOxH+cbeYQcFXxOcvRcR2n90tvCMpiFlPiC0NiNR1yjro3VLS4zWb77
twq7gABdJf+Tdq7sOBmSzmY5vRKQVHIXvAfC27mBez38nCWZz0BjJGEsPBwoly25
ZaiCbKlusw/QKIEy40tuKUL/rXF6yEWnQrMujhBbyNm0w7sJVdfnd+HHmCvy15H2
IQE1g83d/dAMBjFY2BYg77J+oV6qmJxql2itvDivQWXHqFb52Jw3ZTwHwWLZlPYU
AZcHtYGs2lSUscQLF56LejB7zZyE8taUufExFEVexXxZS5u7nNPXsPrA2LOOK70=
=JWO6
-----END PGP SIGNATURE-----
Merge tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Not much changes for the 3.12 merge window. The major tracing changes
are still in flux, and will have to wait for 3.13.
The changes for 3.12 are mostly clean ups and minor fixes.
H Peter Anvin added a check to x86_32 static function tracing that
helps a small segment of the kernel community.
Oleg Nesterov had a few changes from 3.11, but were mostly clean ups
and not worth pushing in the -rc time frame.
Li Zefan had small clean up with annotating a raw_init with __init.
I fixed a slight race in updating function callbacks, but the race is
so small and the bug that happens when it occurs is so minor it's not
even worth pushing to stable.
The only real enhancement is from Alexander Z Lam that made the
tracing_cpumask work for trace buffer instances, instead of them all
sharing a global cpumask"
* tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/rcu: Do not trace debug_lockdep_rcu_enabled()
x86-32, ftrace: Fix static ftrace when early microcode is enabled
ftrace: Fix a slight race in modifying what function callback gets traced
tracing: Make tracing_cpumask available for all instances
tracing: Kill the !CONFIG_MODULES code in trace_events.c
tracing: Don't pass file_operations array to event_create_dir()
tracing: Kill trace_create_file_ops() and friends
tracing/syscalls: Annotate raw_init function with __init
There's a slight race when going from a list function to a non list
function. That is, when only one callback is registered to the function
tracer, it gets called directly by the mcount trampoline. But if this
function has filters, it may be called by the wrong functions.
As the list ops callback that handles multiple callbacks that are
registered to ftrace, it also handles what functions they call. While
the transaction is taking place, use the list function always, and
after all the updates are finished (only the functions that should be
traced are being traced), then we can update the trampoline to call
the function directly.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull RCU updates from Paul E. McKenney:
"
* Update RCU documentation. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/611.
* Miscellaneous fixes. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/619.
* Full-system idle detection. This is for use by Frederic
Weisbecker's adaptive-ticks mechanism. Its purpose is
to allow the timekeeping CPU to shut off its tick when
all other CPUs are idle. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/648.
* Improve rcutorture test coverage. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/675.
"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Allow tracer instances to disable tracing by cpu by moving
the static global tracing_cpumask into trace_array.
Link: http://lkml.kernel.org/r/921622317f239bfc2283cac2242647801ef584f2.1375980149.git.azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move trace_module_nb under CONFIG_MODULES and kill the dummy
trace_module_notify(). Imho it doesn't make sense to define
"struct notifier_block" and its .notifier_call just to avoid
"ifdef" in event_trace_init(), and all other !CONFIG_MODULES
code has already gone away.
Link: http://lkml.kernel.org/r/20130731173137.GA31043@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that event_create_dir() and __trace_add_new_event() always
use the same file_operations we can kill these arguments and
simplify the code.
Link: http://lkml.kernel.org/r/20130731173135.GA31040@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_create_file_ops() allocates the copy of id/filter/format/enable
file_operations to set "f_op->owner = mod" for fops_get().
However after the recent changes there is no reason to prevent rmmod
even if one of these files is opened. A file operation can do nothing
but fail after remove_event_file_dir() clears ->i_private for every
file removed by trace_module_remove_events().
Kill "struct ftrace_module_file_ops" and fix the compilation errors.
Link: http://lkml.kernel.org/r/20130731173132.GA31033@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
init_syscall_trace() can only be called during kernel bootup only, so we can
mark it and the functions it calls as __init.
Link: http://lkml.kernel.org/r/51528E89.6080508@huawei.com
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fixed two issues with changing the timestamp clock with trace_clock:
- The global buffer was reset on instance clock changes. Change this to pass
the correct per-instance buffer
- ftrace_now() is used to set buf->time_start in tracing_reset_online_cpus().
This was incorrect because ftrace_now() used the global buffer's clock to
return the current time. Change this to use buffer_ftrace_now() which
returns the current time for the correct per-instance buffer.
Also removed tracing_reset_current() because it is not used anywhere
Link: http://lkml.kernel.org/r/1375493777-17261-2-git-send-email-azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Releasing the free_buffer file in an instance causes the global buffer
to be stopped when TRACE_ITER_STOP_ON_FREE is enabled. Operate on the
correct buffer.
Link: http://lkml.kernel.org/r/1375493777-17261-1-git-send-email-azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_read_pipe zeros all fields bellow "seq". The declaration contains
a comment about that, but it doesn't help.
The first field is "snapshot", it's true when current open file is
snapshot. Looks obvious, that it should not be zeroed.
The second field is "started". It was converted from cpumask_t to
cpumask_var_t (v2.6.28-4983-g4462344), in other words it was
converted from cpumask to pointer on cpumask.
Currently the reference on "started" memory is lost after the first read
from tracing_read_pipe and a proper object will never be freed.
The "started" is never dereferenced for trace_pipe, because trace_pipe
can't have the TRACE_FILE_ANNOTATE options.
Link: http://lkml.kernel.org/r/1375463803-3085183-1-git-send-email-avagin@openvz.org
Cc: stable@vger.kernel.org # 2.6.30
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Uprobes suffer the same problem that kprobes have. There's a race between
writing to the "enable" file and removing the probe. The probe checks for
it being in use and if it is not, goes about deleting the probe and the
event that represents it. But the problem with that is, after it checks
if it is in use it can be enabled, and the deletion of the event (access
to the probe) will fail, as it is in use. But the uprobe will still be
deleted. This is a problem as the event can reference the uprobe that
was deleted.
The fix is to remove the event first, and check to make sure the event
removal succeeds. Then it is safe to remove the probe.
When the event exists, either ftrace or perf can enable the probe and
prevent the event from being removed.
Link: http://lkml.kernel.org/r/20130704034038.991525256@goodmis.org
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a probe is being removed, it cleans up the event files that correspond
to the probe. But there is a race between writing to one of these files
and deleting the probe. This is especially true for the "enable" file.
CPU 0 CPU 1
----- -----
fd = open("enable",O_WRONLY);
probes_open()
release_all_trace_probes()
unregister_trace_probe()
if (trace_probe_is_enabled(tp))
return -EBUSY
write(fd, "1", 1)
__ftrace_set_clr_event()
call->class->reg()
(kprobe_register)
enable_trace_probe(tp)
__unregister_trace_probe(tp);
list_del(&tp->list)
unregister_probe_event(tp) <-- fails!
free_trace_probe(tp)
write(fd, "0", 1)
__ftrace_set_clr_event()
call->class->unreg
(kprobe_register)
disable_trace_probe(tp) <-- BOOM!
A test program was written that used two threads to simulate the
above scenario adding a nanosleep() interval to change the timings
and after several thousand runs, it was able to trigger this bug
and crash:
BUG: unable to handle kernel paging request at 00000005000000f9
IP: [<ffffffff810dee70>] probes_open+0x3b/0xa7
PGD 7808a067 PUD 0
Oops: 0000 [#1] PREEMPT SMP
Dumping ftrace buffer:
---------------------------------
Modules linked in: ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6
CPU: 1 PID: 2070 Comm: test-kprobe-rem Not tainted 3.11.0-rc3-test+ #47
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
task: ffff880077756440 ti: ffff880076e52000 task.ti: ffff880076e52000
RIP: 0010:[<ffffffff810dee70>] [<ffffffff810dee70>] probes_open+0x3b/0xa7
RSP: 0018:ffff880076e53c38 EFLAGS: 00010203
RAX: 0000000500000001 RBX: ffff88007844f440 RCX: 0000000000000003
RDX: 0000000000000003 RSI: 0000000000000003 RDI: ffff880076e52000
RBP: ffff880076e53c58 R08: ffff880076e53bd8 R09: 0000000000000000
R10: ffff880077756440 R11: 0000000000000006 R12: ffffffff810dee35
R13: ffff880079250418 R14: 0000000000000000 R15: ffff88007844f450
FS: 00007f87a276f700(0000) GS:ffff88007d480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000005000000f9 CR3: 0000000077262000 CR4: 00000000000007e0
Stack:
ffff880076e53c58 ffffffff81219ea0 ffff88007844f440 ffffffff810dee35
ffff880076e53ca8 ffffffff81130f78 ffff8800772986c0 ffff8800796f93a0
ffffffff81d1b5d8 ffff880076e53e04 0000000000000000 ffff88007844f440
Call Trace:
[<ffffffff81219ea0>] ? security_file_open+0x2c/0x30
[<ffffffff810dee35>] ? unregister_trace_probe+0x4b/0x4b
[<ffffffff81130f78>] do_dentry_open+0x162/0x226
[<ffffffff81131186>] finish_open+0x46/0x54
[<ffffffff8113f30b>] do_last+0x7f6/0x996
[<ffffffff8113cc6f>] ? inode_permission+0x42/0x44
[<ffffffff8113f6dd>] path_openat+0x232/0x496
[<ffffffff8113fc30>] do_filp_open+0x3a/0x8a
[<ffffffff8114ab32>] ? __alloc_fd+0x168/0x17a
[<ffffffff81131f4e>] do_sys_open+0x70/0x102
[<ffffffff8108f06e>] ? trace_hardirqs_on_caller+0x160/0x197
[<ffffffff81131ffe>] SyS_open+0x1e/0x20
[<ffffffff81522742>] system_call_fastpath+0x16/0x1b
Code: e5 41 54 53 48 89 f3 48 83 ec 10 48 23 56 78 48 39 c2 75 6c 31 f6 48 c7
RIP [<ffffffff810dee70>] probes_open+0x3b/0xa7
RSP <ffff880076e53c38>
CR2: 00000005000000f9
---[ end trace 35f17d68fc569897 ]---
The unregister_trace_probe() must be done first, and if it fails it must
fail the removal of the kprobe.
Several changes have already been made by Oleg Nesterov and Masami Hiramatsu
to allow moving the unregister_probe_event() before the removal of
the probe and exit the function if it fails. This prevents the tp
structure from being used after it is freed.
Link: http://lkml.kernel.org/r/20130704034038.819592356@goodmis.org
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "break" used in the do_for_each_event_file() is used as an optimization
as the loop is really a double loop. The loop searches all event files
for each trace_array. There's only one matching event file per trace_array
and after we find the event file for the trace_array, the break is used
to jump to the next trace_array and start the search there.
As this is not a standard way of using "break" in C code, it requires
a comment right before the break to let people know what is going on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change trace_remove_event_call(call) to return the error if this
call is active. This is what the callers assume but can't verify
outside of the tracing locks. Both trace_kprobe.c/trace_uprobe.c
need the additional changes, unregister_trace_probe() should abort
if trace_remove_event_call() fails.
The caller is going to free this call/file so we must ensure that
nobody can use them after trace_remove_event_call() succeeds.
debugfs should be fine after the previous changes and event_remove()
does TRACE_REG_UNREGISTER, but still there are 2 reasons why we need
the additional checks:
- There could be a perf_event(s) attached to this tp_event, so the
patch checks ->perf_refcount.
- TRACE_REG_UNREGISTER can be suppressed by FTRACE_EVENT_FL_SOFT_MODE,
so we simply check FTRACE_EVENT_FL_ENABLED protected by event_mutex.
Link: http://lkml.kernel.org/r/20130729175033.GB26284@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's been a nasty bug that would show up and not give much info.
The bug displayed the following warning:
WARNING: at kernel/trace/ftrace.c:1529 __ftrace_hash_rec_update+0x1e3/0x230()
Pid: 20903, comm: bash Tainted: G O 3.6.11+ #38405.trunk
Call Trace:
[<ffffffff8103e5ff>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8103e65a>] warn_slowpath_null+0x1a/0x20
[<ffffffff810c2ee3>] __ftrace_hash_rec_update+0x1e3/0x230
[<ffffffff810c4f28>] ftrace_hash_move+0x28/0x1d0
[<ffffffff811401cc>] ? kfree+0x2c/0x110
[<ffffffff810c68ee>] ftrace_regex_release+0x8e/0x150
[<ffffffff81149f1e>] __fput+0xae/0x220
[<ffffffff8114a09e>] ____fput+0xe/0x10
[<ffffffff8105fa22>] task_work_run+0x72/0x90
[<ffffffff810028ec>] do_notify_resume+0x6c/0xc0
[<ffffffff8126596e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[<ffffffff815c0f88>] int_signal+0x12/0x17
---[ end trace 793179526ee09b2c ]---
It was finally narrowed down to unloading a module that was being traced.
It was actually more than that. When functions are being traced, there's
a table of all functions that have a ref count of the number of active
tracers attached to that function. When a function trace callback is
registered to a function, the function's record ref count is incremented.
When it is unregistered, the function's record ref count is decremented.
If an inconsistency is detected (ref count goes below zero) the above
warning is shown and the function tracing is permanently disabled until
reboot.
The ftrace callback ops holds a hash of functions that it filters on
(and/or filters off). If the hash is empty, the default means to filter
all functions (for the filter_hash) or to disable no functions (for the
notrace_hash).
When a module is unloaded, it frees the function records that represent
the module functions. These records exist on their own pages, that is
function records for one module will not exist on the same page as
function records for other modules or even the core kernel.
Now when a module unloads, the records that represents its functions are
freed. When the module is loaded again, the records are recreated with
a default ref count of zero (unless there's a callback that traces all
functions, then they will also be traced, and the ref count will be
incremented).
The problem is that if an ftrace callback hash includes functions of the
module being unloaded, those hash entries will not be removed. If the
module is reloaded in the same location, the hash entries still point
to the functions of the module but the module's ref counts do not reflect
that.
With the help of Steve and Joern, we found a reproducer:
Using uinput module and uinput_release function.
cd /sys/kernel/debug/tracing
modprobe uinput
echo uinput_release > set_ftrace_filter
echo function > current_tracer
rmmod uinput
modprobe uinput
# check /proc/modules to see if loaded in same addr, otherwise try again
echo nop > current_tracer
[BOOM]
The above loads the uinput module, which creates a table of functions that
can be traced within the module.
We add uinput_release to the filter_hash to trace just that function.
Enable function tracincg, which increments the ref count of the record
associated to uinput_release.
Remove uinput, which frees the records including the one that represents
uinput_release.
Load the uinput module again (and make sure it's at the same address).
This recreates the function records all with a ref count of zero,
including uinput_release.
Disable function tracing, which will decrement the ref count for uinput_release
which is now zero because of the module removal and reload, and we have
a mismatch (below zero ref count).
The solution is to check all currently tracing ftrace callbacks to see if any
are tracing any of the module's functions when a module is loaded (it already does
that with callbacks that trace all functions). If a callback happens to have
a module function being traced, it increments that records ref count and starts
tracing that function.
There may be a strange side effect with this, where tracing module functions
on unload and then reloading a new module may have that new module's functions
being traced. This may be something that confuses the user, but it's not
a big deal. Another approach is to disable all callback hashes on module unload,
but this leaves some ftrace callbacks that may not be registered, but can
still have hashes tracing the module's function where ftrace doesn't know about
it. That situation can cause the same bug. This solution solves that case too.
Another benefit of this solution, is it is possible to trace a module's
function on unload and load.
Link: http://lkml.kernel.org/r/20130705142629.GA325@redhat.com
Reported-by: Jörn Engel <joern@logfs.org>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Steve Hodgson <steve@purestorage.com>
Tested-by: Steve Hodgson <steve@purestorage.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When ftrace ops modifies the functions that it will trace, the update
to the function mcount callers may need to be modified. Consolidate
the two places that do the checks to see if an update is required
with a wrapper function for those checks.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change remove_event_file_dir() to clear ->i_private for every
file we are going to remove.
We need to check file->dir != NULL because event_create_dir()
can fail. debugfs_remove_recursive(NULL) is fine but the patch
moves it under the same check anyway for readability.
spin_lock(d_lock) and "d_inode != NULL" check are not needed
afaics, but I do not understand this code enough.
tracing_open_generic_file() and tracing_release_generic_file()
can go away, ftrace_enable_fops and ftrace_event_filter_fops()
use tracing_open_generic() but only to check tracing_disabled.
This fixes all races with event_remove() or instance_delete().
f_op->read/write/whatever can never use the freed file/call,
all event/* files were changed to check and use ->i_private
under event_mutex.
Note: this doesn't not fix other problems, event_remove() can
destroy the active ftrace_event_call, we need more changes but
those changes are completely orthogonal.
Link: http://lkml.kernel.org/r/20130728183527.GB16723@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Preparation for the next patch. Extract the common code from
remove_event_from_tracers() and __trace_remove_event_dirs()
into the new helper, remove_event_file_dir().
The patch looks more complicated than it actually is, it also
moves remove_subsystem() up to avoid the forward declaration.
Link: http://lkml.kernel.org/r/20130726172547.GA3629@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_format_open() and trace_format_seq_ops are racy, nothing
protects ftrace_event_call from trace_remove_event_call().
Change f_start() to take event_mutex and verify i_private != NULL,
change f_stop() to drop this lock.
This fixes nothing, but now we can change debugfs_remove("format")
callers to nullify ->i_private and fix the the problem.
Note: the usage of event_mutex is sub-optimal but simple, we can
change this later.
Link: http://lkml.kernel.org/r/20130726172543.GA3622@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
event_filter_read/write() are racy, ftrace_event_call can be already
freed by trace_remove_event_call() callers.
1. Shift mutex_lock(event_mutex) from print/apply_event_filter to
the callers.
2. Change the callers, event_filter_read() and event_filter_write()
to read i_private under this mutex and abort if it is NULL.
This fixes nothing, but now we can change debugfs_remove("filter")
callers to nullify ->i_private and fix the the problem.
Link: http://lkml.kernel.org/r/20130726172540.GA3619@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_generic_file() is racy, ftrace_event_file can be
already freed by rmdir or trace_remove_event_call().
Change event_enable_read() and event_disable_read() to read and
verify "file = i_private" under event_mutex.
This fixes nothing, but now we can change debugfs_remove("enable")
callers to nullify ->i_private and fix the the problem.
Link: http://lkml.kernel.org/r/20130726172536.GA3612@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
event_id_read() is racy, ftrace_event_call can be already freed
by trace_remove_event_call() callers.
Change event_create_dir() to pass "data = call->event.type", this
is all event_id_read() needs. ftrace_event_id_fops no longer needs
tracing_open_generic().
We add the new helper, event_file_data(), to read ->i_private, it
will have more users.
Note: currently ACCESS_ONCE() and "id != 0" check are not needed,
but we are going to change event_remove/rmdir to clear ->i_private.
Link: http://lkml.kernel.org/r/20130726172532.GA3605@redhat.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are several tracepoints (mostly in RCU), that reference a string
pointer and uses the print format of "%s" to display the string that
exists in the kernel, instead of copying the actual string to the
ring buffer (saves time and ring buffer space).
But this has an issue with userspace tools that read the binary buffers
that has the address of the string but has no access to what the string
itself is. The end result is just output that looks like:
rcu_dyntick: ffffffff818adeaa 1 0
rcu_dyntick: ffffffff818adeb5 0 140000000000000
rcu_dyntick: ffffffff818adeb5 0 140000000000000
rcu_utilization: ffffffff8184333b
rcu_utilization: ffffffff8184333b
The above is pretty useless when read by the userspace tools. Ideally
we would want something that looks like this:
rcu_dyntick: Start 1 0
rcu_dyntick: End 0 140000000000000
rcu_dyntick: Start 140000000000000 0
rcu_callback: rcu_preempt rhp=0xffff880037aff710 func=put_cred_rcu 0/4
rcu_callback: rcu_preempt rhp=0xffff880078961980 func=file_free_rcu 0/5
rcu_dyntick: End 0 1
The trace_printk() which also only stores the address of the string
format instead of recording the string into the buffer itself, exports
the mapping of kernel addresses to format strings via the printk_format
file in the debugfs tracing directory.
The tracepoint strings can use this same method and output the format
to the same file and the userspace tools will be able to decipher
the address without any modification.
The tracepoint strings need its own section to save the strings because
the trace_printk section will cause the trace_printk() buffers to be
allocated if anything exists within the section. trace_printk() is only
used for debugging and should never exist in the kernel, we can not use
the trace_printk sections.
Add a new tracepoint_str section that will also be examined by the output
of the printk_format file.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit a82274151a "tracing: Protect ftrace_trace_arrays list in trace_events.c"
added taking the trace_types_lock mutex in trace_events.c as there were
several locations that needed it for protection. Unfortunately, it also
encapsulated a call to tracing_reset_all_online_cpus() which also takes
the trace_types_lock, causing a deadlock.
This happens when a module has tracepoints and has been traced. When the
module is removed, the trace events module notifier will grab the
trace_types_lock, do a bunch of clean ups, and also clears the buffer
by calling tracing_reset_all_online_cpus. This doesn't happen often
which explains why it wasn't caught right away.
Commit a82274151a was marked for stable, which means this must be
sent to stable too.
Link: http://lkml.kernel.org/r/51EEC646.7070306@broadcom.com
Reported-by: Arend van Spril <arend@broadcom.com>
Tested-by: Arend van Spriel <arend@broadcom.com>
Cc: Alexander Z Lam <azl@google.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a ftrace ops is registered with the SAVE_REGS flag set, and there's
already a ops registered to one of its functions but without the
SAVE_REGS flag, there's a small race window where the SAVE_REGS ops gets
added to the list of callbacks to call for that function before the
callback trampoline gets set to save the regs.
The problem is, the function is not currently saving regs, which opens
a small race window where the ops that is expecting regs to be passed
to it, wont. This can cause a crash if the callback were to reference
the regs, as the SAVE_REGS guarantees that regs will be set.
To fix this, we add a check in the loop case where it checks if the ops
has the SAVE_REGS flag set, and if so, it will ignore it if regs is
not set.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After the previous changes trace_array_cpu->trace_cpu and
trace_array->trace_cpu becomes write-only. Remove these members
and kill "struct trace_cpu" as well.
As a side effect this also removes memset(per_cpu_memory, 0).
It was not needed, alloc_percpu() returns zero-filled memory.
Link: http://lkml.kernel.org/r/20130723152613.GA23741@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open() and tracing_snapshot_open() are racy, the memory
inode->i_private points to can be already freed.
Convert these last users of "inode->i_private == trace_cpu" to
use "i_private = trace_array" and rely on tracing_get_cpu().
v2: incorporate the fix from Steven, tracing_release() must not
blindly dereference file->private_data unless we know that
the file was opened for reading.
Link: http://lkml.kernel.org/r/20130723152610.GA23737@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_generic_tc() is racy, the memory inode->i_private
points to can be already freed.
1. Change its last user, tracing_entries_fops, to use
tracing_*_generic_tr() instead.
2. Change debugfs_create_file("buffer_size_kb", data) callers
to pass "data = tr".
3. Change tracing_entries_read() and tracing_entries_write() to
use tracing_get_cpu().
4. Kill the no longer used tracing_open_generic_tc() and
tracing_release_generic_tc().
Link: http://lkml.kernel.org/r/20130723152606.GA23730@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_generic_tc() is racy, the memory inode->i_private
points to can be already freed.
1. Change one of its users, tracing_stats_fops, to use
tracing_*_generic_tr() instead.
2. Change trace_create_cpu_file("stats", data) to pass "data = tr".
3. Change tracing_stats_read() to use tracing_get_cpu().
Link: http://lkml.kernel.org/r/20130723152603.GA23727@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_buffers_open() is racy, the memory inode->i_private points
to can be already freed.
Change debugfs_create_file("trace_pipe_raw", data) caller to pass
"data = tr", tracing_buffers_open() can use tracing_get_cpu().
Change debugfs_create_file("snapshot_raw_fops", data) caller too,
this file uses tracing_buffers_open/release.
Link: http://lkml.kernel.org/r/20130723152600.GA23720@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_open_pipe() is racy, the memory inode->i_private points to
can be already freed.
Change debugfs_create_file("trace_pipe", data) callers to to pass
"data = tr", tracing_open_pipe() can use tracing_get_cpu().
Link: http://lkml.kernel.org/r/20130723152557.GA23717@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every "file_operations" used by tracing_init_debugfs_percpu is buggy.
f_op->open/etc does:
1. struct trace_cpu *tc = inode->i_private;
struct trace_array *tr = tc->tr;
2. trace_array_get(tr) or fail;
3. do_something(tc);
But tc (and tr) can be already freed before trace_array_get() is called.
And it doesn't matter whether this file is per-cpu or it was created by
init_tracer_debugfs(), free_percpu() or kfree() are equally bad.
Note that even 1. is not safe, the freed memory can be unmapped. But even
if it was safe trace_array_get() can wrongly succeed if we also race with
the next new_instance_create() which can re-allocate the same tr, or tc
was overwritten and ->tr points to the valid tr. In this case 3. uses the
freed/reused memory.
Add the new trivial helper, trace_create_cpu_file() which simply calls
trace_create_file() and encodes "cpu" in "struct inode". Another helper,
tracing_get_cpu() will be used to read cpu_nr-or-RING_BUFFER_ALL_CPUS.
The patch abuses ->i_cdev to encode the number, it is never used unless
the file is S_ISCHR(). But we could use something else, say, i_bytes or
even ->d_fsdata. In any case this hack is hidden inside these 2 helpers,
it would be trivial to change them if needed.
This patch only changes tracing_init_debugfs_percpu() to use the new
trace_create_cpu_file(), the next patches will change file_operations.
Note: tracing_get_cpu(inode) is always safe but you can't trust the
result unless trace_array_get() was called, without trace_types_lock
which acts as a barrier it can wrongly return RING_BUFFER_ALL_CPUS.
Link: http://lkml.kernel.org/r/20130723152554.GA23710@redhat.com
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_buffers_open() does trace_array_get() and then it wrongly
inrcements tr->ref again under trace_types_lock. This means that
every caller leaks trace_array:
# cd /sys/kernel/debug/tracing/
# mkdir instances/X
# true < instances/X/per_cpu/cpu0/trace_pipe_raw
# rmdir instances/X
rmdir: failed to remove `instances/X': Device or resource busy
Link: http://lkml.kernel.org/r/20130719153644.GA18899@redhat.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Trivial. trace_array->waiter has no users since 6eaaa5d5
"tracing/core: use appropriate waiting on trace_pipe".
Link: http://lkml.kernel.org/r/20130719142036.GA1594@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
event_id_read() has no reason to kmalloc "struct trace_seq"
(more than PAGE_SIZE!), it can use a small buffer instead.
Note: "if (*ppos) return 0" looks strange and even wrong,
simple_read_from_buffer() handles ppos != 0 case corrrectly.
And it seems that almost every user of trace_seq in this file
should be converted too. Unless you use seq_open(), trace_seq
buys nothing compared to the raw buffer, but it needs a bit
more memory and code.
Link: http://lkml.kernel.org/r/20130718184712.GA4786@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
f_next() looks overcomplicated, and it is not strictly correct
even if this doesn't matter.
Say, FORMAT_FIELD_SEPERATOR should not return NULL (means EOF)
if trace_get_fields() returns an empty list, we should simply
advance to FORMAT_PRINTFMT as we do when we find the end of list.
1. Change f_next() to return "struct list_head *" rather than
"ftrace_event_field *", and change f_show() to do list_entry().
This simplifies the code a bit, only f_show() needs to know
about ftrace_event_field, and f_next() can play with ->prev
directly
2. Change f_next() to not play with ->prev / return inside the
switch() statement. It can simply set node = head/common_head,
the prev-or-advance-to-the-next-magic below does all work.
While at it. f_start() looks overcomplicated too. I don't think
*pos == 0 makes sense as a separate case, just change this code
to do "while" instead of "do/while".
The patch also moves f_start() down, close to f_stop(). This is
purely cosmetic, just to make the locking added by the next patch
more clear/visible.
Link: http://lkml.kernel.org/r/20130718184710.GA4783@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The selftest for function and function graph tracers are defined as
__init, as they are only executed at boot up. The "tracer" structs
that are associated to those tracers are not setup as __init as they
are used after boot. To stop mismatch warnings, those structures
need to be annotated with __ref_data.
Currently, the tracer structures are defined to __read_mostly, as they
do not really change. But in the future they should be converted to
consts, but that will take a little work because they have a "next"
pointer that gets updated when they are registered. That will have to
wait till the next major release.
Link: http://lkml.kernel.org/r/1373596735.17876.84.camel@gandalf.local.home
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some error paths did not handle ref counting properly, and some trace files need
ref counting.
Link: http://lkml.kernel.org/r/1374171524-11948-1-git-send-email-azl@google.com
Cc: stable@vger.kernel.org # 3.10
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove debugfs directories for tracing instances during creation if an error
occurs causing the trace_array for that instance to not be added to
ftrace_trace_arrays. If the directory continues to exist after the error, it
cannot be removed because the respective trace_array is not in
ftrace_trace_arrays.
Link: http://lkml.kernel.org/r/1373502874-1706-2-git-send-email-azl@google.com
Cc: stable@vger.kernel.org # 3.10
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Wait for disabling all running kprobe handlers when a kprobe
event is disabled, since the caller, trace_remove_event_call()
supposes that a removing event is disabled completely by
disabling the event.
With this change, ftrace can ensure that there is no running
event handlers after disabling it.
Link: http://lkml.kernel.org/r/20130709093526.20138.93100.stgit@mhiramat-M0-7522
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every perf_trace_buf_prepare() caller does
WARN_ONCE(size > PERF_MAX_TRACE_SIZE, message) and "message" is
almost the same.
Shift this WARN_ONCE() into perf_trace_buf_prepare(). This changes
the meaning of _ONCE, but I think this is fine.
- 4947014 2932448 10104832 17984294 1126b26 vmlinux
+ 4948422 2932448 10104832 17985702 11270a6 vmlinux
on my build.
Link: http://lkml.kernel.org/r/20130617170211.GA19813@redhat.com
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
perf_trace_buf_prepare() + perf_trace_buf_submit(head, task => NULL)
make no sense if hlist_empty(head). Change perf_syscall_enter/exit()
to check sys_data->{enter,exit}_event->perf_events beforehand.
Link: http://lkml.kernel.org/r/20130617170207.GA19806@redhat.com
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
perf_trace_buf_prepare() + perf_trace_buf_submit(head, task => NULL)
make no sense if hlist_empty(head). Change perf_ftrace_function_call()
to check event_function.perf_events beforehand.
Link: http://lkml.kernel.org/r/20130617170204.GA19803@redhat.com
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There have some mismatch between comments with
real function name, update it.
This patch also add some missed function arguments
description.
Link: http://lkml.kernel.org/r/51E3B3B2.4080307@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For string without format specifiers, use trace_seq_puts()
or trace_seq_putc().
Link: http://lkml.kernel.org/r/51E3B3AC.1000605@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ fixed a trace_seq_putc(s, " ") to trace_seq_putc(s, ' ') ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We should use CONFIG_STACK_TRACER to guard readme text
of stack tracer related file, not CONFIG_STACKTRACE.
Link: http://lkml.kernel.org/r/51E3B3A2.8080609@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
were added to 3.10, which includes several bug fixes that have been
marked for stable.
As for new features, there were a few, but nothing to write to LWN about.
These include:
New function trigger called "dump" and "cpudump" that will cause
ftrace to dump its buffer to the console when the function is called.
The difference between "dump" and "cpudump" is that "dump" will dump
the entire contents of the ftrace buffer, where as "cpudump" will only
dump the contents of the ftrace buffer for the CPU that called the function.
Another small enhancement is a new sysctl switch called "traceoff_on_warning"
which, when enabled, will disable tracing if any WARN_ON() is triggered.
This is useful if you want to debug what caused a warning and do not
want to risk losing your trace data by the ring buffer overwriting the
data before you can disable it. There's also a kernel command line
option that will make this enabled at boot up called the same thing.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJR1uF2AAoJEOdOSU1xswtMJ1IH/2LSiZAKTA2QaRgGQC/5Bb9c
XSOI1HfD/78lmUvTyb0AX8sLpkzZlvIONEQ/WaZUFo1Zjbrl45zJUwMkTE9uImEg
ZqI5x8OiiN6j4XrRbfYn3Ti060H/Jq41pZXa+shh961Vv51ilv/1yyLkoRmnjzuO
JTloPdXDV7icOqqiSdgxSdtUSv59Ef1ZdHgvvsb3aqzMC5btVQPi4kIys0ST1Tr1
pMWBY+UgvH0xYm3gvTR+W6jjDlkVZEH2alkmcinfr+uC1tm9DDqK2HA17Pd5yZ5z
HNdT76lCzf9iqRF5F8HUvUt+PIp76dNNxAt2qpB6APqAuJTojyguxXHDbY/0kzs=
=UvLi
-----END PGP SIGNATURE-----
Merge tag 'trace-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing changes from Steven Rostedt:
"The majority of the changes here are cleanups for the large changes
that were added to 3.10, which includes several bug fixes that have
been marked for stable.
As for new features, there were a few, but nothing to write to LWN
about. These include:
New function trigger called "dump" and "cpudump" that will cause
ftrace to dump its buffer to the console when the function is called.
The difference between "dump" and "cpudump" is that "dump" will dump
the entire contents of the ftrace buffer, where as "cpudump" will only
dump the contents of the ftrace buffer for the CPU that called the
function.
Another small enhancement is a new sysctl switch called
"traceoff_on_warning" which, when enabled, will disable tracing if any
WARN_ON() is triggered. This is useful if you want to debug what
caused a warning and do not want to risk losing your trace data by the
ring buffer overwriting the data before you can disable it. There's
also a kernel command line option that will make this enabled at boot
up called the same thing"
* tag 'trace-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (34 commits)
tracing: Make tracing_open_generic_{tr,tc}() static
tracing: Remove ftrace() function
tracing: Remove TRACE_EVENT_TYPE enum definition
tracing: Make tracer_tracing_{off,on,is_on}() static
tracing: Fix irqs-off tag display in syscall tracing
uprobes: Fix return value in error handling path
tracing: Fix race between deleting buffer and setting events
tracing: Add trace_array_get/put() to event handling
tracing: Get trace_array ref counts when accessing trace files
tracing: Add trace_array_get/put() to handle instance refs better
tracing: Protect ftrace_trace_arrays list in trace_events.c
tracing: Make trace_marker use the correct per-instance buffer
ftrace: Do not run selftest if command line parameter is set
tracing/kprobes: Don't pass addr=ip to perf_trace_buf_submit()
tracing: Use flag buffer_disabled for irqsoff tracer
tracing/kprobes: Turn trace_probe->files into list_head
tracing: Fix disabling of soft disable
tracing: Add missing syscall_metadata comment
tracing: Simplify code for showing of soft disabled flag
tracing/kprobes: Kill probe_enable_lock
...
I have patches that will use tracing_open_generic_tr/tc() in other
files, but as they are not ready to be merged yet, and Fengguang Wu's
sparse scripts pointed out that these functions were not declared
anywhere, I'll make them static for now.
When these functions are required to be used elsewhere, I'll remove
the static then.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I have patches that will use tracer_tracing_on/off/is_on() in other
files, but as they are not ready to be merged yet, and Fengguang Wu's
sparse scripts pointed out that these functions were not declared
anywhere, I'll make them static for now.
When these functions are required to be used elsewhere, I'll remove
the static then.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While analyzing the code, I discovered that there's a potential race between
deleting a trace instance and setting events. There are a few races that can
occur if events are being traced as the buffer is being deleted. Mostly the
problem comes with freeing the descriptor used by the trace event callback.
To prevent problems like this, the events are disabled before the buffer is
deleted. The problem with the current solution is that the event_mutex is let
go between disabling the events and freeing the files, which means that the events
could be enabled again while the freeing takes place.
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit a695cb5816 "tracing: Prevent deleting instances when they are being read"
tried to fix a race between deleting a trace instance and reading contents
of a trace file. But it wasn't good enough. The following could crash the kernel:
# cd /sys/kernel/debug/tracing/instances
# ( while :; do mkdir foo; rmdir foo; done ) &
# ( while :; do echo 1 > foo/events/sched/sched_switch 2> /dev/null; done ) &
Luckily this can only be done by root user, but it should be fixed regardless.
The problem is that a delete of the file can happen after the write to the event
is opened, but before the enabling happens.
The solution is to make sure the trace_array is available before succeeding in
opening for write, and incerment the ref counter while opened.
Now the instance can be deleted when the events are writing to the buffer,
but the deletion of the instance will disable all events before the instance
is actually deleted.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a trace file is opened that may access a trace array, it must
increment its ref count to prevent it from being deleted.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit a695cb5816 "tracing: Prevent deleting instances when they are being read"
tried to fix a race between deleting a trace instance and reading contents
of a trace file. But it wasn't good enough. The following could crash the kernel:
# cd /sys/kernel/debug/tracing/instances
# ( while :; do mkdir foo; rmdir foo; done ) &
# ( while :; do cat foo/trace &> /dev/null; done ) &
Luckily this can only be done by root user, but it should be fixed regardless.
The problem is that a delete of the file can happen after the reader starts
to open the file but before it grabs the trace_types_mutex.
The solution is to validate the trace array before using it. If the trace
array does not exist in the list of trace arrays, then it returns -ENODEV.
There's a possibility that a trace_array could be deleted and a new one
created and the open would open its file instead. But that is very minor as
it will just return the data of the new trace array, it may confuse the user
but it will not crash the system. As this can only be done by root anyway,
the race will only occur if root is deleting what its trying to read at
the same time.
Cc: stable@vger.kernel.org # 3.10
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are multiple places where the ftrace_trace_arrays list is accessed in
trace_events.c without the trace_types_lock held.
Link: http://lkml.kernel.org/r/1372732674-22726-1-git-send-email-azl@google.com
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_marker file was present for each new instance created, but it
added the trace mark to the global trace buffer instead of to
the instance's buffer.
Link: http://lkml.kernel.org/r/1372717885-4543-2-git-send-email-azl@google.com
Cc: David Sharp <dhsharp@google.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Cc: stable@vger.kernel.org # 3.10
Signed-off-by: Alexander Z Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the kernel command line ftrace filter parameters are set
(ftrace_filter or ftrace_notrace), force the function self test to
pass, with a warning why it was forced.
If the user adds a filter to the kernel command line, it is assumed
that they know what they are doing, and the self test should just not
run instead of failing (which disables function tracing) or clearing
the filter, as that will probably annoy the user.
If the user wants the selftest to run, the message will tell them why
it did not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kprobe_perf_func() and kretprobe_perf_func() pass addr=ip to
perf_trace_buf_submit() for no reason.
This sets perf_sample_data->addr for PERF_SAMPLE_ADDR, we already
have perf_sample_data->ip initialized if PERF_SAMPLE_IP.
Link: http://lkml.kernel.org/r/20130620173811.GA13161@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the ring buffer is disabled and the irqsoff tracer records a trace it
will clear out its buffer and lose the data it had previously recorded.
Currently there's a callback when writing to the tracing_of file, but if
tracing is disabled via the function tracer trigger, it will not inform
the irqsoff tracer to stop recording.
By using the "mirror" flag (buffer_disabled) in the trace_array, that keeps
track of the status of the trace_array's buffer, it gives the irqsoff
tracer a fast way to know if it should record a new trace or not.
The flag may be a little behind the real state of the buffer, but it
should not affect the trace too much. It's more important for the irqsoff
tracer to be fast.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I think that "ftrace_event_file *trace_probe[]" complicates the
code for no reason, turn it into list_head to simplify the code.
enable_trace_probe() no longer needs synchronize_sched().
This needs the extra sizeof(list_head) memory for every attached
ftrace_event_file, hopefully not a problem in this case.
Link: http://lkml.kernel.org/r/20130620173814.GA13165@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The comment on the soft disable 'disable' case of
__ftrace_event_enable_disable() states that the soft disable bit
should be cleared in that case, but currently only the soft mode bit
is actually cleared.
This essentially leaves the standard non-soft-enable enable/disable
paths as the only way to clear the soft disable flag, but the soft
disable bit should also be cleared when removing a trigger with '!'.
Also, the SOFT_DISABLED bit should never be set if SOFT_MODE is
cleared.
This fixes the above discrepancies.
Link: http://lkml.kernel.org/r/b9c68dd50bc07019e6c67d3f9b29be4ef1b2badb.1372479499.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
enable_trace_probe() and disable_trace_probe() should not worry about
serialization, the caller (perf_trace_init or __ftrace_set_clr_event)
holds event_mutex.
They are also called by kprobe_trace_self_tests_init(), but this __init
function can't race with itself or trace_events.c
And note that this code depended on event_mutex even before 41a7dd420c
which introduced probe_enable_lock. In fact it assumes that the caller
kprobe_register() can never race with itself. Otherwise, say, tp->flags
manipulations are racy.
Link: http://lkml.kernel.org/r/20130620173809.GA13158@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change kprobe_perf_func()
and kretprobe_perf_func() to check call->perf_events beforehand
and return if this list is empty.
For example, "perf record -e some_probe -p1". Only /sbin/init will
report, all other threads which hit the same probe will do
perf_trace_buf_prepare/perf_trace_buf_submit just to realize that
nobody wants perf_swevent_event().
Link: http://lkml.kernel.org/r/20130620173806.GA13151@redhat.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Running the following:
# cd /sys/kernel/debug/tracing
# echo p:i do_sys_open > kprobe_events
# echo p:j schedule >> kprobe_events
# cat kprobe_events
p:kprobes/i do_sys_open
p:kprobes/j schedule
# echo p:i do_sys_open >> kprobe_events
# cat kprobe_events
p:kprobes/j schedule
p:kprobes/i do_sys_open
# ls /sys/kernel/debug/tracing/events/kprobes/
enable filter j
Notice that the 'i' is missing from the kprobes directory.
The console produces:
"Failed to create system directory kprobes"
This is because kprobes passes in a allocated name for the system
and the ftrace event subsystem saves off that name instead of creating
a duplicate for it. But the kprobes may free the system name making
the pointer to it invalid.
This bug was introduced by 92edca073c "tracing: Use direct field, type
and system names" which switched from using kstrdup() on the system name
in favor of just keeping apointer to it, as the internal ftrace event
system names are static and exist for the life of the computer being booted.
Instead of reverting back to duplicating system names again, we can use
core_kernel_data() to determine if the passed in name was allocated or
static. Then use the MSB of the ref_count to be a flag to keep track if
the name was allocated or not. Then we can still save from having to duplicate
strings that will always exist, but still copy the ones that may be freed.
Cc: stable@vger.kernel.org # 3.10
Reported-by: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since tp->flags assignment was moved into function enable_trace_probe(),
there is no need to use trace_probe_is_enabled to check flags
in the same function.
Remove the unnecessary checking.
Link: http://lkml.kernel.org/r/51BA7B9E.3040807@huawei.com
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a traceoff_on_warning option in both the kernel command line as well
as a sysctl option. When set, any WARN*() function that is hit will cause
the tracing_on variable to be cleared, which disables writing to the
ring buffer.
This is useful especially when tracing a bug with function tracing. When
a warning is hit, the print caused by the warning can flood the trace with
the functions that producing the output for the warning. This can make the
resulting trace useless by either hiding where the bug happened, or worse,
by overflowing the buffer and losing the trace of the bug totally.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are some cases when filtering on a set flag of a field of a tracepoint
is useful. But currently the only filtering commands for numbered fields
is ==, !=, <, <=, >, >=. This does not help when you just want to trace if
a specific flag is set. For example:
> # sudo trace-cmd record -e brcmfmac:brcmf_dbg -f 'level & 0x40000'
> disable all
> enable brcmfmac:brcmf_dbg
> path = /sys/kernel/debug/tracing/events/brcmfmac/brcmf_dbg/enable
> (level & 0x40000)
> ^
> parse_error: Invalid operator
>
When trying to trace brcmf_dbg when level has its 1 << 18 bit set, the
filter fails to perform.
By allowing a binary '&' operation, this gives the user the ability to
test a bit.
Note, a binary '|' is not added, as it doesn't make sense as fields must
be compared to constants (for now), and ORing a constant will always return
true.
Link: http://lkml.kernel.org/r/1371057385.9844.261.camel@gandalf.local.home
Suggested-by: Arend van Spriel <arend@broadcom.com>
Tested-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracer uses preempt_disable/enable_notrace() for
synchronization between reading registered ftrace_ops and unregistering
them.
Most of the ftrace_ops are global permanent structures that do not
require this synchronization. That is, ops may be added and removed from
the hlist but are never freed, and wont hurt if a synchronization is
missed.
But this is not true for dynamically created ftrace_ops or control_ops,
which are used by the perf function tracing.
The problem here is that the function tracer can be used to trace
kernel/user context switches as well as going to and from idle.
Basically, it can be used to trace blind spots of the RCU subsystem.
This means that even though preempt_disable() is done, a
synchronize_sched() will ignore CPUs that haven't made it out of user
space or idle. These can include functions that are being traced just
before entering or exiting the kernel sections.
To implement the RCU synchronization, instead of using
synchronize_sched() the use of schedule_on_each_cpu() is performed. This
means that when a dynamically allocated ftrace_ops, or a control ops is
being unregistered, all CPUs must be touched and execute a ftrace_sync()
stub function via the work queues. This will rip CPUs out from idle or
in dynamic tick mode. This only happens when a user disables perf
function tracing or other dynamically allocated function tracers, but it
allows us to continue to debug RCU and context tracking with function
tracing.
Link: http://lkml.kernel.org/r/1369785676.15552.55.camel@gandalf.local.home
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 4f271a2a60
(tracing: Add a proc file to stop tracing and free buffer)
implement a method to free up ring buffer in kernel memory
in the release code path of free_buffer's fd.
Then we don't need read/write support for free_buffer,
indeed we just have a dummy write fop, and don't implement read fop.
So the 0200 is more reasonable file mode for free_buffer than
the current file mode 0644.
Link: http://lkml.kernel.org/r/20130526085201.GA3183@udknight
Acked-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Acked-by: David Sharp <dhsharp@google.com>
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the "cpudump" command to have the current CPU ftrace buffer dumped
to console if a function is hit. This is useful when debugging a
tripple fault, where you have an idea of a function that is called
just before the tripple fault occurs, and can tell ftrace to dump its
content out to the console before it continues.
This differs from the "dump" command as it only dumps the content of
the ring buffer for the currently executing CPU, and does not show
the contents of the other CPUs.
Format is:
<function>:cpudump
echo 'bad_address:cpudump' > /debug/tracing/set_ftrace_filter
To remove this:
echo '!bad_address:cpudump' > /debug/tracing/set_ftrace_filter
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the "dump" command to have the ftrace buffer dumped to console if
a function is hit. This is useful when debugging a tripple fault,
where you have an idea of a function that is called just before the
tripple fault occurs, and can tell ftrace to dump its content out
to the console before it continues.
Format is:
<function>:dump
echo 'bad_address:dump' > /debug/tracing/set_ftrace_filter
To remove this:
echo '!bad_address:dump' > /debug/tracing/set_ftrace_filter
Requested-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Outputting formats of x86-tsc and counter should be a raw format, but after
applying the patch(2b6080f28c), the format was
changed to nanosec. This is because the global variable trace_clock_id was used.
When we use multiple buffers, clock_id of each sub-buffer should be used. Then,
this patch uses tr->clock_id instead of the global variable trace_clock_id.
[ Basically, this fixes a regression where the multibuffer code changed the
trace_clock file to update tr->clock_id but the traces still use the old
global trace_clock_id variable, negating the file's effect. The global
trace_clock_id variable is obsolete and removed. - SR ]
Link: http://lkml.kernel.org/r/20130423013239.22334.7394.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The irqsoff tracer records the max time that interrupts are disabled.
There are hooks in the assembly code that calls back into the tracer when
interrupts are disabled or enabled.
When they are enabled, the tracer checks if the amount of time they
were disabled is larger than the previous recorded max interrupts off
time. If it is, it creates a snapshot of the currently running trace
to store where the last largest interrupts off time was held and how
it happened.
During testing, this RCU lockdep dump appeared:
[ 1257.829021] ===============================
[ 1257.829021] [ INFO: suspicious RCU usage. ]
[ 1257.829021] 3.10.0-rc1-test+ #171 Tainted: G W
[ 1257.829021] -------------------------------
[ 1257.829021] /home/rostedt/work/git/linux-trace.git/include/linux/rcupdate.h:780 rcu_read_lock() used illegally while idle!
[ 1257.829021]
[ 1257.829021] other info that might help us debug this:
[ 1257.829021]
[ 1257.829021]
[ 1257.829021] RCU used illegally from idle CPU!
[ 1257.829021] rcu_scheduler_active = 1, debug_locks = 0
[ 1257.829021] RCU used illegally from extended quiescent state!
[ 1257.829021] 2 locks held by trace-cmd/4831:
[ 1257.829021] #0: (max_trace_lock){......}, at: [<ffffffff810e2b77>] stop_critical_timing+0x1a3/0x209
[ 1257.829021] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff810dae5a>] __update_max_tr+0x88/0x1ee
[ 1257.829021]
[ 1257.829021] stack backtrace:
[ 1257.829021] CPU: 3 PID: 4831 Comm: trace-cmd Tainted: G W 3.10.0-rc1-test+ #171
[ 1257.829021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
[ 1257.829021] 0000000000000001 ffff880065f49da8 ffffffff8153dd2b ffff880065f49dd8
[ 1257.829021] ffffffff81092a00 ffff88006bd78680 ffff88007add7500 0000000000000003
[ 1257.829021] ffff88006bd78680 ffff880065f49e18 ffffffff810daebf ffffffff810dae5a
[ 1257.829021] Call Trace:
[ 1257.829021] [<ffffffff8153dd2b>] dump_stack+0x19/0x1b
[ 1257.829021] [<ffffffff81092a00>] lockdep_rcu_suspicious+0x109/0x112
[ 1257.829021] [<ffffffff810daebf>] __update_max_tr+0xed/0x1ee
[ 1257.829021] [<ffffffff810dae5a>] ? __update_max_tr+0x88/0x1ee
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810dbf85>] update_max_tr_single+0x11d/0x12d
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810e2b15>] stop_critical_timing+0x141/0x209
[ 1257.829021] [<ffffffff8109569a>] ? trace_hardirqs_on+0xd/0xf
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810e3057>] time_hardirqs_on+0x2a/0x2f
[ 1257.829021] [<ffffffff811002b9>] ? user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff8109550c>] trace_hardirqs_on_caller+0x16/0x197
[ 1257.829021] [<ffffffff8109569a>] trace_hardirqs_on+0xd/0xf
[ 1257.829021] [<ffffffff811002b9>] user_enter+0xfd/0x107
[ 1257.829021] [<ffffffff810029b4>] do_notify_resume+0x92/0x97
[ 1257.829021] [<ffffffff8154bdca>] int_signal+0x12/0x17
What happened was entering into the user code, the interrupts were enabled
and a max interrupts off was recorded. The trace buffer was saved along with
various information about the task: comm, pid, uid, priority, etc.
The uid is recorded with task_uid(tsk). But this is a macro that uses rcu_read_lock()
to retrieve the data, and this happened to happen where RCU is blind (user_enter).
As only the preempt and irqs off tracers can have this happen, and they both
only have the tsk == current, if tsk == current, use current_uid() instead of
task_uid(), as current_uid() does not use RCU as only current can change its uid.
This fixes the RCU suspicious splat.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The branch selftest calls trace_test_buffer(), but with the new code
it expects the first parameter to be a pointer to a struct trace_buffer.
All self tests were changed but the branch selftest was missed.
This caused either a crash or failed test when the branch selftest was
enabled.
Link: http://lkml.kernel.org/r/20130529141333.GA24064@localhost
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As rcu_dereference_raw() under RCU debug config options can add quite a
bit of checks, and that tracing uses rcu_dereference_raw(), these checks
happen with the function tracer. The function tracer also happens to trace
these debug checks too. This added overhead can livelock the system.
Have the function tracer use the new RCU _notrace equivalents that do
not do the debug checks for RCU.
Link: http://lkml.kernel.org/r/20130528184209.467603904@goodmis.org
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing infrastructure sets up for possible CPUs, but it uses
the ring buffer polling, it is possible to call the ring buffer
polling code with a CPU that hasn't been allocated. This will cause
a kernel oops when it access a ring buffer cpu buffer that is part
of the possible cpus but hasn't been allocated yet as the CPU has never
been online.
Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If ftrace=<tracer> is on the kernel command line, when that tracer is
registered, it will be initiated by tracing_set_tracer() to execute that
tracer.
The nop tracer is just a stub tracer that is used to have no tracer
enabled. It is assigned at early bootup as it is the default tracer.
But if ftrace=nop is on the kernel command line, the registering of the
nop tracer will call tracing_set_tracer() which will try to execute
the nop tracer. But it expects tr->current_trace to be assigned something
as it usually is assigned to the nop tracer. As it hasn't been assigned
to anything yet, it causes the system to crash.
The simple fix is to move the tr->current_trace = nop before registering
the nop tracer. The functionality is still the same as the nop tracer
doesn't do anything anyway.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since try_module_get() returns false( = 0) when it fails to
pindown a module, event_enable_func() returns 0 which means
"succeed". This can cause a kernel panic when the entry
is removed, because the event is already released.
This fixes the bug by returning -EBUSY, because the reason
why it fails is that the module is being removed at that time.
Link: http://lkml.kernel.org/r/20130516114848.13508.97899.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to sparse warning, print_*probe_event static because
those functions are not directly called from outside.
Link: http://lkml.kernel.org/r/20130513115839.6545.83067.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use rcu_dereference_raw() for accessing tp->files. Because the
write-side uses rcu_assign_pointer() for memory barrier,
the read-side also has to use rcu_dereference_raw() with
read memory barrier.
Link: http://lkml.kernel.org/r/20130513115834.6545.17022.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Special preds are created when folding a series of preds that
can be done in serial. These are allocated in an ops field of
the pred structure. But they were never freed, causing memory
leaks.
This was discovered using the kmemleak checker:
unreferenced object 0xffff8800797fd5e0 (size 32):
comm "swapper/0", pid 1, jiffies 4294690605 (age 104.608s)
hex dump (first 32 bytes):
00 00 01 00 03 00 05 00 07 00 09 00 0b 00 0d 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff814b52af>] kmemleak_alloc+0x73/0x98
[<ffffffff8111ff84>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
[<ffffffff81120e68>] __kmalloc+0xd7/0x125
[<ffffffff810d47eb>] kcalloc.constprop.24+0x2d/0x2f
[<ffffffff810d4896>] fold_pred_tree_cb+0xa9/0xf4
[<ffffffff810d3781>] walk_pred_tree+0x47/0xcc
[<ffffffff810d5030>] replace_preds.isra.20+0x6f8/0x72f
[<ffffffff810d50b5>] create_filter+0x4e/0x8b
[<ffffffff81b1c30d>] ftrace_test_event_filter+0x5a/0x155
[<ffffffff8100028d>] do_one_initcall+0xa0/0x137
[<ffffffff81afbedf>] kernel_init_freeable+0x14d/0x1dc
[<ffffffff814b24b7>] kernel_init+0xe/0xdb
[<ffffffff814d539c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kprobes up to par with the latest changes to ftrace (multi buffering
and the new function probes).
He also discovered and fixed some bugs in doing so. When pulling in his
patches, I also found a few minor bugs as well and fixed them.
This also includes a compile fix for some archs that select the ring buffer
but not tracing.
I based this off of the last patch you took from me that fixed the merge
conflict error, as that was the commit that had all the changes I needed
for this set of changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRjYnJAAoJEOdOSU1xswtMg9EH/iFs438FgrNMk2ZdQftmqcqA
cqcactHo1mmoHjAoLZT/oDBjEThhVUuqzMXrFRutSYcTh4PsQEC3arX0mpsC+T12
UEEV/tZS3TXH+GXEyrOit/O3kzntQcDHwJDV4+0n80IrJmw4IDZbnV3R8DWjS6wp
so+dq0A1pwehcG/upgpw1oTKsGv1G/p6vyf968B6W44icHEClLiph4JE2kzE6D3r
fzSpOLaQoBEvwIRf6xRKxi240VqIItXwfG7pwNpPpSC37gRLzm74zGr+Sj93/k1y
pARbZ/5XO7/pcVYQYupErRAoV5in+QMZ67k5G1vQIvyOS9r039catbQf/7PkzcI=
=EZCE
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing/kprobes update from Steven Rostedt:
"The majority of these changes are from Masami Hiramatsu bringing
kprobes up to par with the latest changes to ftrace (multi buffering
and the new function probes).
He also discovered and fixed some bugs in doing so. When pulling in
his patches, I also found a few minor bugs as well and fixed them.
This also includes a compile fix for some archs that select the ring
buffer but not tracing.
I based this off of the last patch you took from me that fixed the
merge conflict error, as that was the commit that had all the changes
I needed for this set of changes."
* tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Support soft-mode disabling
tracing/kprobes: Support ftrace_event_file base multibuffer
tracing/kprobes: Pass trace_probe directly from dispatcher
tracing/kprobes: Increment probe hit-count even if it is used by perf
tracing/kprobes: Use bool for retprobe checker
ftrace: Fix function probe when more than one probe is added
ftrace: Fix the output of enabled_functions debug file
ftrace: Fix locking in register_ftrace_function_probe()
tracing: Add helper function trace_create_new_event() to remove duplicate code
tracing: Modify soft-mode only if there's no other referrer
tracing: Indicate enabled soft-mode in enable file
tracing/kprobes: Fix to increment return event probe hit-count
ftrace: Cleanup regex_lock and ftrace_lock around hash updating
ftrace, kprobes: Fix a deadlock on ftrace_regex_lock
ftrace: Have ftrace_regex_write() return either read or error
tracing: Return error if register_ftrace_function_probe() fails for event_enable_func()
tracing: Don't succeed if event_enable_func did not register anything
ring-buffer: Select IRQ_WORK
Support soft-mode disabling on kprobe-based dynamic events.
Soft-disabling is just ignoring recording if the soft disabled
flag is set.
Link: http://lkml.kernel.org/r/20130509054454.30398.7237.stgit@mhiramat-M0-7522
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Increment probe hit-count for profiling even if it is used
by perf tool. Same thing has already done in trace_uprobe.
Link: http://lkml.kernel.org/r/20130509054436.30398.21133.stgit@mhiramat-M0-7522
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enabled_functions debugfs file was created to be able to see
what functions have been modified from nops to calling a tracer.
The current method uses the counter in the function record.
As when a ftrace_ops is registered to a function, its count
increases. But that doesn't mean that the function is actively
being traced. /proc/sys/kernel/ftrace_enabled can be set to zero
which would disable it, as well as something can go wrong and
we can think its enabled when only the counter is set.
The record's FTRACE_FL_ENABLED flag is set or cleared when its
function is modified. That is a much more accurate way of knowing
what function is enabled or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The iteration of the ftrace function list and the call to
ftrace_match_record() need to be protected by the ftrace_lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both __trace_add_new_event() and __trace_early_add_new_event() do
basically the same thing, except that __trace_add_new_event() does
a little more.
Instead of having duplicate code between the two functions, add
a helper function trace_create_new_event() that both can use.
This will help against having bugs fixed in one function but not
the other.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Modify soft-mode flag only if no other soft-mode referrer
(currently only the ftrace triggers) by using a reference
counter in each ftrace_event_file.
Without this fix, adding and removing several different
enable/disable_event triggers on the same event clear
soft-mode bit from the ftrace_event_file. This also
happens with a typo of glob on setting triggers.
e.g.
# echo vfs_symlink:enable_event:net:netif_rx > set_ftrace_filter
# cat events/net/netif_rx/enable
0*
# echo typo_func:enable_event:net:netif_rx > set_ftrace_filter
# cat events/net/netif_rx/enable
0
# cat set_ftrace_filter
#### all functions enabled ####
vfs_symlink:enable_event:net:netif_rx:unlimited
As above, we still have a trigger, but soft-mode is gone.
Link: http://lkml.kernel.org/r/20130509054429.30398.7464.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Indicate enabled soft-mode event as "1*" in "enable" file
for each event, because it can be soft-disabled when disable_event
trigger is hit.
Link: http://lkml.kernel.org/r/20130509054426.30398.28202.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cleanup regex_lock and ftrace_lock locking points around
ftrace_ops hash update code.
The new rule is that regex_lock protects ops->*_hash
read-update-write code for each ftrace_ops. Usually,
hash update is done by following sequence.
1. allocate a new local hash and copy the original hash.
2. update the local hash.
3. move(actually, copy) back the local hash to ftrace_ops.
4. update ftrace entries if needed.
5. release the local hash.
This makes regex_lock protect #1-#4, and ftrace_lock
to protect #3, #4 and adding and removing ftrace_ops from the
ftrace_ops_list. The ftrace_lock protects #3 as well because
the move functions update the entries too.
Link: http://lkml.kernel.org/r/20130509054421.30398.83411.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix a deadlock on ftrace_regex_lock which happens when setting
an enable_event trigger on dynamic kprobe event as below.
----
sh-2.05b# echo p vfs_symlink > kprobe_events
sh-2.05b# echo vfs_symlink:enable_event:kprobes:p_vfs_symlink_0 > set_ftrace_filter
=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #35 Not tainted
---------------------------------------------
sh/72 is trying to acquire lock:
(ftrace_regex_lock){+.+.+.}, at: [<ffffffff810ba6c1>] ftrace_set_hash+0x81/0x1f0
but task is already holding lock:
(ftrace_regex_lock){+.+.+.}, at: [<ffffffff810b7cbd>] ftrace_regex_write.isra.29.part.30+0x3d/0x220
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(ftrace_regex_lock);
lock(ftrace_regex_lock);
*** DEADLOCK ***
----
To fix that, this introduces a finer regex_lock for each ftrace_ops.
ftrace_regex_lock is too big of a lock which protects all
filter/notrace_hash operations, but it doesn't need to be a global
lock after supporting multiple ftrace_ops because each ftrace_ops
has its own filter/notrace_hash.
Link: http://lkml.kernel.org/r/20130509054417.30398.84254.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Added initialization flag and automate mutex initialization for
non ftrace.c ftrace_probes. ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_regex_write() reads the result of ftrace_process_regex()
which can sometimes return a positive number, only consider a
failure if the return is negative. Otherwise, it will skip possible
other registered probes and by returning a positive number that
wasn't read, it will confuse the user processes doing the writing.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
register_ftrace_function_probe() returns the number of functions
it registered, which can be zero, it can also return a negative number
if something went wrong. But event_enable_func() only checks for
the case that it didn't register anything, it needs to also check
for the case that something went wrong and return that error code
as well.
Added some comments about the code as well, to make it more
understandable.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Return 0 instead of the number of activated ftrace function probes if
event_enable_func succeeded and return an error code if it failed or
did not register any functions. But it currently returns the number
of registered functions and if it didn't register anything, it returns 0,
but that is considered success.
This also fixes the return value. As if it succeeds, it returns the
number of functions that were enabled, which is returned back to
the user in ftrace_regex_write (the write() return code). If only
one function is enabled, then the return code of the write is one,
and this can confuse the user program in thinking it only wrote 1
byte.
Link: http://lkml.kernel.org/r/20130509054413.30398.55650.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Rewrote change log to reflect that this fixes two bugs - SR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull block driver updates from Jens Axboe:
"It might look big in volume, but when categorized, not a lot of
drivers are touched. The pull request contains:
- mtip32xx fixes from Micron.
- A slew of drbd updates, this time in a nicer series.
- bcache, a flash/ssd caching framework from Kent.
- Fixes for cciss"
* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
bcache: Use bd_link_disk_holder()
bcache: Allocator cleanup/fixes
cciss: bug fix to prevent cciss from loading in kdump crash kernel
cciss: add cciss_allow_hpsa module parameter
drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
mtip32xx: Workaround for unaligned writes
bcache: Make sure blocksize isn't smaller than device blocksize
bcache: Fix merge_bvec_fn usage for when it modifies the bvm
bcache: Correctly check against BIO_MAX_PAGES
bcache: Hack around stuff that clones up to bi_max_vecs
bcache: Set ra_pages based on backing device's ra_pages
bcache: Take data offset from the bdev superblock.
mtip32xx: mtip32xx: Disable TRIM support
mtip32xx: fix a smatch warning
bcache: Disable broken btree fuzz tester
bcache: Fix a format string overflow
bcache: Fix a minor memory leak on device teardown
bcache: Documentation updates
bcache: Use WARN_ONCE() instead of __WARN()
bcache: Add missing #include <linux/prefetch.h>
...
As the wake up logic for waiters on the buffer has been moved
from the tracing code to the ring buffer, it requires also adding
IRQ_WORK as the wake up code is performed via irq_work.
This fixes compile breakage when a user of the ring buffer is selected
but tracing and irq_work are not.
Link http://lkml.kernel.org/r/20130503115332.GT8356@rric.localhost
Cc: Arnd Bergmann <arnd@arndb.de>
Reported-by: Robert Richter <rric@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf updates from Ingo Molnar:
"Features:
- Add "uretprobes" - an optimization to uprobes, like kretprobes are
an optimization to kprobes. "perf probe -x file sym%return" now
works like kretprobes. By Oleg Nesterov.
- Introduce per core aggregation in 'perf stat', from Stephane
Eranian.
- Add memory profiling via PEBS, from Stephane Eranian.
- Event group view for 'annotate' in --stdio, --tui and --gtk, from
Namhyung Kim.
- Add support for AMD NB and L2I "uncore" counters, by Jacob Shin.
- Add Ivy Bridge-EP uncore support, by Zheng Yan
- IBM zEnterprise EC12 oprofile support patchlet from Robert Richter.
- Add perf test entries for checking breakpoint overflow signal
handler issues, from Jiri Olsa.
- Add perf test entry for for checking number of EXIT events, from
Namhyung Kim.
- Add perf test entries for checking --cpu in record and stat, from
Jiri Olsa.
- Introduce perf stat --repeat forever, from Frederik Deweerdt.
- Add --no-demangle to report/top, from Namhyung Kim.
- PowerPC fixes plus a couple of cleanups/optimizations in uprobes
and trace_uprobes, by Oleg Nesterov.
Various fixes and refactorings:
- Fix dependency of the python binding wrt libtraceevent, from
Naohiro Aota.
- Simplify some perf_evlist methods and to allow 'stat' to share code
with 'record' and 'trace', by Arnaldo Carvalho de Melo.
- Remove dead code in related to libtraceevent integration, from
Namhyung Kim.
- Revert "perf sched: Handle PERF_RECORD_EXIT events" to get 'perf
sched lat' back working, by Arnaldo Carvalho de Melo
- We don't use Newt anymore, just plain libslang, by Arnaldo Carvalho
de Melo.
- Kill a bunch of die() calls, from Namhyung Kim.
- Fix build on non-glibc systems due to libio.h absence, from Cody P
Schafer.
- Remove some perf_session and tracing dead code, from David Ahern.
- Honor parallel jobs, fix from Borislav Petkov
- Introduce tools/lib/lk library, initially just removing duplication
among tools/perf and tools/vm. from Borislav Petkov
... and many more I missed to list, see the shortlog and git log for
more details."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (136 commits)
perf/x86/intel/P4: Robistify P4 PMU types
perf/x86/amd: Fix AMD NB and L2I "uncore" support
perf/x86/amd: Remove old-style NB counter support from perf_event_amd.c
perf/x86: Check all MSRs before passing hw check
perf/x86/amd: Add support for AMD NB and L2I "uncore" counters
perf/x86/intel: Add Ivy Bridge-EP uncore support
perf/x86/intel: Fix SNB-EP CBO and PCU uncore PMU filter management
perf/x86: Avoid kfree() in CPU_{STARTING,DYING}
uprobes/perf: Avoid perf_trace_buf_prepare/submit if ->perf_events is empty
uprobes/tracing: Don't pass addr=ip to perf_trace_buf_submit()
uprobes/tracing: Change create_trace_uprobe() to support uretprobes
uprobes/tracing: Make seq_printf() code uretprobe-friendly
uprobes/tracing: Make register_uprobe_event() paths uretprobe-friendly
uprobes/tracing: Make uprobe_{trace,perf}_print() uretprobe-friendly
uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()
uprobes/tracing: Introduce uprobe_{trace,perf}_print() helpers
uprobes/tracing: Generalize struct uprobe_trace_entry_head
uprobes/tracing: Kill the pointless local_save_flags/preempt_count calls
uprobes/tracing: Kill the pointless seq_print_ip_sym() call
uprobes/tracing: Kill the pointless task_pt_regs() calls
...
During the 3.10 merge, a conflict happened and the resolution was
almost, but not quite, correct. An if statement was reversed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[ Duh. That was just silly of me - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Along with the usual minor fixes and clean ups there are a few major
changes with this pull request.
1) Multiple buffers for the ftrace facility
This feature has been requested by many people over the last few years.
I even heard that Google was about to implement it themselves. I finally
had time and cleaned up the code such that you can now create multiple
instances of the ftrace buffer and have different events go to different
buffers. This way, a low frequency event will not be lost in the noise
of a high frequency event.
Note, currently only events can go to different buffers, the tracers
(ie. function, function_graph and the latency tracers) still can only
be written to the main buffer.
2) The function tracer triggers have now been extended.
The function tracer had two triggers. One to enable tracing when a
function is hit, and one to disable tracing. Now you can record a
stack trace on a single (or many) function(s), take a snapshot of the
buffer (copy it to the snapshot buffer), and you can enable or disable
an event to be traced when a function is hit.
3) A perf clock has been added.
A "perf" clock can be chosen to be used when tracing. This will cause
ftrace to use the same clock as perf uses, and hopefully this will make
it easier to interleave the perf and ftrace data for analysis.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRfnTPAAoJEOdOSU1xswtMqYYH/1WIdrwXmxHflErnYkCIr3sU
QtYae2K5A1HcgiqOvRJrdWMOt016iMx5CaQQyBFM1vvMiPY0sTWRmwNxDfZzz9LN
10jRvWEzZSLtzl+a9mkFWLEpr5nR/QODOxkWFCnRWscp46sp04LSTxGDYsOnPQZB
sam/AQ1h4xA+DqDBChm9BDEUEPorGleTlN54LBaCGgSFGvrbF+eAg2s4vHNAQAvQ
8d5xjSE9zC7J+FqbVxvJTbKI3+EqKL6hMsJKsKfi0SI+FuxBaFMSltXck5zKyTI4
HpNJzXCmw+v90Tju7oMkPHh6RTbESPCHoGU+wqE52fM6m7oScVeuI/kfc6USwU4=
=W1n+
-----END PGP SIGNATURE-----
Merge tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Along with the usual minor fixes and clean ups there are a few major
changes with this pull request.
1) Multiple buffers for the ftrace facility
This feature has been requested by many people over the last few
years. I even heard that Google was about to implement it themselves.
I finally had time and cleaned up the code such that you can now
create multiple instances of the ftrace buffer and have different
events go to different buffers. This way, a low frequency event will
not be lost in the noise of a high frequency event.
Note, currently only events can go to different buffers, the tracers
(ie function, function_graph and the latency tracers) still can only
be written to the main buffer.
2) The function tracer triggers have now been extended.
The function tracer had two triggers. One to enable tracing when a
function is hit, and one to disable tracing. Now you can record a
stack trace on a single (or many) function(s), take a snapshot of the
buffer (copy it to the snapshot buffer), and you can enable or disable
an event to be traced when a function is hit.
3) A perf clock has been added.
A "perf" clock can be chosen to be used when tracing. This will cause
ftrace to use the same clock as perf uses, and hopefully this will
make it easier to interleave the perf and ftrace data for analysis."
* tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (82 commits)
tracepoints: Prevent null probe from being added
tracing: Compare to 1 instead of zero for is_signed_type()
tracing: Remove obsolete macro guard _TRACE_PROFILE_INIT
ftrace: Get rid of ftrace_profile_bits
tracing: Check return value of tracing_init_dentry()
tracing: Get rid of unneeded key calculation in ftrace_hash_move()
tracing: Reset ftrace_graph_filter_enabled if count is zero
tracing: Fix off-by-one on allocating stat->pages
kernel: tracing: Use strlcpy instead of strncpy
tracing: Update debugfs README file
tracing: Fix ftrace_dump()
tracing: Rename trace_event_mutex to trace_event_sem
tracing: Fix comment about prefix in arch_syscall_match_sym_name()
tracing: Convert trace_destroy_fields() to static
tracing: Move find_event_field() into trace_events.c
tracing: Use TRACE_MAX_PRINT instead of constant
tracing: Use pr_warn_once instead of open coded implementation
ring-buffer: Add ring buffer startup selftest
tracing: Bring Documentation/trace/ftrace.txt up to date
tracing: Add "perf" trace_clock
...
Conflicts:
kernel/trace/ftrace.c
kernel/trace/trace.c
Conflicts:
arch/x86/kernel/cpu/perf_event_intel.c
Merge in the latest fixes before applying new patches, resolve the conflict.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts commit 3a366e614d.
Wanlong Gao reports that it causes a kernel panic on his machine several
minutes after boot. Reverting it removes the panic.
Jens says:
"It's not quite clear why that is yet, so I think we should just revert
the commit for 3.9 final (which I'm assuming is pretty close).
The wifi is crap at the LSF hotel, so sending this email instead of
queueing up a revert and pull request."
Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Requested-by: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change uprobe_perf_print()
to return if hlist_empty(call->perf_events).
Note: this is not uprobe-specific, we can change other users too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
uprobe_perf_print() passes addr=ip to perf_trace_buf_submit() for
no reason. This sets perf_sample_data->addr for PERF_SAMPLE_ADDR,
we already have perf_sample_data->ip initialized if PERF_SAMPLE_IP.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Finally change create_trace_uprobe() to check if argv[0][0] == 'r'
and pass the correct "is_ret" to alloc_trace_uprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change probes_seq_show() and print_uprobe_event() to check
is_ret_probe() and print the correct data.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change uprobe_event_define_fields(), and __set_print_fmt() to check
is_ret_probe() and use the appropriate format/fields.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Change uprobe_trace_print() and uprobe_perf_print() to check
is_ret_probe() and fill ring_buffer_event accordingly.
Also change uprobe_trace_func() and uprobe_perf_func() to not
_print() if is_ret_probe() is true. Note that we keep ->handler()
nontrivial even for uretprobe, we need this for filtering and for
other potential extensions.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Create the new functions we need to support uretprobes, and change
alloc_trace_uprobe() to initialize consumer.ret_handler if the new
"is_ret" argument is true. Curently this argument is always false,
so the new code is never called and is_ret_probe(tu) is false too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
Extract the output code from uprobe_trace_func() and uprobe_perf_func()
into the new helpers, they will be used by ->ret_handler() too. We also
add the unused "unsigned long func" argument in advance, to simplify the
next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
struct uprobe_trace_entry_head has a single member for reporting,
"unsigned long ip". If we want to support uretprobes we need to
create another struct which has "func" and "ret_ip" and duplicate
a lot of functions, like trace_kprobe.c does.
To avoid this copy-and-paste horror we turn ->ip into ->vaddr[]
and add couple of trivial helpers to calculate sizeof/data. This
uglifies the code a bit, but this allows us to avoid a lot more
complications later, when we add the support for ret-probes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
uprobe_trace_func() is never called with irqs or preemption
disabled, no need to ask preempt_count() or local_save_flags().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Anton Arapov <anton@redhat.com>
seq_print_ip_sym(ip) in print_uprobe_event() is pointless,
kallsyms_lookup(ip) can not resolve a user-space address.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
uprobe_trace_func() and uprobe_perf_func() do not need task_pt_regs(),
we already have "struct pt_regs *regs".
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Anton Arapov <anton@redhat.com>
It seems that function profiler's hash size is fixed at 1024. Add and
use FTRACE_PROFILE_HASH_BITS instead and update hash size macro.
Link: http://lkml.kernel.org/r/1365551750-4504-1-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_graph_count can be decreased with a "!" pattern, so that
the enabled flag should be updated too.
Link: http://lkml.kernel.org/r/1365663698-2413-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_filter_lseek is now used with ftrace_pid_fops, it needs to
be moved out of the #ifdef CONFIG_DYNAMIC_FTRACE section as the
ftrace_pid_fops is defined when DYNAMIC_FTRACE is not.
Cc: stable@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently set_ftrace_pid and set_graph_function files use seq_lseek
for their fops. However seq_open() is called only for FMODE_READ in
the fops->open() so that if an user tries to seek one of those file
when she open it for writing, it sees NULL seq_file and then panic.
It can be easily reproduced with following command:
$ cd /sys/kernel/debug/tracing
$ echo 1234 | sudo tee -a set_ftrace_pid
In this example, GNU coreutils' tee opens the file with fopen(, "a")
and then the fopen() internally calls lseek().
Link: http://lkml.kernel.org/r/1365663302-2170-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On the failure path, stat->start and stat->pages will refer same page.
So it'll attempt to free the same page again and get kernel panic.
Link: http://lkml.kernel.org/r/1364820385-32027-1-git-send-email-namhyung@kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: stable@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use strlcpy() instead of strncpy() as it will always add a '\0'
to the end of the string even if the buffer is smaller than what
is being copied.
Link: http://lkml.kernel.org/r/51624254.30301@asianux.com
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing control loop used by perf spits out a warning
if the called function is not a control function. This is because
the control function references a per cpu allocated data structure
on struct ftrace_ops that is not allocated for other types of
functions.
commit 0a016409e4 "ftrace: Optimize the function tracer list loop"
Had an optimization done to all function tracing loops to optimize
for a single registered ops. Unfortunately, this allows for a slight
race when tracing starts or ends, where the stub function might be
called after the current registered ops is removed. In this case we
get the following dump:
root# perf stat -e ftrace:function sleep 1
[ 74.339105] WARNING: at include/linux/ftrace.h:209 ftrace_ops_control_func+0xde/0xf0()
[ 74.349522] Hardware name: PRIMERGY RX200 S6
[ 74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspk
r microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm li
bahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[ 74.446233] Pid: 1377, comm: perf Tainted: G W 3.9.0-rc1 #1
[ 74.453458] Call Trace:
[ 74.456233] [<ffffffff81062e3f>] warn_slowpath_common+0x7f/0xc0
[ 74.462997] [<ffffffff810fbc60>] ? rcu_note_context_switch+0xa0/0xa0
[ 74.470272] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
[ 74.478117] [<ffffffff81062e9a>] warn_slowpath_null+0x1a/0x20
[ 74.484681] [<ffffffff81102ede>] ftrace_ops_control_func+0xde/0xf0
[ 74.491760] [<ffffffff8162f400>] ftrace_call+0x5/0x2f
[ 74.497511] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
[ 74.503486] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f
[ 74.509500] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
[ 74.516088] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.522268] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50
[ 74.528837] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0
[ 74.536696] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.542878] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
[ 74.548869] [<ffffffff81105c67>] unregister_ftrace_function+0x27/0x50
[ 74.556243] [<ffffffff8111eadf>] perf_ftrace_event_register+0x9f/0x140
[ 74.563709] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.569887] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50
[ 74.575898] [<ffffffff8111e94e>] perf_trace_destroy+0x2e/0x50
[ 74.582505] [<ffffffff81127ba9>] tp_perf_event_destroy+0x9/0x10
[ 74.589298] [<ffffffff811295d0>] free_event+0x70/0x1a0
[ 74.595208] [<ffffffff8112a579>] perf_event_release_kernel+0x69/0xa0
[ 74.602460] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40
[ 74.608667] [<ffffffff8112a640>] put_event+0x90/0xc0
[ 74.614373] [<ffffffff8112a740>] perf_release+0x10/0x20
[ 74.620367] [<ffffffff811a3044>] __fput+0xf4/0x280
[ 74.625894] [<ffffffff811a31de>] ____fput+0xe/0x10
[ 74.631387] [<ffffffff81083697>] task_work_run+0xa7/0xe0
[ 74.637452] [<ffffffff81014981>] do_notify_resume+0x71/0xb0
[ 74.643843] [<ffffffff8162fa92>] int_signal+0x12/0x17
To fix this a new ftrace_ops flag is added that denotes the ftrace_list_end
ftrace_ops stub as just that, a stub. This flag is now checked in the
control loop and the function is not called if the flag is set.
Thanks to Jovi for not just reporting the bug, but also pointing out
where the bug was in the code.
Link: http://lkml.kernel.org/r/514A8855.7090402@redhat.com
Link: http://lkml.kernel.org/r/1364377499-1900-15-git-send-email-jovi.zhangwei@huawei.com
Tested-by: WANG Chao <chaowang@redhat.com>
Reported-by: WANG Chao <chaowang@redhat.com>
Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we reenable ftrace via syctl, we currently set ftrace_trace_function
based on the previous simplistic algorithm. This is inconsistent with
what update_ftrace_function does. So better call that helper instead.
Link: http://lkml.kernel.org/r/5151D26F.1070702@siemens.com
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The commit 34600f0e9 "tracing: Fix race with max_tr and changing tracers"
fixed the updating of the main buffers with the race of changing
tracers, but left out the fix to the updating of just a per cpu buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For NUL terminated string we always need to set '\0' at the end.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: rostedt@goodmis.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/516243B7.9020405@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For NUL terminated string we always need to set '\0' at the end.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: rostedt@goodmis.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/51624254.30301@asianux.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Exported so it can be used by bcache's tracepoints
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
Update the README file in debugfs/tracing to something more useful.
What's currently in the file is very old and what it shows doesn't
have much use. Heck, it tells you how to mount debugfs! But to read
this file you would have already needed to mount it.
Replace the file with current up-to-date information. It's rather
limited, but what do you expect from a pseudo README file.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_dump() had a lot of issues. What ftrace_dump() does, is when
ftrace_dump_on_oops is set (via a kernel parameter or sysctl), it
will dump out the ftrace buffers to the console when either a oops,
panic, or a sysrq-z occurs.
This was written a long time ago when ftrace was fragile to recursion.
But it wasn't written well even for that.
There's a possible deadlock that can occur if a ftrace_dump() is happening
and an NMI triggers another dump. This is because it grabs a lock
before checking if the dump ran.
It also totally disables ftrace, and tracing for no good reasons.
As the ring_buffer now checks if it is read via a oops or NMI, where
there's a chance that the buffer gets corrupted, it will disable
itself. No need to have ftrace_dump() do the same.
ftrace_dump() is now cleaned up where it uses an atomic counter to
make sure only one dump happens at a time. A simple atomic_inc_return()
is enough that is needed for both other CPUs and NMIs. No need for
a spinlock, as if one CPU is running the dump, no other CPU needs
to do it too.
The tracing_on variable is turned off and not turned on. The original
code did this, but it wasn't pretty. By just disabling this variable
we get the result of not seeing traces that happen between crashes.
For sysrq-z, it doesn't get turned on, but the user can always write
a '1' to the tracing_on file. If they are using sysrq-z, then they should
know about tracing_on.
The new code is much easier to read and less error prone. No more
deadlock possibility when an NMI triggers here.
Reported-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_event_mutex is an rw semaphore now, not a mutex, change the name.
Link: http://lkml.kernel.org/r/513D843B.40109@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ Forward ported to my new code ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ppc64 has its own syscall prefix like ".SyS" or ".sys". Make the
comment in arch_syscall_match_sym_name() more understandable.
Link: http://lkml.kernel.org/r/513D842F.40205@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
trace_destroy_fields() is not used outside of the file. It can be
a static function.
Link: http://lkml.kernel.org/r/513D842A.2000907@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
By moving find_event_field() and trace_find_field() into trace_events.c,
the ftrace_common_fields list and trace_get_fields() can become local to
the trace_events.c file.
find_event_field() is renamed to trace_find_event_field() to conform to
the tracing global function names.
Link: http://lkml.kernel.org/r/513D8426.9070109@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
[ rostedt: Modified trace_find_field() to trace_find_event_field() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
TRACE_MAX_PRINT macro is defined, but is not used.
Link: http://lkml.kernel.org/r/513D8421.4070404@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use pr_warn_once, instead of making an open coded implementation.
Link: http://lkml.kernel.org/r/513D8419.20400@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When testing my large changes to the ftrace system, there was
a bug that looked like the ring buffer was dropping events.
I wrote up a quick integrity checker of the ring buffer to
see if it was.
Although the bug ended up being something stupid I did in ftrace,
and had nothing to do with the ring buffer, I figured if I spent
the time to write up this test, I might as well include it in the
kernel.
I cleaned it up a bit, as the original version was rather ugly.
Not saying this version is pretty, but it's a beauty queen
compared to what I original wrote.
To enable the start up test, set CONFIG_RING_BUFFER_STARTUP_TEST.
Note, it runs for 10 seconds, so it will slow your boot time
by at least 10 more seconds.
What it does is documented in both the comments and the Kconfig
help.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function trace_clock() calls "local_clock()" which is exactly
the same clock that perf uses. I'm not sure why perf doesn't call
trace_clock(), as trace_clock() doesn't have any users.
But now it does. As trace_clock() calls local_clock() like perf does,
I added the trace_clock "perf" option that uses trace_clock().
Now the ftrace buffers can use the same clock as perf uses. This
will be useful when perf starts reading the ftrace buffers, and will
be able to interleave them with the same clock data.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a simple trace clock called "uptime" for those that are
interested in the uptime of the trace. It uses jiffies as that's
the safest method, as other uptime clocks grab seq locks, which could
cause a deadlock if taken from an event or function tracer.
Requested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the only way to stop the latency tracers from doing function
tracing is to fully disable the function tracer from the proc file
system:
echo 0 > /proc/sys/kernel/ftrace_enabled
This is a big hammer approach as it disables function tracing for
all users. This includes kprobes, perf, stack tracer, etc.
Instead, create a function-trace option that the latency tracers can
check to determine if it should enable function tracing or not.
This option can be set or cleared even while the tracer is active
and the tracers will disable or enable function tracing depending
on how the option was set.
Instead of using the proc file, disable latency function tracing with
echo 0 > /debug/tracing/options/function-trace
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Clark Williams <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the depth reported in the stack tracer stack_trace file
does not match the stack_max_size file. This is because the stack_max_size
includes the overhead of stack tracer itself while the depth does not.
The first time a max is triggered, a calculation is not performed that
figures out the overhead of the stack tracer and subtracts it from
the stack_max_size variable. The overhead is stored and is subtracted
from the reported stack size for comparing for a new max.
Now the stack_max_size corresponds to the reported depth:
# cat stack_max_size
4640
# cat stack_trace
Depth Size Location (48 entries)
----- ---- --------
0) 4640 32 _raw_spin_lock+0x18/0x24
1) 4608 112 ____cache_alloc+0xb7/0x22d
2) 4496 80 kmem_cache_alloc+0x63/0x12f
3) 4416 16 mempool_alloc_slab+0x15/0x17
[...]
While testing against and older gcc on x86 that uses mcount instead
of fentry, I found that pasing in ip + MCOUNT_INSN_SIZE let the
stack trace show one more function deep which was missing before.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When gcc 4.6 on x86 is used, the function tracer will use the new
option -mfentry which does a call to "fentry" at every function
instead of "mcount". The significance of this is that fentry is
called as the first operation of the function instead of the mcount
usage of being called after the stack.
This causes the stack tracer to show some bogus results for the size
of the last function traced, as well as showing "ftrace_call" instead
of the function. This is due to the stack frame not being set up
by the function that is about to be traced.
# cat stack_trace
Depth Size Location (48 entries)
----- ---- --------
0) 4824 216 ftrace_call+0x5/0x2f
1) 4608 112 ____cache_alloc+0xb7/0x22d
2) 4496 80 kmem_cache_alloc+0x63/0x12f
The 216 size for ftrace_call includes both the ftrace_call stack
(which includes the saving of registers it does), as well as the
stack size of the parent.
To fix this, if CC_USING_FENTRY is defined, then the stack_tracer
will reserve the first item in stack_dump_trace[] array when
calling save_stack_trace(), and it will fill it in with the parent ip.
Then the code will look for the parent pointer on the stack and
give the real size of the parent's stack pointer:
# cat stack_trace
Depth Size Location (14 entries)
----- ---- --------
0) 2640 48 update_group_power+0x26/0x187
1) 2592 224 update_sd_lb_stats+0x2a5/0x4ac
2) 2368 160 find_busiest_group+0x31/0x1f1
3) 2208 256 load_balance+0xd9/0x662
I'm Cc'ing stable, although it's not urgent, as it only shows bogus
size for item #0, the rest of the trace is legit. It should still be
corrected in previous stable releases.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use the stack of stack_trace_call() instead of check_stack() as
the test pointer for max stack size. It makes it a bit cleaner
and a little more accurate.
Adding stable, as a later fix depends on this patch.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Altough the trace_dump_stack() already skips three functions in
the call to stack trace, which gets the stack trace to start
at the caller of the function, the caller may want to skip some
more too (as it may have helper functions).
Add a skip argument to the trace_dump_stack() that lets the caller
skip back tracing functions that it doesn't care about.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add triggers to function tracer that lets an event get enabled or
disabled when a function is called:
format is:
<function>:enable_event:<system>:<event>[:<count>]
<function>:disable_event:<system>:<event>[:<count>]
echo 'schedule:enable_event:sched:sched_switch' > /debug/tracing/set_ftrace_filter
Every time schedule is called, it will enable the sched_switch event.
echo 'schedule:disable_event:sched:sched_switch:2' > /debug/tracing/set_ftrace_filter
The first two times schedule is called while the sched_switch
event is enabled, it will disable it. It will not count for a time
that the event is already disabled (or enabled for enable_event).
[ fixed return without mutex_unlock() - thanks to Dan Carpenter and smatch ]
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to let triggers enable or disable events, we need a 'soft'
method for doing so. For example, if a function probe is added that
lets a user enable or disable events when a function is called, that
change must be done without taking locks or a mutex, and definitely
it can't sleep. But the full enabling of a tracepoint is expensive.
By adding a 'SOFT_DISABLE' flag, and converting the flags to be updated
without the protection of a mutex (using set/clear_bit()), this soft
disable flag can be used to allow critical sections to enable or disable
events from being traced (after the event has been placed into "SOFT_MODE").
Some caveats though: The comm recorder (to map pids with a comm) can not
be soft disabled (yet). If you disable an event with with a "soft"
disable and wait a while before reading the trace, the comm cache may be
replaced and you'll get a bunch of <...> for comms in the trace.
Reading the "enable" file for an event that is disabled will now give
you "0*" where the '*' denotes that the tracepoint is still active but
the event itself is "disabled".
[ fixed _BIT used in & operation : thanks to Dan Carpenter and smatch ]
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The entries to the probe hash must be freed after a synchronize_sched()
after the entry has been removed from the hash.
As the entries are registered with ops that may have their own callbacks,
and these callbacks may sleep, we can not use call_rcu_sched() because
the rcu callbacks registered with that are called from a softirq context.
Instead of using call_rcu_sched(), manually save the entries on a free_list
and at the end of the loop that removes the entries, do a synchronize_sched()
and then go through the free_list, freeing the entries.
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a function probe is created, each function that the probe is
attached to, a "callback" method is called. On release of the probe,
each function entry calls the "free" method.
First, "callback" is a confusing name and does not really match what
it does. Callback sounds like it will be called when the probe
triggers. But that's not the case. This is really an "init" function,
so lets rename it as such.
Secondly, both "init" and "free" do not pass enough information back
to the handlers. Pass back the ops, ip and data for each time the
method is called. We have the information, might as well use it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
echo 'schedule:snapshot:1' > /debug/tracing/set_ftrace_filter
This will cause the scheduler to trigger a snapshot the next time
it's called (you can use any function that's not called by NMI).
Even though it triggers only once, you still need to remove it with:
echo '!schedule:snapshot:0' > /debug/tracing/set_ftrace_filter
The :1 can be left off for the first command:
echo 'schedule:snapshot' > /debug/tracing/set_ftrace_filter
But this will cause all calls to schedule to trigger a snapshot.
This must be removed without the ':0'
echo '!schedule:snapshot' > /debug/tracing/set_ftrace_filter
As adding a "count" is a different operation (internally).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add alloc_snapshot() and free_snapshot() to allocate and free the
snapshot buffer respectively, and use these to remove duplicate
code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the function probe enables all functions and runs a "hash"
against every function call to see if it should call a probe. This
is extremely wasteful.
Note, a probe is something like:
echo schedule:traceoff > /debug/tracing/set_ftrace_filter
When schedule is called, the probe will disable tracing. But currently,
it has a call back for *all* functions, and checks to see if the
called function is the probe that is needed.
The probe function has been created before ftrace was rewritten to
allow for more than one "op" to be registered by the function tracer.
When probes were created, it couldn't limit the functions without also
limiting normal function calls. But now we can, it's about time
to update the probe code.
Todo, have separate ops for different entries. That is, assign
a ftrace_ops per probe, instead of one op for all probes. But
as there's not many probes assigned, this may not be that urgent.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing probes that trigger traceon or traceoff can be
set to unlimited, or given a count of # of times to execute.
By separating these two types of probes, we can then use the dynamic
ftrace function filtering directly, and remove the brute force
"check if this function called is my probe" routines in ftrace.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The only thing ftrace_trace_onoff_unreg() does is to do a strcmp()
against the cmd parameter to determine what op to unregister. But
this compare is also done after the location that this function is
called (and returns). By moving the check for '!' to unregister after
the strcmp(), the callback function itself can just do the unregister
and we can get rid of the helper function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove some duplicate code and replace it with a helper function.
This makes the code a it cleaner.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add EXPORT_SYMBOL_GPL() to let the tracing_snapshot() functions be
called from modules.
Also add a test to see if the snapshot was called from NMI context
and just warn in the tracing buffer if so, and return.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a few places that ftrace uses trace_printk() for internal
use, but this requires context (normal, softirq, irq, NMI) buffers
to keep things lockless. But the trace_puts() does not, as it can
write the string directly into the ring buffer. Make a internal helper
for trace_puts() and have the internal functions use that.
This way the extra context buffers are not used.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_printk() is extremely fast and is very handy as it can be
used in any context (including NMIs!). But it still requires scanning
the fmt string for parsing the args. Even the trace_bprintk() requires
a scan to know what args will be saved, although it doesn't copy the
format string itself.
Several times trace_printk() has no args, and wastes cpu cycles scanning
the fmt string.
Adding trace_puts() allows the developer to use an even faster
tracing method that only saves the pointer to the string in the
ring buffer without doing any format parsing at all. This will
help remove even more of the "Heisenbug" effect, when debugging.
Also fixed up the F_printk()s for the ftrace internal bprint and print events.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The changce to add the trace_buffer struct to have the trace array
have both the main buffer and max buffer broke the branch tracer
because the change did not update that code. As the branch tracer
adds a significant amount of overhead, and must be selected via
a selection (not a allyesconfig) it was missed in testing.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If debugging the kernel, and the developer wants to use
tracing_snapshot() in places where tracing_snapshot_alloc() may
be difficult (or more likely, the developer is lazy and doesn't
want to bother with tracing_snapshot_alloc() at all), then adding
alloc_snapshot
to the kernel command line parameter will tell ftrace to allocate
the snapshot buffer (if configured) when it allocates the main
tracing buffer.
I also noticed that ring_buffer_expanded and tracing_selftest_disabled
had inconsistent use of boolean "true" and "false" with "0" and "1".
I cleaned that up too.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the tracing startup selftest code into its own function and
when not enabled, always have that function succeed.
This makes the register_tracer() function much more readable.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring buffer updates when done while the ring buffer is active,
needs to be completed on the CPU that is used for the ring buffer
per_cpu buffer. To accomplish this, schedule_work_on() is used to
schedule work on the given CPU.
Now there's no reason to use schedule_work_on() if the process
doing the update happens to be on the CPU that it is processing.
It has already filled the requirement. Instead, just do the work
and continue.
This is needed for tracing_snapshot_alloc() where it may be called
really early in boot, where the work queues have not been set up yet.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new snapshot feature is quite handy. It's a way for the user
to take advantage of the spare buffer that, until then, only
the latency tracers used to "snapshot" the buffer when it hit
a max latency. Now users can trigger a "snapshot" manually when
some condition is hit in a program. But a snapshot currently can
not be triggered by a condition inside the kernel.
With the addition of tracing_snapshot() and tracing_snapshot_alloc(),
snapshots can now be taking when a condition is hit, and the
developer wants to snapshot the case without stopping the trace.
Note, any snapshot will overwrite the old one, so take care
in how this is done.
These new functions are to be used like tracing_on(), tracing_off()
and trace_printk() are. That is, they should never be called
in the mainline Linux kernel. They are solely for the purpose
of debugging.
The tracing_snapshot() will not allocate a buffer, but it is
safe to be called from any context (except NMIs). But if a
snapshot buffer isn't allocated when it is called, it will write
to the live buffer, complaining about the lack of a snapshot
buffer, and then stop tracing (giving you the "permanent snapshot").
tracing_snapshot_alloc() will allocate the snapshot buffer if
it was not already allocated and then take the snapshot. This routine
*may sleep*, and must be called from context that can sleep.
The allocation is done with GFP_KERNEL and not atomic.
If you need a snapshot in an atomic context, say in early boot,
then it is best to call the tracing_snapshot_alloc() before then,
where it will allocate the buffer, and then you can use the
tracing_snapshot() anywhere you want and still get snapshots.
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a ref count to the trace_array structure and prevent removal
of instances that have open descriptors.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the per_cpu directory to the created tracing instances:
cd /sys/kernel/debug/tracing/instances
mkdir foo
ls foo/per_cpu/cpu0
buffer_size_kb snapshot_raw trace trace_pipe_raw
snapshot stats trace_pipe
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the "snapshot" file to the the multi-buffer instances.
cd /sys/kernel/debug/tracing/instances
mkdir foo
ls foo
buffer_size_kb buffer_total_size_kb events free_buffer set_event
snapshot trace trace_clock trace_marker trace_options trace_pipe
tracing_on
cat foo/snapshot
# tracer: nop
#
#
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a bit of duplicate code in creating the trace buffers for
the normal trace buffer and the max trace buffer among the instances
and the main global_trace. This code can be consolidated and cleaned
up a bit making the code cleaner and more readable as well as less
duplication.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The snapshot buffer belongs to the trace array not the tracer that is
running. The trace array should be the data structure that keeps track
of whether or not the snapshot buffer is allocated, not the tracer
desciptor. Having the trace array keep track of it makes modifications
so much easier.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a 'snapshot_raw' per_cpu file that allows tools to read the raw
binary data of the snapshot buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the preempt or irq latency tracers are enabled, they require
the ring buffer to be able to swap the per cpu sub buffers between
two main buffers. This adds a slight overhead to tracing as the
trace recording needs to perform some checks to synchronize
between recording and swaps that might be happening on other CPUs.
The config RING_BUFFER_ALLOW_SWAP is set when a user of the ring
buffer needs the "swap cpu" feature, otherwise the extra checks
are not implemented and removed from the tracing overhead.
The snapshot feature will swap per CPU if the RING_BUFFER_ALLOW_SWAP
config is set. But that only gets set by things like OPROFILE
and the irqs and preempt latency tracers.
This config is added to let the user decide to include this feature
with the snapshot agnostic from whether or not another user of
the ring buffer sets this config.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the snapshot file into the per_cpu tracing directories to allow
them to be read for an individual cpu. This also allows to clear
an individual cpu from the snapshot buffer.
If the kernel allows it (CONFIG_RING_BUFFER_ALLOW_SWAP is set), then
echoing in '1' into one of the per_cpu snapshot files will do an
individual cpu buffer swap instead of the entire file.
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the way the latency tracers and snapshot feature works
is to have a separate trace_array called "max_tr" that holds the
snapshot buffer. For latency tracers, this snapshot buffer is used
to swap the running buffer with this buffer to save the current max
latency.
The only items needed for the max_tr is really just a copy of the buffer
itself, the per_cpu data pointers, the time_start timestamp that states
when the max latency was triggered, and the cpu that the max latency
was triggered on. All other fields in trace_array are unused by the
max_tr, making the max_tr mostly bloat.
This change removes the max_tr completely, and adds a new structure
called trace_buffer, that holds the buffer pointer, the per_cpu data
pointers, the time_start timestamp, and the cpu where the latency occurred.
The trace_array, now has two trace_buffers, one for the normal trace and
one for the max trace or snapshot. By doing this, not only do we remove
the bloat from the max_trace but the instances of traces can now use
their own snapshot feature and not have just the top level global_trace have
the snapshot feature and latency tracers for itself.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The snapshot utility is extremely useful, and does not add any more
overhead in memory when another latency tracer is enabled. They use
the snapshot underneath. There's no reason to hide the snapshot file
when a latency tracer has been enabled in the kernel.
If any of the latency tracers (irq, preempt or wakeup) is enabled
then also select the snapshot facility.
Note, snapshot can be enabled without the latency tracers enabled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently we do not know what buffer a module event was enabled in.
On unload, it is safest to clear all buffer instances, not just the
top level buffer.
Todo: Clear only the buffer that the event was used in. The
infrastructure is there to do this, but it makes the code a bit
more complex. Lets get the current code vetted before we add that.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, when a module with events is unloaded, the trace buffer is
cleared. This is just a safety net in case the module might have some
strange callback when its event is outputted. But there's no reason
to reset the buffer if the module didn't have any of its events traced.
Add a flag to the event "call" structure called WAS_ENABLED and gets set
when the event is ever enabled, and this flag never gets cleared. When a
module gets unloaded, if any of its events have this flag set, then the
trace buffer will get cleared.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The move of blocked readers to the ring buffer left out the
init of the wait queue that is used. Tests missed this due to running
stress tests against the buffers, which didn't allow for any
readers to end up waiting. Running a simple read and wait triggered
a bug.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As we've added __init annotation to field-defining functions, we should
add __refdata annotation to event_call variables, which reference those
functions.
Link: http://lkml.kernel.org/r/51343C1F.2050502@huawei.com
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new multi-buffers added a descriptor that kept track of module
events, and the directories they use, with struct ftace_module_file_ops.
This is used to add a ref count to keep modules from unloading while
their files are being accessed.
As the descriptor is only needed when CONFIG_MODULES is enabled, it
is only declared when the config is enabled. But that struct is
dereferenced in a few areas outside the #ifdef CONFIG_MODULES.
By adding some helper routines and moving code around a little,
events can be compiled again without modules.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the conversion of the data array to per cpu, sparse now complains
about the use of per_cpu_ptr() on the variable. But The variable is
allocated with alloc_percpu() and is fine to use. But since the structure
that contains the data variable does not annotate it as such, sparse
gives out a lot of false warnings.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
These two functions are called during kernel boot only.
Link: http://lkml.kernel.org/r/51258796.7020704@huawei.com
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move duplicate code in event print functions to a helper function.
This shrinks the size of the kernel by ~13K.
text data bss dec hex filename
6596137 1743966 10138672 18478775 119f6b7 vmlinux.o.old
6583002 1743849 10138672 18465523 119c2f3 vmlinux.o.new
Link: http://lkml.kernel.org/r/51258746.2060304@huawei.com
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move the logic to wake up on ring buffer data into the ring buffer
code itself. This simplifies the tracing code a lot and also has the
added benefit that waiters on one of the instance buffers can be woken
only when data is added to that instance instead of data added to
any instance.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the ring buffer is empty, a read to trace_pipe_raw wont block.
The tracing code has the infrastructure to wake up waiting readers,
but the trace_pipe_raw doesn't take advantage of that.
When a read is done to trace_pipe_raw without the O_NONBLOCK flag
set, have the read block until there's data in the requested buffer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_pipe_raw never implemented polling and this was casing
issues for several utilities. This is now implemented.
Blocked reads still are on the TODO list.
Reported-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Tested-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently only the splice NONBLOCK flag is checked to determine if
the splice read should block or not. But the file descriptor NONBLOCK
flag also needs to be checked.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The names used to display the field and type in the event format
files are copied, as well as the system name that is displayed.
All these names are created by constant values passed in.
If one of theses values were to be removed by a module, the module
would also be required to remove any event it created.
By using the strings directly, we can save over 100K of memory.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The event structures used by the trace events are mostly persistent,
but they are also allocated by kmalloc, which is not the best at
allocating space for what is used. By converting these kmallocs
into kmem_cache_allocs, we can save over 50K of space that is
permanently allocated.
After boot we have:
slab name active allocated size
--------- ------ --------- ----
ftrace_event_file 979 1005 56 67 1
ftrace_event_field 2301 2310 48 77 1
The ftrace_event_file has at boot up 979 active objects out of
1005 allocated in the slabs. Each object is 56 bytes. In a normal
kmalloc, that would allocate 64 bytes for each object.
1005 - 979 = 26 objects not used
26 * 56 = 1456 bytes wasted
But if we used kmalloc:
64 - 56 = 8 bytes unused per allocation
8 * 979 = 7832 bytes wasted
7832 - 1456 = 6376 bytes in savings
Doing the same for ftrace_event_field where there's 2301 objects
allocated in a slab that can hold 2310 with 48 bytes each we have:
2310 - 2301 = 9 objects not used
9 * 48 = 432 bytes wasted
A kmalloc would also use 64 bytes per object:
64 - 48 = 16 bytes unused per allocation
16 * 2301 = 36816 bytes wasted!
36816 - 432 = 36384 bytes in savings
This change gives us a total of 42760 bytes in savings. At least
on my machine, but as there's a lot of these persistent objects
for all configurations that use trace points, this is a net win.
Thanks to Ezequiel Garcia for his trace_analyze presentation which
pointed out the wasted space in my code.
Cc: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the new descriptors used to allow multiple buffers in the
tracing directory added, the kernel command line parameter
trace_events=... no longer works. This is because the top level
(global) trace array now has a list of descriptors associated
with the events and the files in the debugfs directory. But in
early bootup, when the command line is processed and the events
enabled, the trace array list of events has not been set up yet.
Without the list of events in the trace array, the setting of
events to record will fail because it would not match any events.
The solution is to set up the top level array in two stages.
The first is to just add the ftrace file descriptors that just point
to the events. This will allow events to be enabled and start tracing.
The second stage is called after the filesystem is set up, and this
stage will create the debugfs event files and directories associated
with the trace array events.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a method to the hijacked dentry descriptor of the
"instances" directory to allow for rmdir to remove an
instance of a multibuffer.
Example:
cd /debug/tracing/instances
mkdir hello
ls
hello/
rmdir hello
ls
Like the mkdir method, the i_mutex is dropped for the instances
directory. The instances directory is created at boot up and can
not be renamed or removed. The trace_types_lock mutex is used to
synchronize adding and removing of instances.
I've run several stress tests with different threads trying to
create and delete directories of the same name, and it has stood
up fine.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the interface ("instances" directory) to add multiple buffers
to ftrace. To create a new instance, simply do a mkdir in the
instances directory:
This will create a directory with the following:
# cd instances
# mkdir foo
# ls foo
buffer_size_kb free_buffer trace_clock trace_pipe
buffer_total_size_kb set_event trace_marker tracing_enabled
events/ trace trace_options tracing_on
Currently only events are able to be set, and there isn't a way
to delete a buffer when one is created (yet).
Note, the i_mutex lock is dropped from the parent "instances"
directory during the mkdir operation. As the "instances" directory
can not be renamed or deleted (created on boot), I do not see
any harm in dropping the lock. The creation of the sub directories
is protected by trace_types_lock mutex, which only lets one
instance get into the code path at a time. If two tasks try to
create or delete directories of the same name, only one will occur
and the other will fail with -EEXIST.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the syscall events record into the global buffer. But if
multiple buffers are in place, then we need to have syscall events
record in the proper buffers.
By adding descriptors to pass to the syscall event functions, the
syscall events can now record into the buffers that have been assigned
to them (one event may be applied to mulitple buffers).
This will allow tracing high volume syscalls along with seldom occurring
syscalls without losing the seldom syscall events.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The global and max-tr currently use static per_cpu arrays for the CPU data
descriptors. But in order to get new allocated trace_arrays, they need to
be allocated per_cpu arrays. Instead of using the static arrays, switch
the global and max-tr to use allocated data.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pass the struct ftrace_event_file *ftrace_file to the
trace_event_buffer_lock_reserve() (new function that replaces the
trace_current_buffer_lock_reserver()).
The ftrace_file holds a pointer to the trace_array that is in use.
In the case of multiple buffers with different trace_arrays, this
allows different events to be recorded into different buffers.
Also fixed some of the stale comments in include/trace/ftrace.h
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The global_trace variable in kernel/trace/trace.c has been kept 'static' and
local to that file so that it would not be used too much outside of that
file. This has paid off, even though there were lots of changes to make
the trace_array structure more generic (not depending on global_trace).
Removal of a lot of direct usages of global_trace is needed to be able to
create more trace_arrays such that we can add multiple buffers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both RING_BUFFER_ALL_CPUS and TRACE_PIPE_ALL_CPU are defined as
-1 and used to say that all the ring buffers are to be modified
or read (instead of just a single cpu, which would be >= 0).
There's no reason to keep TRACE_PIPE_ALL_CPU as it is also started
to be used for more than what it was created for, and now that
the ring buffer code added a generic RING_BUFFER_ALL_CPUS define,
we can clean up the trace code to use that instead and remove
the TRACE_PIPE_ALL_CPU macro.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace events for ftrace are all defined via global variables.
The arrays of events and event systems are linked to a global list.
This prevents multiple users of the event system (what to enable and
what not to).
By adding descriptors to represent the event/file relation, as well
as to which trace_array descriptor they are associated with, allows
for more than one set of events to be defined. Once the trace events
files have a link between the trace event and the trace_array they
are associated with, we can create multiple trace_arrays that can
record separate events in separate buffers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The latency tracers require the buffers to be in overwrite mode,
otherwise they get screwed up. Force the buffers to stay in overwrite
mode when latency tracers are enabled.
Added a flag_changed() method to the tracer structure to allow
the tracers to see what flags are being changed, and also be able
to prevent the change from happing.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Changing the overwrite mode for the ring buffer via the trace
option only sets the normal buffer. But the snapshot buffer could
swap with it, and then the snapshot would be in non overwrite mode
and the normal buffer would be in overwrite mode, even though the
option flag states otherwise.
Keep the two buffers overwrite modes in sync.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Seems that the tracer flags have never been protected from
synchronous writes. Luckily, admins don't usually modify the
tracing flags via two different tasks. But if scripts were to
be used to modify them, then they could get corrupted.
Move the trace_types_lock that protects against tracers changing
to also protect the flags being set.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Because function tracing is very invasive, and can even trace
calls to rcu_read_lock(), RCU access in function tracing is done
with preempt_disable_notrace(). This requires a synchronize_sched()
for updates and not a synchronize_rcu().
Function probes (traceon, traceoff, etc) must be freed after
a synchronize_sched() after its entry has been removed from the
hash. But call_rcu() is used. Fix this by using call_rcu_sched().
Also fix the usage to use hlist_del_rcu() instead of hlist_del().
Cc: stable@vger.kernel.org
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Although the swap is wrapped with a spin_lock, the assignment
of the temp buffer used to swap is not within that lock.
It needs to be moved into that lock, otherwise two swaps
happening on two different CPUs, can end up using the wrong
temp buffer to assign in the swap.
Luckily, all current callers of the swap function appear to have
their own locks. But in case something is added that allows two
different callers to call the swap, then there's a chance that
this race can trigger and corrupt the buffers.
New code is coming soon that will allow for this race to trigger.
I've Cc'd stable, so this bug will not show up if someone backports
one of the changes that can trigger this bug.
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"Misc minor fixes mostly related to tracing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
s390: Fix a header dependencies related build error
tracing: update documentation of snapshot utility
tracing: Do not return EINVAL in snapshot when not allocated
tracing: Add help of snapshot feature when snapshot is empty
ftrace: Update the kconfig for DYNAMIC_FTRACE
To use the tracing snapshot feature, writing a '1' into the snapshot
file causes the snapshot buffer to be allocated if it has not already
been allocated and dose a 'swap' with the main buffer, so that the
snapshot now contains what was in the main buffer, and the main buffer
now writes to what was the snapshot buffer.
To free the snapshot buffer, a '0' is written into the snapshot file.
To clear the snapshot buffer, any number but a '0' or '1' is written
into the snapshot file. But if the file is not allocated it returns
-EINVAL error code. This is rather pointless. It is better just to
do nothing and return success.
Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When cat'ing the snapshot file, instead of showing an empty trace
header like the trace file does, show how to use the snapshot
feature.
Also, this is a good place to show if the snapshot has been allocated
or not. Users may want to "pre allocate" the snapshot to have a fast
"swap" of the current buffer. Otherwise, a swap would be slow and might
fail as it would need to allocate the snapshot buffer, and that might
fail under tight memory constraints.
Here's what it looked like before:
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
Here's what it looks like now:
# tracer: nop
#
#
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
Acked-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This adds core architecture support for Imagination's Meta processor
cores, followed by some later miscellaneous arch/metag cleanups and
fixes which I kept separate to ease review:
- Support for basic Meta 1 (ATP) and Meta 2 (HTP) core architecture
- A few fixes all over, particularly for symbol prefixes
- A few privilege protection fixes
- Several cleanups (setup.c includes, split out a lot of metag_ksyms.c)
- Fix some missing exports
- Convert hugetlb to use vm_unmapped_area()
- Copy device tree to non-init memory
- Provide dma_get_sgtable()
Signed-off-by: James Hogan <james.hogan@imgtec.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRMmVXAAoJEKHZs+irPybfivgP/inEXqJyfw59omQdjwvYcU/a
/u0MJ3UKSNS3U+HknfaFCy/Nwk1dqPLjqqyVC1V6AbUPBXlaEwGcimlNRx2uRjdq
Uh96upMLHsNuF/xiiR477g3RwY0egIJdM1R1bGi3mZ3vVrNQGF+wbni6f61xCWGz
M/4rDglpQvE79oLhYdgj6tidZtHQT0YWtERA9W90zkQWXGYmpFPKBKbfZAi5+rKQ
U6Gpg26orUugzXNaxltJEYKE8gjLTppEabx8DARnItZ4zCMy4dw5RBJ35RFvQw6e
eSmfgTy9w9WqBMY2+QMSgU0KQt1IITCzX7OlOXC0jALQJXoU0WWbOELlBVQLCwF1
T0OcR/5ZP/hIlOk5Dh+e9U3AtbASXdMtqA0ZUe78woH1CBf7Nc/0c0vRg23EdMh8
lnHDJxT/UqskoOYLI4kgWbEdLDy4uTh19U2pVi7VCo7ksLB9Bj9Xc8VSKgscSXTl
OwzN+c4Jgtu8FDFTp+Af4AT8pYGJ08j8L2ErsV2sOv3Q44U5WXdrMz3GSgwXj8+4
wZk3HvdkQVkMD5sJCUZgAswaN6BnbB0pHdCz4wMQ8jR/Ogs015Ipk64Ecym9S/4n
uES7PnDtt/4lb5EyX2ScbvdnZTAFTaaP7OOhC77BOQvbQjIW1tkAcxWJqRry86uS
iM0BFgK6Ohx3geqa5Ft0
=65cR
-----END PGP SIGNATURE-----
Merge tag 'metag-v3.9-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag
Pull new ImgTec Meta architecture from James Hogan:
"This adds core architecture support for Imagination's Meta processor
cores, followed by some later miscellaneous arch/metag cleanups and
fixes which I kept separate to ease review:
- Support for basic Meta 1 (ATP) and Meta 2 (HTP) core architecture
- A few fixes all over, particularly for symbol prefixes
- A few privilege protection fixes
- Several cleanups (setup.c includes, split out a lot of
metag_ksyms.c)
- Fix some missing exports
- Convert hugetlb to use vm_unmapped_area()
- Copy device tree to non-init memory
- Provide dma_get_sgtable()"
* tag 'metag-v3.9-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag: (61 commits)
metag: Provide dma_get_sgtable()
metag: prom.h: remove declaration of metag_dt_memblock_reserve()
metag: copy devicetree to non-init memory
metag: cleanup metag_ksyms.c includes
metag: move mm/init.c exports out of metag_ksyms.c
metag: move usercopy.c exports out of metag_ksyms.c
metag: move setup.c exports out of metag_ksyms.c
metag: move kick.c exports out of metag_ksyms.c
metag: move traps.c exports out of metag_ksyms.c
metag: move irq enable out of irqflags.h on SMP
genksyms: fix metag symbol prefix on crc symbols
metag: hugetlb: convert to vm_unmapped_area()
metag: export clear_page and copy_page
metag: export metag_code_cache_flush_all
metag: protect more non-MMU memory regions
metag: make TXPRIVEXT bits explicit
metag: kernel/setup.c: sort includes
perf: Enable building perf tools for Meta
metag: add boot time LNKGET/LNKSET check
metag: add __init to metag_cache_probe()
...
Some 32 bit architectures require 64 bit values to be aligned (for
example Meta which has 64 bit read/write instructions). These require 8
byte alignment of event data too, so use
!CONFIG_HAVE_64BIT_ALIGNED_ACCESS instead of !CONFIG_64BIT ||
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to decide alignment, and align
buffer_data_page::data accordingly.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org> (previous version subtly different)
Pull block IO core bits from Jens Axboe:
"Below are the core block IO bits for 3.9. It was delayed a few days
since my workstation kept crashing every 2-8h after pulling it into
current -git, but turns out it is a bug in the new pstate code (divide
by zero, will report separately). In any case, it contains:
- The big cfq/blkcg update from Tejun and and Vivek.
- Additional block and writeback tracepoints from Tejun.
- Improvement of the should sort (based on queues) logic in the plug
flushing.
- _io() variants of the wait_for_completion() interface, using
io_schedule() instead of schedule() to contribute to io wait
properly.
- Various little fixes.
You'll get two trivial merge conflicts, which should be easy enough to
fix up"
Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42: "hlist: drop the node parameter from iterators").
* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
block: remove redundant check to bd_openers()
block: use i_size_write() in bd_set_size()
cfq: fix lock imbalance with failed allocations
drivers/block/swim3.c: fix null pointer dereference
block: don't select PERCPU_RWSEM
block: account iowait time when waiting for completion of IO request
sched: add wait_for_completion_io[_timeout]
writeback: add more tracepoints
block: add block_{touch|dirty}_buffer tracepoint
buffer: make touch_buffer() an exported function
block: add @req to bio_{front|back}_merge tracepoints
block: add missing block_bio_complete() tracepoint
block: Remove should_sort judgement when flush blk_plug
block,elevator: use new hashtable implementation
cfq-iosched: add hierarchical cfq_group statistics
cfq-iosched: collect stats from dead cfqgs
cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
block: RCU free request_queue
blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
...
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The prompt to enable DYNAMIC_FTRACE (the ability to nop and
enable function tracing at run time) had a confusing statement:
"enable/disable ftrace tracepoints dynamically"
This was written before tracepoints were added to the kernel,
but now that tracepoints have been added, this is very confusing
and has confused people enough to give wrong information during
presentations.
Not only that, I looked at the help text, and it still references
that dreaded daemon that use to wake up once a second to update
the nop locations and brick NICs, that hasn't been around for over
five years.
Time to bring the text up to the current decade.
Cc: stable@vger.kernel.org
Reported-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
Toshi Kani, and Yinghai Lu.
- ACPI power resources handling and ACPI device PM update from
Rafael J. Wysocki.
- ACPICA update to version 20130117 from Bob Moore and Lv Zheng
with contributions from Aaron Lu, Chao Guan, Jesper Juhl, and
Tim Gardner.
- Support for Intel Lynxpoint LPSS from Mika Westerberg.
- cpuidle update from Len Brown including Intel Haswell support, C1
state for intel_idle, removal of global pm_idle.
- cpuidle fixes and cleanups from Daniel Lezcano.
- cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri
with contributions from Stratos Karafotis and Rickard Andersson.
- Intel P-states driver for Sandy Bridge processors from
Dirk Brandewie.
- cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
- cpufreq fixes related to ordering issues between acpi-cpufreq and
powernow-k8 from Borislav Petkov and Matthew Garrett.
- cpufreq support for Calxeda Highbank processors from Mark Langsdorf
and Rob Herring.
- cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
from Shawn Guo.
- cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
and Inderpal Singh.
- Support for "lightweight suspend" from Zhang Rui.
- Removal of the deprecated power trace API from Paul Gortmaker.
- Assorted updates from Andreas Fleig, Colin Ian King,
Davidlohr Bueso, Joseph Salisbury, Kees Cook, Li Fei,
Nishanth Menon, ShuoX Liu, Srinivas Pandruvada, Tejun Heo,
Thomas Renninger, and Yasuaki Ishimatsu.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJRIsArAAoJEKhOf7ml8uNsD6MP/j7C4NA+GTq6RdwoJt+Yki0K
9Ep8I4pEuRFoN/oskv24EyQhpGJIk6UxWcJ/DWFBc+1VhmKORta7k2Idv/wlJA77
s7AcDveA9xcDh+TVfbh87TeuiMSXiSdDZbiaQO+wMizWJAF3F84AnjiAqqqyQcSK
bA5/Siz/vWlt9PyYDaQtHTVE4lpvPuVcQdYewsdaH2PsmUjvIg/TUzg28CTrdyvv
eHOdBK9R0/OLQLhzRbL0VOGJ//wEl+HJRO0QEhTKPgdQ1e/VH/4Zu5WSzF8P/x4C
s2f8U4IKQqulDuDHXtpMpelFm7hRWgsOqZLkcyXLs+0dvSM9CTPO6P0ZaImxUctk
5daHWEsXUnCErDQawt1mcZP8l6qnxofMQIfLXyPVzvlSnHyToTmrtXa1v2u4AuL/
hOo4MYWsFNUmRdtGFFGlExGgEDZ4G5NwiYjRBl/6XJ3v4nhnnMbuzxP8scpoe5m1
8tjroJHZFUUs/mFU/H+oRbHzSzXPmp1sddNaTg4OpVmTn3DDh6ljnFhiItd1Ndw0
5ldVbSe6ETq5RoK0TbzvQOeVpa9F3JfqbrXLQPqfd2iz/No41LQYG1uShRYuXKuA
wfEcc+c9VMd3FILu05pGwBnU8VS9VbxTYMz7xDxg6b29Ywnb7u+Q1ycCk2gFYtkS
E2oZDuyewTJxaskzYsNr
=wijn
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
- Rework of the ACPI namespace scanning code from Rafael J. Wysocki
with contributions from Bjorn Helgaas, Jiang Liu, Mika Westerberg,
Toshi Kani, and Yinghai Lu.
- ACPI power resources handling and ACPI device PM update from Rafael
J Wysocki.
- ACPICA update to version 20130117 from Bob Moore and Lv Zheng with
contributions from Aaron Lu, Chao Guan, Jesper Juhl, and Tim Gardner.
- Support for Intel Lynxpoint LPSS from Mika Westerberg.
- cpuidle update from Len Brown including Intel Haswell support, C1
state for intel_idle, removal of global pm_idle.
- cpuidle fixes and cleanups from Daniel Lezcano.
- cpufreq fixes and cleanups from Viresh Kumar and Fabio Baltieri with
contributions from Stratos Karafotis and Rickard Andersson.
- Intel P-states driver for Sandy Bridge processors from Dirk
Brandewie.
- cpufreq driver for Marvell Kirkwood SoCs from Andrew Lunn.
- cpufreq fixes related to ordering issues between acpi-cpufreq and
powernow-k8 from Borislav Petkov and Matthew Garrett.
- cpufreq support for Calxeda Highbank processors from Mark Langsdorf
and Rob Herring.
- cpufreq driver for the Freescale i.MX6Q SoC and cpufreq-cpu0 update
from Shawn Guo.
- cpufreq Exynos fixes and cleanups from Jonghwan Choi, Sachin Kamat,
and Inderpal Singh.
- Support for "lightweight suspend" from Zhang Rui.
- Removal of the deprecated power trace API from Paul Gortmaker.
- Assorted updates from Andreas Fleig, Colin Ian King, Davidlohr Bueso,
Joseph Salisbury, Kees Cook, Li Fei, Nishanth Menon, ShuoX Liu,
Srinivas Pandruvada, Tejun Heo, Thomas Renninger, and Yasuaki
Ishimatsu.
* tag 'pm+acpi-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (267 commits)
PM idle: remove global declaration of pm_idle
unicore32 idle: delete stray pm_idle comment
openrisc idle: delete pm_idle
mn10300 idle: delete pm_idle
microblaze idle: delete pm_idle
m32r idle: delete pm_idle, and other dead idle code
ia64 idle: delete pm_idle
cris idle: delete idle and pm_idle
ARM64 idle: delete pm_idle
ARM idle: delete pm_idle
blackfin idle: delete pm_idle
sparc idle: rename pm_idle to sparc_idle
sh idle: rename global pm_idle to static sh_idle
x86 idle: rename global pm_idle to static x86_idle
APM idle: register apm_cpu_idle via cpuidle
cpufreq / intel_pstate: Add kernel command line option disable intel_pstate.
cpufreq / intel_pstate: Change to disallow module build
tools/power turbostat: display SMI count by default
intel_idle: export both C1 and C1E
ACPI / hotplug: Fix concurrency issues and memory leaks
...
Pull scheduler changes from Ingo Molnar:
"Main changes:
- scheduler side full-dynticks (user-space execution is undisturbed
and receives no timer IRQs) preparation changes that convert the
cputime accounting code to be full-dynticks ready, from Frederic
Weisbecker.
- Initial sched.h split-up changes, by Clark Williams
- select_idle_sibling() performance improvement by Mike Galbraith:
" 1 tbench pair (worst case) in a 10 core + SMT package:
pre 15.22 MB/sec 1 procs
post 252.01 MB/sec 1 procs "
- sched_rr_get_interval() ABI fix/change. We think this detail is not
used by apps (so it's not an ABI in practice), but lets keep it
under observation.
- misc RT scheduling cleanups, optimizations"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h>
cputime: Remove irqsave from seqlock readers
sched, powerpc: Fix sched.h split-up build failure
cputime: Restore CPU_ACCOUNTING config defaults for PPC64
sched/rt: Move rt specific bits into new header file
sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
sched: Move sched.h sysctl bits into separate header
sched: Fix signedness bug in yield_to()
sched: Fix select_idle_sibling() bouncing cow syndrome
sched/rt: Further simplify pick_rt_task()
sched/rt: Do not account zero delta_exec in update_curr_rt()
cputime: Safely read cputime of full dynticks CPUs
kvm: Prepare to add generic guest entry/exit callbacks
cputime: Use accessors to read task cputime stats
cputime: Allow dynamic switch between tick/virtual based cputime accounting
cputime: Generic on-demand virtual cputime accounting
cputime: Move default nsecs_to_cputime() to jiffies based cputime file
cputime: Librarize per nsecs resolution cputime definitions
cputime: Avoid multiplication overflow on utime scaling
context_tracking: Export context state for generic vtime
...
Fix up conflict in kernel/context_tracking.c due to comment additions.
Pull perf changes from Ingo Molnar:
"There are lots of improvements, the biggest changes are:
Main kernel side changes:
- Improve uprobes performance by adding 'pre-filtering' support, by
Oleg Nesterov.
- Make some POWER7 events available in sysfs, equivalent to what was
done on x86, from Sukadev Bhattiprolu.
- tracing updates by Steve Rostedt - mostly misc fixes and smaller
improvements.
- Use perf/event tracing to report PCI Express advanced errors, by
Tony Luck.
- Enable northbridge performance counters on AMD family 15h, by Jacob
Shin.
- This tracing commit:
tracing: Remove the extra 4 bytes of padding in events
changes the ABI. All involved parties (PowerTop in particular)
seem to agree that it's safe to do now with the introduction of
libtraceevent, but the devil is in the details ...
Main tooling side changes:
- Add 'event group view', from Namyung Kim:
To use it, 'perf record' should group events when recording. And
then perf report parses the saved group relation from file header
and prints them together if --group option is provided. You can
use the 'perf evlist' command to see event group information:
$ perf record -e '{ref-cycles,cycles}' noploop 1
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.385 MB perf.data (~16807 samples) ]
$ perf evlist --group
{ref-cycles,cycles}
With this example, default perf report will show you each event
separately.
You can use --group option to enable event group view:
$ perf report --group
...
# group: {ref-cycles,cycles}
# ========
# Samples: 7K of event 'anon group { ref-cycles, cycles }'
# Event count (approx.): 6876107743
#
# Overhead Command Shared Object Symbol
# ................ ....... ................. ..........................
99.84% 99.76% noploop noploop [.] main
0.07% 0.00% noploop ld-2.15.so [.] strcmp
0.03% 0.00% noploop [kernel.kallsyms] [k] timerqueue_del
0.03% 0.03% noploop [kernel.kallsyms] [k] sched_clock_cpu
0.02% 0.00% noploop [kernel.kallsyms] [k] account_user_time
0.01% 0.00% noploop [kernel.kallsyms] [k] __alloc_pages_nodemask
0.00% 0.00% noploop [kernel.kallsyms] [k] native_write_msr_safe
0.00% 0.11% noploop [kernel.kallsyms] [k] _raw_spin_lock
0.00% 0.06% noploop [kernel.kallsyms] [k] find_get_page
0.00% 0.02% noploop [kernel.kallsyms] [k] rcu_check_callbacks
0.00% 0.02% noploop [kernel.kallsyms] [k] __current_kernel_time
As you can see the Overhead column now contains both of ref-cycles
and cycles and header line shows group information also - 'anon
group { ref-cycles, cycles }'. The output is sorted by period of
group leader first.
- Initial GTK+ annotate browser, from Namhyung Kim.
- Add option for runtime switching perf data file in perf report,
just press 's' and a menu with the valid files found in the current
directory will be presented, from Feng Tang.
- Add support to display whole group data for raw columns, from Jiri
Olsa.
- Add per processor socket count aggregation in perf stat, from
Stephane Eranian.
- Add interval printing in 'perf stat', from Stephane Eranian.
- 'perf test' improvements
- Add support for wildcards in tracepoint system name, from Jiri
Olsa.
- Add anonymous huge page recognition, from Joshua Zhu.
- perf build-id cache now can show DSOs present in a perf.data file
that are not in the cache, to integrate with build-id servers being
put in place by organizations such as Fedora.
- perf top now shares more of the evsel config/creation routines with
'record', paving the way for further integration like 'top'
snapshots, etc.
- perf top now supports DWARF callchains.
- Fix mmap limitations on 32-bit, fix from David Miller.
- 'perf bench numa mem' NUMA performance measurement suite
- ... and lots of fixes, performance improvements, cleanups and other
improvements I failed to list - see the shortlog and git log for
details."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (270 commits)
perf/x86/amd: Enable northbridge performance counters on AMD family 15h
perf/hwbp: Fix cleanup in case of kzalloc failure
perf tools: Fix build with bison 2.3 and older.
perf tools: Limit unwind support to x86 archs
perf annotate: Make it to be able to skip unannotatable symbols
perf gtk/annotate: Fail early if it can't annotate
perf gtk/annotate: Show source lines with gray color
perf gtk/annotate: Support multiple event annotation
perf ui/gtk: Implement basic GTK2 annotation browser
perf annotate: Fix warning message on a missing vmlinux
perf buildid-cache: Add --update option
uprobes/perf: Avoid uprobe_apply() whenever possible
uprobes/perf: Teach trace_uprobe/perf code to use UPROBE_HANDLER_REMOVE
uprobes/perf: Teach trace_uprobe/perf code to pre-filter
uprobes/perf: Teach trace_uprobe/perf code to track the active perf_event's
uprobes: Introduce uprobe_apply()
perf: Introduce hw_perf_event->tp_target and ->tp_list
uprobes/perf: Always increment trace_uprobe->nhit
uprobes/tracing: Kill uprobe_trace_consumer, embed uprobe_consumer into trace_uprobe
uprobes/tracing: Introduce is_trace_uprobe_enabled()
...
Commit: c1bf08ac "ftrace: Be first to run code modification on modules"
changed ftrace module notifier's priority to INT_MAX in order to
process the ftrace nops before anything else could touch them
(namely kprobes). This was the correct thing to do.
Unfortunately, the ftrace module notifier also contains the ftrace
clean up code. As opposed to the set up code, this code should be
run *after* all the module notifiers have run in case a module is doing
correct clean-up and unregisters its ftrace hooks. Basically, ftrace
needs to do clean up on module removal, as it needs to know about code
being removed so that it doesn't try to modify that code. But after it
removes the module from its records, if a ftrace user tries to remove
a probe, that removal will fail due as the record of that code segment
no longer exists.
Nothing really bad happens if the probe removal is called after ftrace
did the clean up, but the ftrace removal function will return an error.
Correct code (such as kprobes) will produce a WARN_ON() if it fails
to remove the probe. As people get annoyed by frivolous warnings, it's
best to do the ftrace clean up after everything else.
By splitting the ftrace_module_notifier into two notifiers, one that
does the module load setup that is run at high priority, and the other
that is called for module clean up that is run at low priority, the
problem is solved.
Cc: stable@vger.kernel.org
Reported-by: Frank Ch. Eigler <fche@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing of ia32 compat system calls has been a bit of a pain as they
use different system call numbers than the 64bit equivalents.
I wrote a simple 'lls' program that lists files. I compiled it as a i686
ELF binary and ran it under a x86_64 box. This is the result:
echo 0 > /debug/tracing/tracing_on
echo 1 > /debug/tracing/events/syscalls/enable
echo 1 > /debug/tracing/tracing_on ; ./lls ; echo 0 > /debug/tracing/tracing_on
grep lls /debug/tracing/trace
[.. skipping calls before TS_COMPAT is set ...]
lls-1127 [005] d... 936.409188: sys_recvfrom(fd: 0, ubuf: 4d560fc4, size: 0, flags: 8048034, addr: 8, addr_len: f7700420)
lls-1127 [005] d... 936.409190: sys_recvfrom -> 0x8a77000
lls-1127 [005] d... 936.409211: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22)
lls-1127 [005] d... 936.409215: sys_lgetxattr -> 0xf76ff000
lls-1127 [005] d... 936.409223: sys_dup2(oldfd: 4d55ae9b, newfd: 4)
lls-1127 [005] d... 936.409228: sys_dup2 -> 0xfffffffffffffffe
lls-1127 [005] d... 936.409236: sys_newfstat(fd: 4d55b085, statbuf: 80000)
lls-1127 [005] d... 936.409242: sys_newfstat -> 0x3
lls-1127 [005] d... 936.409243: sys_removexattr(pathname: 3, name: ffcd0060)
lls-1127 [005] d... 936.409244: sys_removexattr -> 0x0
lls-1127 [005] d... 936.409245: sys_lgetxattr(pathname: 0, name: 19614, value: 1, size: 2)
lls-1127 [005] d... 936.409248: sys_lgetxattr -> 0xf76e5000
lls-1127 [005] d... 936.409248: sys_newlstat(filename: 3, statbuf: 19614)
lls-1127 [005] d... 936.409249: sys_newlstat -> 0x0
lls-1127 [005] d... 936.409262: sys_newfstat(fd: f76fb588, statbuf: 80000)
lls-1127 [005] d... 936.409279: sys_newfstat -> 0x3
lls-1127 [005] d... 936.409279: sys_close(fd: 3)
lls-1127 [005] d... 936.421550: sys_close -> 0x200
lls-1127 [005] d... 936.421558: sys_removexattr(pathname: 3, name: ffcd00d0)
lls-1127 [005] d... 936.421560: sys_removexattr -> 0x0
lls-1127 [005] d... 936.421569: sys_lgetxattr(pathname: 4d564000, name: 1b1abc, value: 5, size: 802)
lls-1127 [005] d... 936.421574: sys_lgetxattr -> 0x4d564000
lls-1127 [005] d... 936.421575: sys_capget(header: 4d70f000, dataptr: 1000)
lls-1127 [005] d... 936.421580: sys_capget -> 0x0
lls-1127 [005] d... 936.421580: sys_lgetxattr(pathname: 4d710000, name: 3000, value: 3, size: 812)
lls-1127 [005] d... 936.421589: sys_lgetxattr -> 0x4d710000
lls-1127 [005] d... 936.426130: sys_lgetxattr(pathname: 4d713000, name: 2abc, value: 3, size: 32)
lls-1127 [005] d... 936.426141: sys_lgetxattr -> 0x4d713000
lls-1127 [005] d... 936.426145: sys_newlstat(filename: 3, statbuf: f76ff3f0)
lls-1127 [005] d... 936.426146: sys_newlstat -> 0x0
lls-1127 [005] d... 936.431748: sys_lgetxattr(pathname: 0, name: 1000, value: 3, size: 22)
Obviously I'm not calling newfstat with a fd of 4d55b085. The calls are
obviously incorrect, and confusing.
Other efforts have been made to fix this:
https://lkml.org/lkml/2012/3/26/367
But the real solution is to rewrite the syscall internals and come up
with a fixed solution. One that doesn't require all the kluge that the
current solution has.
Thus for now, instead of outputting incorrect data, simply ignore them.
With this patch the changes now have:
#> grep lls /debug/tracing/trace
#>
Compat system calls simply are not traced. If users need compat
syscalls, then they should just use the raw syscall tracepoints.
For an architecture to make their compat syscalls ignored, it must
define ARCH_TRACE_IGNORE_COMPAT_SYSCALLS (done in asm/ftrace.h) and also
define an arch_trace_is_compat_syscall() function that will return true
if the current task should ignore tracing the syscall.
I want to stress that this change does not affect actual syscalls in any
way, shape or form. It is only used within the tracing system and
doesn't interfere with the syscall logic at all. The changes are
consolidated nicely into trace_syscalls.c and asm/ftrace.h.
I had to make one small modification to asm/thread_info.h and that was
to remove the include of asm/ftrace.h. As asm/ftrace.h required the
current_thread_info() it was causing include hell. That include was
added back in 2008 when the function graph tracer was added:
commit caf4b323 "tracing, x86: add low level support for ftrace return tracing"
It does not need to be included there.
Link: http://lkml.kernel.org/r/1360703939.21867.99.camel@gandalf.local.home
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
uprobe_perf_open/close call the costly uprobe_apply() every time,
we can avoid it if:
- "nr_systemwide != 0" is not changed.
- There is another process/thread with the same ->mm.
- copy_proccess() does inherit_event(). dup_mmap() preserves the
inserted breakpoints.
- event->attr.enable_on_exec == T, we can rely on uprobe_mmap()
called by exec/mmap paths.
- tp_target is exiting. Only _close() checks PF_EXITING, I don't
think TRACE_REG_PERF_OPEN can hit the dying task too often.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Change uprobe_trace_func() and uprobe_perf_func() to return "int". Change
uprobe_dispatcher() to return "trace_ret | perf_ret" although this is not
needed, currently TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive.
The only functional change is that uprobe_perf_func() checks the filtering
too and returns UPROBE_HANDLER_REMOVE if nobody wants to trace current.
Testing:
# perf probe -x /lib/libc.so.6 syscall
# perf record -e probe_libc:syscall -i perl -e 'fork; syscall -1 for 1..10; wait'
# perf report --show-total-period
100.00% 10 perl libc-2.8.so [.] syscall
Before this patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 20
A child process doesn't have a counter, but still it hits this breakoint
"copied" by dup_mmap().
After the patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 11
The child process hits this int3 only once and does unapply_uprobe().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Finally implement uprobe_perf_filter() which checks ->nr_systemwide or
->perf_events to figure out whether we need to insert the breakpoint.
uprobe_perf_open/close are changed to do uprobe_apply(true/false) when
the new perf event comes or goes away.
Note that currently this is very suboptimal:
- uprobe_register() called by TRACE_REG_PERF_REGISTER becomes a
heavy nop, consumer->filter() always returns F at this stage.
As it was already discussed we need uprobe_register_only() to
avoid the costly register_for_each_vma() when possible.
- uprobe_apply() is oftenly overkill. Unless "nr_systemwide != 0"
changes we need uprobe_apply_mm(), unapply_uprobe() is almost
what we need.
- uprobe_apply() can be simply avoided sometimes, see the next
changes.
Testing:
# perf probe -x /lib/libc.so.6 syscall
# perl -e 'syscall -1 while 1' &
[1] 530
# perf record -e probe_libc:syscall perl -e 'syscall -1 for 1..10; sleep 1'
# perf report --show-total-period
100.00% 10 perl libc-2.8.so [.] syscall
Before this patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 79291
A huge ->nrhit == 79291 reflects the fact that the background process
530 constantly hits this breakpoint too, even if doesn't contribute to
the output.
After the patch:
# cat /sys/kernel/debug/tracing/uprobe_profile
/lib/libc.so.6 syscall 10
This shows that only the target process was punished by int3.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Introduce "struct trace_uprobe_filter" which records the "active"
perf_event's attached to ftrace_event_call. For the start we simply
use list_head, we can optimize this later if needed. For example, we
do not really need to record an event with ->parent != NULL, we can
rely on parent->child_list. And we can certainly do some optimizations
for the case when 2 events have the same ->tp_target or tp_target->mm.
Change trace_uprobe_register() to process TRACE_REG_PERF_OPEN/CLOSE
and add/del this perf_event to the list.
We can probably avoid any locking, but lets start with the "obvioulsy
correct" trace_uprobe_filter->rwlock which protects everything.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Move tu->nhit++ from uprobe_trace_func() to uprobe_dispatcher().
->nhit counts how many time we hit the breakpoint inserted by this
uprobe, we do not want to loose this info if uprobe was enabled by
sys_perf_event_open().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
trace_uprobe->consumer and "struct uprobe_trace_consumer" add the
unnecessary indirection and complicate the code for no reason.
This patch simply embeds uprobe_consumer into "struct trace_uprobe",
all other changes only fix the compilation errors.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
probe_event_enable/disable() check tu->consumer != NULL to avoid the
wrong uprobe_register/unregister().
We are going to kill this pointer and "struct uprobe_trace_consumer",
so we add the new helper, is_trace_uprobe_enabled(), which can rely
on TP_FLAG_TRACE/TP_FLAG_PROFILE instead.
Note: the current logic doesn't look optimal, it is not clear why
TP_FLAG_TRACE/TP_FLAG_PROFILE are mutually exclusive, we will probably
change this later.
Also kill the unused TP_FLAG_UPROBE.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
probe_event_enable/disable() check tu->inode != NULL at the start.
This is ugly, if igrab() can fail create_trace_uprobe() should not
succeed and "postpone" the failure.
And S_ISREG(inode->i_mode) check added by d24d7dbf is not safe.
Note: alloc_uprobe() should probably check igrab() != NULL as well.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
probe_event_enable() does uprobe_register() and only after that sets
utc->tu and tu->consumer/flags. This can race with uprobe_dispatcher()
which can miss these assignments or see them out of order. Nothing
really bad can happen, but this doesn't look clean/safe.
And this does not allow to use uprobe_consumer->filter() we are going
to add, it is called by uprobe_register() and it needs utc->tu.
Change this code to initialize everything before uprobe_register(), and
reset tu->consumer/flags if it fails. We can't race with event_disable(),
the caller holds event_mutex, and if we could the code would be wrong
anyway.
In fact I think uprobe_trace_consumer should die, it buys nothing but
complicates the code. We can simply add uprobe_consumer into trace_uprobe.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
create_trace_uprobe() does kern_path() to find ->d_inode, but forgets
to do path_put(). We can do this right after igrab().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Change handle_swbp() to set regs->ip = bp_vaddr in advance, this is
what consumer->handler() needs but uprobe_get_swbp_addr() is not
exported.
This also simplifies the code and makes it more consistent across
the supported architectures. handle_swbp() becomes the only caller
of uprobe_get_swbp_addr().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
uprobe_consumer->filter() is pointless in its current form, kill it.
We will add it back, but with the different signature/semantics. Perhaps
we will even re-introduce the callsite in handler_chain(), but not to
just skip uc->handler().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Move rt scheduler definitions out of include/linux/sched.h into
new file include/linux/sched/rt.h
Signed-off-by: Clark Williams <williams@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On early boot up, when the ftrace ring buffer is initialized, the
static variable current_trace is initialized to &nop_trace.
Before this initialization, current_trace is NULL and will never
become NULL again. It is always reassigned to a ftrace tracer.
Several places check if current_trace is NULL before it uses
it, and this check is frivolous, because at the point in time
when the checks are made the only way current_trace could be
NULL is if ftrace failed its allocations at boot up, and the
paths to these locations would probably not be possible.
By initializing current_trace to &nop_trace where it is declared,
current_trace will never be NULL, and we can remove all these
checks of current_trace being NULL which never needed to be
checked in the first place.
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Ftrace has a snapshot feature available from kernel space and
latency tracers (e.g. irqsoff) are using it. This patch enables
user applictions to take a snapshot via debugfs.
Add "snapshot" debugfs file in "tracing" directory.
snapshot:
This is used to take a snapshot and to read the output of the
snapshot.
# echo 1 > snapshot
This will allocate the spare buffer for snapshot (if it is
not allocated), and take a snapshot.
# cat snapshot
This will show contents of the snapshot.
# echo 0 > snapshot
This will free the snapshot if it is allocated.
Any other positive values will clear the snapshot contents if
the snapshot is allocated, or return EINVAL if it is not allocated.
Link: http://lkml.kernel.org/r/20121226025300.3252.86850.stgit@liselsia
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
[
Fixed irqsoff selftest and also a conflict with a change
that fixes the update_max_tr.
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the trace buffer read functions use a static variable
"old_tracer" for detecting if the current tracer changes. This
was suitable for a single trace file ("trace"), but to add a
snapshot feature that will use the same function for its file,
a check against a static variable is not sufficient.
To use the output functions for two different files, instead of
storing the current tracer in a static variable, as the trace
iterator descriptor contains a pointer to the original current
tracer's name, that pointer can now be used to check if the
current tracer has changed between different reads of the trace
file.
Link: http://lkml.kernel.org/r/20121226025252.3252.9276.stgit@liselsia
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For systems with an unstable sched_clock, all cpu_clock() does is enable/
disable local irq during the call to sched_clock_cpu(). And for stable
systems they are same.
trace_clock_global() already disables interrupts, so it can call
sched_clock_cpu() directly.
Link: http://lkml.kernel.org/r/1356576585-28782-2-git-send-email-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a stat about the number of events read from the ring buffer:
# cat /debug/tracing/per_cpu/cpu0/stats
entries: 39869
overrun: 870512
commit overrun: 0
bytes: 1449912
oldest event ts: 6561.368690
now ts: 6565.246426
dropped events: 0
read events: 112 <-- Added
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While debugging the virtual cputime with the function graph tracer
with a max_depth of 1 (most common use of the max_depth so far),
I found that I was missing kernel execution because of a race condition.
The code for the return side of the function has a slight race:
ftrace_pop_return_trace(&trace, &ret, frame_pointer);
trace.rettime = trace_clock_local();
ftrace_graph_return(&trace);
barrier();
current->curr_ret_stack--;
The ftrace_pop_return_trace() initializes the trace structure for
the callback. The ftrace_graph_return() uses the trace structure
for its own use as that structure is on the stack and is local
to this function. Then the curr_ret_stack is decremented which
is what the trace.depth is set to.
If an interrupt comes in after the ftrace_graph_return() but
before the curr_ret_stack, then the called function will get
a depth of 2. If max_depth is set to 1 this function will be
ignored.
The problem is that the trace has already been called, and the
timestamp for that trace will not reflect the time the function
was about to re-enter userspace. Calls to the interrupt will not
be traced because the max_depth has prevented this.
To solve this issue, the ftrace_graph_return() can safely be
moved after the current->curr_ret_stack has been updated.
This way the timestamp for the return callback will reflect
the actual time.
If an interrupt comes in after the curr_ret_stack update and
ftrace_graph_return(), it will be traced. It may look a little
confusing to see it within the other function, but at least
it will not be lost.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
__this_cpu_inc_return() or __this_cpu_dec generates a single instruction,
which is faster than __get_cpu_var operation.
Link: http://lkml.kernel.org/r/50A9C1BD.1060308@gmail.com
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The text in Documentation said it would be removed in 2.6.41;
the text in the Kconfig said removal in the 3.1 release. Either
way you look at it, we are well past both, so push it off a cliff.
Note that the POWER_CSTATE and the POWER_PSTATE are part of the
legacy tracing API. Remove all tracepoints which use these flags.
As can be seen from context, most already have a trace entry via
trace_cpu_idle anyways.
Also, the cpufreq/cpufreq.c PSTATE one is actually unpaired, as
compared to the CSTATE ones which all have a clear start/stop.
As part of this, the trace_power_frequency also becomes orphaned,
so it too is deleted.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Dan's smatch found a compare bug with the result of the
trace_test_and_set_recursion() and comparing to less than
zero. If the function fails, it returns -1, but was saved in
an unsigned int, which will never be less than zero and will
ignore the result of the test if a recursion did happen.
Luckily this is the last of the recursion tests, as the
infrastructure of ftrace would catch recursions before it
got here, except for some few exceptions.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ring_buffer.c use to require declarations from trace.h, but
these have moved to the generic header files. There's nothing
in trace.h that ring_buffer.c requires.
There's some headers that trace.h included that ring_buffer.c
needs, but it's best that it includes them directly, and not
include trace.h.
Also, some things may use ring_buffer.c without having tracing
configured. This removes the dependency that may come in the
future.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Using context bit recursion checking, we can help increase the
performance of the ring buffer.
Before this patch:
# echo function > /debug/tracing/current_tracer
# for i in `seq 10`; do ./hackbench 50; done
Time: 10.285
Time: 10.407
Time: 10.243
Time: 10.372
Time: 10.380
Time: 10.198
Time: 10.272
Time: 10.354
Time: 10.248
Time: 10.253
(average: 10.3012)
Now we have:
# echo function > /debug/tracing/current_tracer
# for i in `seq 10`; do ./hackbench 50; done
Time: 9.712
Time: 9.824
Time: 9.861
Time: 9.827
Time: 9.962
Time: 9.905
Time: 9.886
Time: 10.088
Time: 9.861
Time: 9.834
(average: 9.876)
a 4% savings!
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracer had two different versions of function tracing.
The disabling of irqs version and the preempt disable version.
As function tracing in very intrusive and can cause nasty recursion
issues, it has its own recursion protection. But the old method to
do this was a flat layer. If it detected that a recursion was happening
then it would just return without recording.
This made the preempt version (much faster than the irq disabling one)
not very useful, because if an interrupt were to occur after the
recursion flag was set, the interrupt would not be traced at all,
because every function that was traced would think it recursed on
itself (due to the context it preempted setting the recursive flag).
Now that we have a recursion flag for every context level, we
no longer need to worry about that. We can disable preemption,
set the current context recursion check bit, and go on. If an
interrupt were to come along, it would check its own context bit
and happily continue to trace.
As the preempt version is faster than the irq disable version,
there's no more reason to keep the preempt version around.
And the irq disable version still had an issue with missing
out on tracing NMI code.
Remove the irq disable function tracer version and have the
preempt disable version be the default (and only version).
Before this patch we had from running:
# echo function > /debug/tracing/current_tracer
# for i in `seq 10`; do ./hackbench 50; done
Time: 12.028
Time: 11.945
Time: 11.925
Time: 11.964
Time: 12.002
Time: 11.910
Time: 11.944
Time: 11.929
Time: 11.941
Time: 11.924
(average: 11.9512)
Now we have:
# echo function > /debug/tracing/current_tracer
# for i in `seq 10`; do ./hackbench 50; done
Time: 10.285
Time: 10.407
Time: 10.243
Time: 10.372
Time: 10.380
Time: 10.198
Time: 10.272
Time: 10.354
Time: 10.248
Time: 10.253
(average: 10.3012)
a 13.8% savings!
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When function tracing occurs, the following steps are made:
If arch does not support a ftrace feature:
call internal function (uses INTERNAL bits) which calls...
If callback is registered to the "global" list, the list
function is called and recursion checks the GLOBAL bits.
then this function calls...
The function callback, which can use the FTRACE bits to
check for recursion.
Now if the arch does not suppport a feature, and it calls
the global list function which calls the ftrace callback
all three of these steps will do a recursion protection.
There's no reason to do one if the previous caller already
did. The recursion that we are protecting against will
go through the same steps again.
To prevent the multiple recursion checks, if a recursion
bit is set that is higher than the MAX bit of the current
check, then we know that the check was made by the previous
caller, and we can skip the current check.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently for recursion checking in the function tracer, ftrace
tests a task_struct bit to determine if the function tracer had
recursed or not. If it has, then it will will return without going
further.
But this leads to races. If an interrupt came in after the bit
was set, the functions being traced would see that bit set and
think that the function tracer recursed on itself, and would return.
Instead add a bit for each context (normal, softirq, irq and nmi).
A check of which context the task is in is made before testing the
associated bit. Now if an interrupt preempts the function tracer
after the previous context has been set, the interrupt functions
can still be traced.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is lots of places that perform:
op = rcu_dereference_raw(ftrace_control_list);
while (op != &ftrace_list_end) {
Add a helper macro to do this, and also optimize for a single
entity. That is, gcc will optimize a loop for either no iterations
or more than one iteration. But usually only a single callback
is registered to the function tracer, thus the optimized case
should be a single pass. to do this we now do:
op = rcu_dereference_raw(list);
do {
[...]
} while (likely(op = rcu_dereference_raw((op)->next)) &&
unlikely((op) != &ftrace_list_end));
An op is always registered (ftrace_list_end when no callbacks is
registered), thus when a single callback is registered, the link
list looks like:
top => callback => ftrace_list_end => NULL.
The likely(op = op->next) still must be performed due to the race
of removing the callback, where the first op assignment could
equal ftrace_list_end. In that case, the op->next would be NULL.
But this is unlikely (only happens in a race condition when
removing the callback).
But it is very likely that the next op would be ftrace_list_end,
unless more than one callback has been registered. This tells
gcc what the most common case is and makes the fast path with
the least amount of branches.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracing recursion self test should not crash
the machine if the resursion test fails. If it detects that
the function tracing is recursing when it should not be, then
bail, don't go into an infinite recursive loop.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If one of the function tracers set by the global ops is not recursion
safe, it can still be called directly without the added recursion
supplied by the ftrace infrastructure.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The test that checks function recursion does things differently
if the arch does not support all ftrace features. But that really
doesn't make a difference with how the test runs, and either way
the count variable should be 2 at the end.
Currently the test wrongly fails for archs that don't support all
the ftrace features.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a race condition between the setting of a new tracer and
the update of the max trace buffers (the swap). When a new tracer
is added, it sets current_trace to nop_trace before disabling
the old tracer. At this moment, if the old tracer uses update_max_tr(),
the update may trigger the warning against !current_trace->use_max-tr,
as nop_trace doesn't have that set.
As update_max_tr() requires that interrupts be disabled, we can
add a check to see if current_trace == nop_trace and bail if it
does. Then when disabling the current_trace, set it to nop_trace
and run synchronize_sched(). This will make sure all calls to
update_max_tr() have completed (it was called with interrupts disabled).
As a clean up, this commit also removes shrinking and recreating
the max_tr buffer if the old and new tracers both have use_max_tr set.
The old way use to always shrink the buffer, and then expand it
for the next tracer. This is a waste of time.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As trace_clock is used by other things besides tracing, and it
does not require anything from trace.h, it is best not to include
the header file in trace_clock.c.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Due to a userspace issue with PowerTop v2beta, which hardcoded
the offset of event fields that it was using, it broke when
we removed the Big Kernel Lock counter from the event header.
(commit e6e1e2593 "tracing: Remove lock_depth from event entry")
Because this broke userspace, it was determined that we must
keep those 4 bytes around.
(commit a3a4a5acd "Regression: partial revert "tracing: Remove lock_depth from event entry"")
This unfortunately wastes space in the ring buffer. 4 bytes per
event, where a lot of events are just 24 bytes. That's 16% of the
buffer wasted. A million events will add 4 megs of white space
into the buffer.
It was later noticed that PowerTop v2beta could not work on systems
where the kernel was 64 bit but the userspace was 32 bits.
The reason was because the offsets are different between the
two and the hard coded offset of one would not work with the other.
With PowerTop v2 final, it implemented the same interface that both
perf and trace-cmd use. That is, it reads the format file of
the event to find the offsets of the fields it needs. This fixes
the problem with running powertop on a 32 bit userspace running
on a 64 bit kernel. It also no longer requires the 4 byte padding.
As PowerTop v2 has been out for a while, and is included in all
major distributions, it is time that we can safely remove the
4 bytes of padding. Users of PowerTop v2beta should upgrade to
PowerTop v2 final.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move SAVE_REGS support flag into Kconfig and rename
it to CONFIG_DYNAMIC_FTRACE_WITH_REGS. This also introduces
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which indicates
the architecture depending part of ftrace has a code
that saves full registers.
On the other hand, CONFIG_DYNAMIC_FTRACE_WITH_REGS indicates
the code is enabled.
Link: http://lkml.kernel.org/r/20120928081516.3560.72534.stgit@ltc138.sdl.hitachi.co.jp
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the file max_graph_depth to the debug tracing directory that lets
the user define the depth of the function graph.
A very useful operation is to set the depth to 1. Then it traces only
the first function that is called when entering the kernel. This can
be used to determine what system operations interrupt a process.
For example, to work on NOHZ processes (single tasks running without
a timer tick), if any interrupt goes off and preempts that task, this
code will show it happening.
# cd /sys/kernel/debug/tracing
# echo 1 > max_graph_depth
# echo function_graph > current_tracer
# cat per_cpu/cpu/<cpu-of-process>/trace
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's now a check in tracing_reset_online_cpus() if the buffer is
allocated or NULL. No need to do a check before calling it with max_tr.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
max_tr->buffer could be NULL in the tracing_reset{_online_cpus}. In this
case, a NULL pointer dereference happens, so we should return immediately
from these functions.
Note, the current code does not call tracing_reset*() with max_tr when
its buffer is NULL, but future code will. This patch is needed to prevent
the future code from crashing.
Link: http://lkml.kernel.org/r/20121219070234.31200.93863.stgit@liselsia
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some functions in the syscall tracing is used only locally to
the file, but they are labeled global. Convert them to static functions.
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Without this patch, we can register a uprobe event for a directory.
Enabling such a uprobe event would anyway fail.
Example:
$ echo 'p /bin:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
However dirctories cannot be valid targets for uprobe.
Hence verify if the target is a regular file during the probe
registration.
Link: http://lkml.kernel.org/r/20130103004212.690763002@goodmis.org
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ cleaned up whitespace and removed redundant IS_DIR() check ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
typeof(&buffer) is a pointer to array of 1024 char, or char (*)[1024].
But, typeof(&buffer[0]) is a pointer to char which match the return type of get_trace_buf().
As well-known, the value of &buffer is equal to &buffer[0].
so return this_cpu_ptr(&percpu_buffer->buffer[0]) can avoid type cast.
Link: http://lkml.kernel.org/r/50A1A800.3020102@gmail.com
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The original ring-buffer code had special checks at the start
of rb_advance_iter() and instead of repeating them again at the
end of the function if a certain condition existed, I just did
a recursive call to rb_advance_iter() because the special condition
would cause rb_advance_iter() to return early (after the checks).
But as things have changed, the special checks no longer exist
and the only thing done for the special_condition is to call
rb_inc_iter() and return. Instead of doing a confusing recursive call,
just call rb_inc_iter instead.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If some other kernel subsystem has a module notifier, and adds a kprobe
to a ftrace mcount point (now that kprobes work on ftrace points),
when the ftrace notifier runs it will fail and disable ftrace, as well
as kprobes that are attached to ftrace points.
Here's the error:
WARNING: at kernel/trace/ftrace.c:1618 ftrace_bug+0x239/0x280()
Hardware name: Bochs
Modules linked in: fat(+) stap_56d28a51b3fe546293ca0700b10bcb29__8059(F) nfsv4 auth_rpcgss nfs dns_resolver fscache xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack lockd sunrpc ppdev parport_pc parport microcode virtio_net i2c_piix4 drm_kms_helper ttm drm i2c_core [last unloaded: bid_shared]
Pid: 8068, comm: modprobe Tainted: GF 3.7.0-0.rc8.git0.1.fc19.x86_64 #1
Call Trace:
[<ffffffff8105e70f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff81134106>] ? __probe_kernel_read+0x46/0x70
[<ffffffffa0180000>] ? 0xffffffffa017ffff
[<ffffffffa0180000>] ? 0xffffffffa017ffff
[<ffffffff8105e76a>] warn_slowpath_null+0x1a/0x20
[<ffffffff810fd189>] ftrace_bug+0x239/0x280
[<ffffffff810fd626>] ftrace_process_locs+0x376/0x520
[<ffffffff810fefb7>] ftrace_module_notify+0x47/0x50
[<ffffffff8163912d>] notifier_call_chain+0x4d/0x70
[<ffffffff810882f8>] __blocking_notifier_call_chain+0x58/0x80
[<ffffffff81088336>] blocking_notifier_call_chain+0x16/0x20
[<ffffffff810c2a23>] sys_init_module+0x73/0x220
[<ffffffff8163d719>] system_call_fastpath+0x16/0x1b
---[ end trace 9ef46351e53bbf80 ]---
ftrace failed to modify [<ffffffffa0180000>] init_once+0x0/0x20 [fat]
actual: cc:bb:d2:4b:e1
A kprobe was added to the init_once() function in the fat module on load.
But this happened before ftrace could have touched the code. As ftrace
didn't run yet, the kprobe system had no idea it was a ftrace point and
simply added a breakpoint to the code (0xcc in the cc:bb:d2:4b:e1).
Then when ftrace went to modify the location from a call to mcount/fentry
into a nop, it didn't see a call op, but instead it saw the breakpoint op
and not knowing what to do with it, ftrace shut itself down.
The solution is to simply give the ftrace module notifier the max priority.
This should have been done regardless, as the core code ftrace modification
also happens very early on in boot up. This makes the module modification
closer to core modification.
Link: http://lkml.kernel.org/r/20130107140333.593683061@goodmis.org
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reported-by: Frank Ch. Eigler <fche@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 0fb9656d "tracing: Make tracing_enabled be equal to tracing_on"
changes the behaviour of trace_pipe, ie. it makes trace_pipe return if
we've read something and tracing is enabled, and this means that we have
to 'cat trace_pipe' again and again while running tests.
IMO the right way is if tracing is enabled, we always block and wait for
ring buffer, or we may lose what we want since ring buffer's size is limited.
Link: http://lkml.kernel.org/r/1358132051-5410-1-git-send-email-bo.li.liu@oracle.com
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
bio_{front|back}_merge tracepoints report a bio merging into an
existing request but didn't specify which request the bio is being
merged into. Add @req to it. This makes it impossible to share the
event template with block_bio_queue - split it out.
@req isn't used or exported to userland at this point and there is no
userland visible behavior change. Later changes will make use of the
extra parameter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio completion didn't kick block_bio_complete TP. Only dm was
explicitly triggering the TP on IO completion. This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.
This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.
* Explicit trace_block_bio_complete() invocation removed from dm and
the trace point is unexported.
* @rq dropped from trace_block_bio_complete(). bios may fly around
w/o queue associated. Verifying and accessing the assocaited queue
belongs to TP probes.
* blktrace now gets both request and bio completions. Make it ignore
bio completions if request completion path is happening.
This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.
v2: With this change, block_bio_complete TP could be invoked on sg
commands which have bio's with %NULL bi_bdev. Update TP
assignment code to check whether bio->bi_bdev is %NULL before
dereferencing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Namhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 02404baf1b "tracing: Remove deprecated tracing_enabled file"
removed the tracing_enabled file as it never worked properly and
the tracing_on file should be used instead. But the tracing_on file
didn't call into the tracers start/stop routines like the
tracing_enabled file did. This caused trace-cmd to break when it
enabled the irqsoff tracer.
If you just did "echo irqsoff > current_tracer" then it would work
properly. But the tool trace-cmd disables tracing first by writing
"0" into the tracing_on file. Then it writes "irqsoff" into
current_tracer and then writes "1" into tracing_on. Unfortunately,
the above commit changed the irqsoff tracer to check the tracing_on
status instead of the tracing_enabled status. If it's disabled then
it does not start the tracer internals.
The problem is that writing "1" into tracing_on does not call the
tracers "start" routine like writing "1" into tracing_enabled did.
This makes the irqsoff tracer not start when using the trace-cmd
tool, and is a regression for userspace.
Simple fix is to have the tracing_on file call the tracers start()
method when being enabled (and the stop() method when disabled).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The latest change to allow trace options to be set on the command
line also broke the trace_options file.
The zeroing of the last byte of the option name that is echoed into
the trace_option file was removed with the consolidation of some
of the code. The compare between the option and what was written to
the trace_options file fails because the string holding the data
written doesn't terminate with a null character.
A zero needs to be added to the end of the string copied from
user space.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The rcutorture tests need to be able to trace the time of the
beginning of an RCU read-side critical section, and thus need access
to trace_clock_local(). This commit therefore adds a the needed
EXPORT_SYMBOL_GPL().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Pull minor tracing updates and fixes from Steven Rostedt:
"It seems that one of my old pull requests have slipped through.
The changes are contained to just the files that I maintain, and are
changes from others that I told I would get into this merge window.
They have already been in linux-next for several weeks, and should be
well tested."
* 'tip/perf/core-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Remove unnecessary WARN_ONCE's from tracing_buffers_splice_read
tracing: Remove unneeded checks from the stack tracer
tracing: Add a resize function to make one buffer equivalent to another buffer
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
Pull perf fixes from Ingo Molnar:
"These are late-v3.7 pending fixes for tracing."
Fix up trivial conflict in kernel/trace/ring_buffer.c: the NULL pointer
fix clashed with the change of type of the 'ret' variable.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ring-buffer: Fix race between integrity check and readers
ring-buffer: Fix NULL pointer if rb_set_head_page() fails
ftrace: Clear bits properly in reset_iter_read()
I've legally changed my name with New York State, the US Social Security
Administration, et al. This patch propagates the name change and change
in initials and login to comments in the kernel source as well.
Signed-off-by: Nadia Yvette Chambers <nyc@holomorphy.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The function rb_check_pages() was added to make sure the ring buffer's
pages were sane. This check is done when the ring buffer size is modified
as well as when the iterator is released (closing the "trace" file),
as that was considered a non fast path and a good place to do a sanity
check.
The problem is that the check does not have any locks around it.
If one process were to read the trace file, and another were to read
the raw binary file, the check could happen while the reader is reading
the file.
The issues with this is that the check requires to clear the HEAD page
before doing the full check and it restores it afterward. But readers
require the HEAD page to exist before it can read the buffer, otherwise
it gives a nasty warning and disables the buffer.
By adding the reader lock around the check, this keeps the race from
happening.
Cc: stable@vger.kernel.org # 3.6
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function rb_set_head_page() searches the list of ring buffer
pages for a the page that has the HEAD page flag set. If it does
not find it, it will do a WARN_ON(), disable the ring buffer and
return NULL, as this should never happen.
But if this bug happens to happen, not all callers of this function
can handle a NULL pointer being returned from it. That needs to be
fixed.
Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
WARN shouldn't be used as a means of communicating failure to a userspace programmer.
Link: http://lkml.kernel.org/r/20120725153908.GA25203@redhat.com
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It seems that 'ftrace_enabled' flag should not be used inside the tracer
functions. The ftrace core is using this flag for internal purposes, and
the flag wasn't meant to be used in tracers' runtime checks.
stack tracer is the only tracer that abusing the flag. So stop it from
serving as a bad example.
Also, there is a local 'stack_trace_disabled' flag in the stack tracer,
which is never updated; so it can be removed as well.
Link: http://lkml.kernel.org/r/1342637761-9655-1-git-send-email-anton.vorontsov@linaro.org
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Trace buffer size is now per-cpu, so that there are the following two
patterns in resizing of buffers.
(1) resize per-cpu buffers to same given size
(2) resize per-cpu buffers to another trace_array's buffer size
for each CPU (such as preparing the max_tr which is equivalent
to the global_trace's size)
__tracing_resize_ring_buffer() can be used for (1), and had
implemented (2) inside it for resetting the global_trace to the
original size.
(2) was also implemented in another place. So this patch assembles
them in a new function - resize_buffer_duplicate_size().
Link: http://lkml.kernel.org/r/20121017025616.2627.91226.stgit@falsita
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is a typo here where '&' is used instead of '|' and it turns the
statement into a noop. The original code is equivalent to:
iter->flags &= ~((1 << 2) & (1 << 4));
Link: http://lkml.kernel.org/r/20120609161027.GD6488@elgon.mountain
Cc: stable@vger.kernel.org # all of them
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Show raw time stamp values for stats per cpu if you choose counter or tsc mode
for trace_clock. Although a unit of tracing time stamp is nsec in local or global mode,
the units in counter and TSC mode are tracing counter and cycles respectively.
Link: http://lkml.kernel.org/r/1352837903-32191-3-git-send-email-dhsharp@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to promote interoperability between userspace tracers and ftrace,
add a trace_clock that reports raw TSC values which will then be recorded
in the ring buffer. Userspace tracers that also record TSCs are then on
exactly the same time base as the kernel and events can be unambiguously
interlaced.
Tested: Enabled a tracepoint and the "tsc" trace_clock and saw very large
timestamp values.
v2:
Move arch-specific bits out of generic code.
v3:
Rename "x86-tsc", cleanups
v7:
Generic arch bits in Kbuild.
Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1352837903-32191-1-git-send-email-dhsharp@google.com
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add trace_options to the kernel command line parameter to be able to
set options at early boot. For example, to enable stack dumps of
events, add the following:
trace_options=stacktrace
This along with the trace_event option, you can get not only
traces of the events but also the stack dumps with them.
Requested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Have the ring buffer commit function use the irq_work infrastructure to
wake up any waiters waiting on the ring buffer for new data. The irq_work
was created for such a purpose, where doing the actual wake up at the
time of adding data is too dangerous, as an event or function trace may
be in the midst of the work queue locks and cause deadlocks. The irq_work
will either delay the action to the next timer interrupt, or trigger an IPI
to itself forcing an interrupt to do the work (in a safe location).
With irq_work, all ring buffer commits can safely do wakeups, removing
the need for the ring buffer commit "nowake" variants, which were used
by events and function tracing. All commits can now safely use the
normal commit, and the "nowake" variants can be removed.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing_enabled file was used as a quick way to stop
tracers, and try to bring down overhead for things like
the latency tracers (irqsoff, wakeup, etc). But it didn't
work that well.
The tracing_on file was created as a really fast way to
stop recording into the ftrace ring buffer and can interact
with the kernel. That is a tracing_off() call in the kernel
can disable recording of events, and then from userspace one
could echo 1 into the tracing_on file to continue it. The
tracing_enabled function did too much to allow for this.
The tracing_on has taken over as a way to start and stop tracing
and the tracing_enabled file should not be used. But because of
its existance, it still confuses people. Over a year ago the
following commit was added:
commit 6752ab4a9c
Author: Steven Rostedt <srostedt@redhat.com>
Date: Tue Feb 8 13:54:06 2011 -0500
tracing: Deprecate tracing_enabled for tracing_on
This commit added a WARN_ON() if the tracing_enabled file's variable
was changed. After this was added, only LatencyTop complained, and
they soon fixed their tool as there was no reason that LatencyTop
should touch this file as it was using the perf ring buffers which
this file does not interact with. But since that time no one else
has complained about this WARN_ON(). Thus it is safe to assume that
this file is no longer needed. Time to get rid of it.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing_enabled file has been deprecated as it never was able
to serve its purpose well. The tracing_on file has taken over.
Instead of having code to keep tracing_enabled, have the tracing_enabled
file just set tracing_on, and remove the tracing_enabled variable.
This allows us to remove the tracing_enabled file. The reason that
the remove is in a different change set and not removed here is
in case we find some lonely userspace tool that requires the file
to exist. Then the removal patch will get reverted, but this one
will not.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function register_tracer() is only used by kernel core code,
that never needs to remove the tracer. As trace_events have become
the main way to add new tracing to the kernel, the need to
unregister a tracer has diminished. Remove the unused function
unregister_tracer(). If a need arises where we need it, then we
can always add it back.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The open function used by available_events is the same as set_event even
though it uses different seq functions. This causes a side effect of
writing into available_events clearing all events, even though
available_events is suppose to be read only.
There's no reason to keep a single function for just the open and have
both use different functions for everything else. It is a little
confusing and causes strange behavior. Just have each have their own
function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ring_buffer_oldest_event_ts() should return a value of u64 type, because
ring_buffer_per_cpu->buffer_page->buffer_data_page->time_stamp is u64 type.
Link: http://lkml.kernel.org/r/1349998076-15495-5-git-send-email-dhsharp@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Because the "tsc" clock isn't in nanoseconds, the ring buffer must be
reset when changing clocks so that incomparable timestamps don't end up
in the same trace.
Tested: Confirmed switching clocks resets the trace buffer.
Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1349998076-15495-3-git-send-email-dhsharp@google.com
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functions defined in include/trace/syscalls.h are not used directly
since struct ftrace_event_class was introduced. Remove them from the
header file and rearrange the ftrace_event_class declarations in
trace_syscalls.c.
Link: http://lkml.kernel.org/r/1339112785-21806-2-git-send-email-vnagarnaik@google.com
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove ftrace_format_syscall() declaration; it is neither defined nor
used. Also update a comment and formatting.
Link: http://lkml.kernel.org/r/1339112785-21806-1-git-send-email-vnagarnaik@google.com
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Whenever an event is registered, the comm of tasks are saved at
every task switch instead of saving them at every event. But if
an event isn't executed much, the comm cache will be filled up
by tasks that did not record the event and you lose out on the comms
that did.
Here's an example, if you enable the following events:
echo 1 > /debug/tracing/events/kvm/kvm_cr/enable
echo 1 > /debug/tracing/events/net/net_dev_xmit/enable
Note, there's no kvm running on this machine so the first event will
never be triggered, but because it is enabled, the storing of comms
will continue. If we now disable the network event:
echo 0 > /debug/tracing/events/net/net_dev_xmit/enable
and look at the trace:
cat /debug/tracing/trace
sshd-2672 [001] ..s2 375.731616: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.731617: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 375.859356: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.859357: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 375.947351: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.947352: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 376.035383: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 376.035383: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 377.563806: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=226 rc=0
sshd-2672 [001] ..s1 377.563807: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=226 rc=0
sshd-2672 [001] ..s2 377.563834: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6be0 len=114 rc=0
sshd-2672 [001] ..s1 377.563842: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6be0 len=114 rc=0
We see that process 2672 which triggered the events has the comm "sshd".
But if we run hackbench for a bit and look again:
cat /debug/tracing/trace
<...>-2672 [001] ..s2 375.731616: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.731617: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 375.859356: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.859357: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 375.947351: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.947352: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 376.035383: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 376.035383: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 377.563806: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=226 rc=0
<...>-2672 [001] ..s1 377.563807: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=226 rc=0
<...>-2672 [001] ..s2 377.563834: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6be0 len=114 rc=0
<...>-2672 [001] ..s1 377.563842: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6be0 len=114 rc=0
The stored "sshd" comm has been flushed out and we get a useless "<...>".
But by only storing comms after a trace event occurred, we can run
hackbench all day and still get the same output.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functon tracing_sched_wakeup_trace() does an open coded unlock
commit and save stack. This is what the trace_nowake_buffer_unlock_commit()
is for.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If comm recording is not enabled when trace_printk() is used then
you just get this type of output:
[ adding trace_printk("hello! %d", irq); in do_IRQ ]
<...>-2843 [001] d.h. 80.812300: do_IRQ: hello! 14
<...>-2734 [002] d.h2 80.824664: do_IRQ: hello! 14
<...>-2713 [003] d.h. 80.829971: do_IRQ: hello! 14
<...>-2814 [000] d.h. 80.833026: do_IRQ: hello! 14
By enabling the comm recorder when trace_printk is enabled:
hackbench-6715 [001] d.h. 193.233776: do_IRQ: hello! 21
sshd-2659 [001] d.h. 193.665862: do_IRQ: hello! 21
<idle>-0 [001] d.h1 193.665996: do_IRQ: hello! 21
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since tracing is not used by 99% of Linux users, even though tracing
may be configured in, it does not make sense to allocate 1.4 Megs
per CPU for the ring buffers if they are not used. Thus, on boot up
the ring buffers are set to a minimal size until something needs the
and they are expanded.
This works well for events and tracers (function, etc), but for the
asynchronous use of trace_printk() which can write to the ring buffer
at any time, does not expand the buffers.
On boot up a check is made to see if any trace_printk() is used to
see if the trace_printk() temp buffer pages should be allocated. This
same code can be used to expand the buffers as well.
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The existing 'overrun' counter is incremented when the ring
buffer wraps around, with overflow on (the default). We wanted
a way to count requests lost from the buffer filling up with
overflow off, too. I decided to add a new counter instead
of retro-fitting the existing one because it seems like a
different statistic to count conceptually, and also because
of how the code was structured.
Link: http://lkml.kernel.org/r/1310765038-26399-1-git-send-email-slavapestov@google.com
Signed-off-by: Slava Pestov <slavapestov@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
print_max and use_max_tr in struct tracer are "int" variables and
used like flags. This is wasteful, so change the type to "bool".
Link: http://lkml.kernel.org/r/20121002082710.9807.86393.stgit@falsita
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's times during debugging that it is helpful to see traces of early
boot functions. But the tracers are initialized at device_initcall()
which is quite late during the boot process. Setting the kernel command
line parameter ftrace=function will not show anything until the function
tracer is initialized. This prevents being able to trace functions before
device_initcall().
There's no reason that the tracers need to be initialized so late in the
boot process. Move them up to core_initcall() as they still need to come
after early_initcall() which initializes the tracing buffers.
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There don't have any 'r' prefix in uprobe event naming, remove it.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
With a system where, num_present_cpus < num_possible_cpus, even if all
CPUs are online, non-present CPUs don't have per_cpu buffers allocated.
If per_cpu/<cpu>/buffer_size_kb is modified for such a CPU, it can cause
a panic due to NULL dereference in ring_buffer_resize().
To fix this, resize operation is allowed only if the per-cpu buffer has
been initialized.
Link: http://lkml.kernel.org/r/1349912427-6486-1-git-send-email-vnagarnaik@google.com
Cc: stable@vger.kernel.org # 3.5+
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull virtio changes from Rusty Russell:
"New workflow: same git trees pulled by linux-next get sent straight to
Linus. Git is awkward at shuffling patches compared with quilt or mq,
but that doesn't happen often once things get into my -next branch."
* 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (24 commits)
lguest: fix occasional crash in example launcher.
virtio-blk: Disable callback in virtblk_done()
virtio_mmio: Don't attempt to create empty virtqueues
virtio_mmio: fix off by one error allocating queue
drivers/virtio/virtio_pci.c: fix error return code
virtio: don't crash when device is buggy
virtio: remove CONFIG_VIRTIO_RING
virtio: add help to CONFIG_VIRTIO option.
virtio: support reserved vqs
virtio: introduce an API to set affinity for a virtqueue
virtio-ring: move queue_index to vring_virtqueue
virtio_balloon: not EXPERIMENTAL any more.
virtio-balloon: dependency fix
virtio-blk: fix NULL checking in virtblk_alloc_req()
virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
virtio-blk: Add bio-based IO path for virtio-blk
virtio: console: fix error handling in init() function
tools: Fix pthread flag for Makefile of trace-agent used by virtio-trace
tools: Add guest trace agent as a user tool
virtio/console: Allocate scatterlist according to the current pipe size
...
and no longer use its debugfs knobs. The change slightly touches
kernel/trace directory, but it got the needed ack from Steven Rostedt:
http://lkml.org/lkml/2012/8/21/688
2. Added maintainers entry;
3. A bunch of fixes, nothing special.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJQbjoFAAoJEGgI9fZJve1bZgUP/A/ZwFGfdnochDgRhK5p7ljY
baRZpgSh2B+BxIDTEPLfVh6HbOivmYJ8WF0unD9kKTzCCS71ZUMiLB25G/bV4lnZ
fawAOhGfOLG3rXmldxf6nJllHr9JpoSmVEHypvjFNcbjYZ04zhe7jM+YsaWmBw68
eHXQkOSdfPPpKXZ2B0Eef/EoGWhORW0kTD7xFlorsxYAkksSheY0PC0nYgCFhvCZ
168y9pi4T4lucr4s44x8AJ/r/5BQ1jEQAY/A2qUE/iBfRFP4XyE1Oao4OHtVDYdU
KjVPA1VYmwkKSfnkiVFrpb/94IyrKslblgR8nX0kK3L/ccFYjQix4nd9jR+n857s
xfAuj9nfhUO6fI5qoaVSOBufxKyPp1S7X8INEAJ7WQ0c9VoMv00biK9M77ifDGZg
ll/Ecq1CADtcbOnQXf6qwGwRKmpR+qgPkIzpNXcuGMuM4AEPwtckOhCyXFr37Txk
6ZoGM8IIaBJ0yXxHkfpUA7l9ZF0gXR+qHMQCwpUS8tIMx35On+IbybEaKbniKEi1
AURgQ7ZimVYAHPi0Y0L00+EKI3IPVQJvCFH7SG+wUfLWcbEtNbTv3MAer5o3DANJ
GMnWBwNw9ClTydWKI0GMNmnWpFukWhd4OXleyl2+q4qRJi3HhNacrok3s/2r+CnT
QRg8i/0SDvxGuXazrTZT
=1HAE
-----END PGP SIGNATURE-----
Merge tag 'for-v3.7' of git://git.infradead.org/users/cbou/linux-pstore
Pull pstore changes from Anton Vorontsov:
1) We no longer ad-hoc to the function tracer "high level"
infrastructure and no longer use its debugfs knobs. The change
slightly touches kernel/trace directory, but it got the needed ack
from Steven Rostedt:
http://lkml.org/lkml/2012/8/21/688
2) Added maintainers entry;
3) A bunch of fixes, nothing special.
* tag 'for-v3.7' of git://git.infradead.org/users/cbou/linux-pstore:
pstore: Avoid recursive spinlocks in the oops_in_progress case
pstore/ftrace: Convert to its own enable/disable debugfs knob
pstore/ram: Add missing platform_device_unregister
MAINTAINERS: Add pstore maintainers
pstore/ram: Mark ramoops_pstore_write_buf() as notrace
pstore/ram: Fix printk format warning
pstore/ram: Fix possible NULL dereference
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
Use generic steal operation on pipe buffer to allow stealing
ring buffer's read page from pipe buffer.
Note that this could reduce the performance of splice on the
splice_write side operation without affinity setting.
Since the ring buffer's read pages are allocated on the
tracing-node, but the splice user does not always execute
splice write side operation on the same node. In this case,
the page will be accessed from the another node.
Thus, it is strongly recommended to assign the splicing
thread to corresponding node.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch splits trace event initialization in two stages:
* ftrace enable
* sysfs event entry creation
This allows to capture trace events from an earlier point
by using 'trace_event' kernel parameter and is important
to trace boot-up allocations.
Note that, in order to enable events at core_initcall,
it's necessary to move init_ftrace_syscalls() from
core_initcall to early_initcall.
Link: http://lkml.kernel.org/r/1347461277-25302-1-git-send-email-elezegarcia@gmail.com
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In our application, we have trace markers spread through user-space.
We have markers in GL, X, etc. These are super handy for Chrome's
about:tracing feature (Chrome + system + kernel trace view), but
can be very distracting when you're trying to debug a kernel issue.
I normally, use "grep -v tracing_mark_write" but it would be nice
if I could just temporarily disable markers all together.
Link: http://lkml.kernel.org/r/1347066739-26285-1-git-send-email-msb@chromium.org
CC: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
- When tracing capture the kuid.
- When displaying the data to user space convert the kuid into the
user namespace of the process that opened the report file.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Commit 56449f437 "tracing: make the trace clocks available generally",
in April 2009, made trace_clock available unconditionally, since
CONFIG_X86_DS used it too.
Commit faa4602e47 "x86, perf, bts, mm: Delete the never used BTS-ptrace code",
in March 2010, removed CONFIG_X86_DS, and now only CONFIG_RING_BUFFER (split
out from CONFIG_TRACING for general use) has a dependency on trace_clock. So,
only compile in trace_clock with CONFIG_RING_BUFFER or CONFIG_TRACING
enabled.
Link: http://lkml.kernel.org/r/20120903024513.GA19583@leaf
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With this patch we no longer reuse function tracer infrastructure, now
we register our own tracer back-end via a debugfs knob.
It's a bit more code, but that is the only downside. On the bright side we
have:
- Ability to make persistent_ram module removable (when needed, we can
move ftrace_ops struct into a module). Note that persistent_ram is still
not removable for other reasons, but with this patch it's just one
thing less to worry about;
- Pstore part is more isolated from the generic function tracer. We tried
it already by registering our own tracer in available_tracers, but that
way we're loosing ability to see the traces while we record them to
pstore. This solution is somewhere in the middle: we only register
"internal ftracer" back-end, but not the "front-end";
- When there is only pstore tracing enabled, the kernel will only write
to the pstore buffer, omitting function tracer buffer (which, of course,
still can be enabled via 'echo function > current_tracer').
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
The function graph has a test to check if the frame pointer is
corrupted, which can happen with various options of gcc with mcount.
But this is not an issue with -mfentry as -mfentry does not need nor use
frame pointers for function graph tracing.
Link: http://lkml.kernel.org/r/20120807194059.773895870@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Thanks to Andi Kleen, gcc 4.6.0 now supports -mfentry for x86
(and hopefully soon for other archs). What this does is to have
the function profiler start at the beginning of the function
instead of after the stack is set up. As plain -pg (mcount) is
called after the stack is set up, and in some cases can have issues
with the function graph tracer. It also requires frame pointers to
be enabled.
The -mfentry now calls __fentry__ at the beginning of the function.
This allows for compiling without frame pointers and even has the
ability to access parameters if needed.
If the architecture and the compiler both support -mfentry then
use that instead.
Link: http://lkml.kernel.org/r/20120807194059.392617243@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
. Fix include order for bison/flex-generated C files, from Ben Hutchings
. Build fixes and documentation corrections from David Ahern
. Group parsing support, from Jiri Olsa
. UI/gtk refactorings and improvements from Namhyung Kim
. NULL deref fix for perf script, from Namhyung Kim
. Assorted cleanups from Robert Richter
. Let O= makes handle relative paths, from Steven Rostedt
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJQMkGhAAoJENZQFvNTUqpAqjsQAJE5iD1LFogC8o/WjvRHz0TY
Y0x+sR/XfW61KYpeq5g+UaKuFU3P44ijCoyks3y5sza97DkYgUwMpEHlLXFSM8Pp
sNOapqY57s24nq3MLrhH1V9w+cSE+m2u/Gi5fGLCQekio9gkOBwYxNGk7vpKri/n
LBRsMozBu/mZjMy20uWOb7Uk8xsAToh+TFaAtjyQ9Snn9nNJj49NUAp37uN888H/
ducMLq32HN5v/6Zd3q6IWdDWgZsHLkIa3R5FIs/GNe3Dih07gtYLmDol4ktPbTFm
yoaWpP5wbtu/62EZlJwE393vMuoeqN/96394ZZQGFafhHVxN4+rcBhXbejBs0T2b
wk/0CzntW8bbUAI/cl3SB9aui//FWOxcjG9aDQ7PsmHzPw1Q4VD0F9Mcod4p+dRX
PsA9q/tST1eAiwzWYthDtj81U7iChINcXKhoZn2xn6+0+aMH+6FFNBmCH8MR5aCU
BvrXhTJjvau/Ym/sILl4Tf4wfssTq49yMsn/YKCwLJ0hg0XlTObWfQRy2MOayXH9
NJvUE+9GSXoTEKhmr1AfTYEG9vObaXZyFwAI74xvPPwUYojCb4ZjEKmG0egW+VGk
IJKFCaJZwwVsGau4aIbFAMP12/L8Qs/Ox91ddCJ0j5TIlSGMaqW5lbV1N1crzlTT
a0GsN49NvhbFttBXrcNX
=0a2X
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Fix include order for bison/flex-generated C files, from Ben Hutchings
* Build fixes and documentation corrections from David Ahern
* Group parsing support, from Jiri Olsa
* UI/gtk refactorings and improvements from Namhyung Kim
* NULL deref fix for perf script, from Namhyung Kim
* Assorted cleanups from Robert Richter
* Let O= makes handle relative paths, from Steven Rostedt
* perf script python fixes, from Feng Tang.
* Improve 'perf lock' error message when the needed tracepoints
are not present, from David Ahern.
* Initial bash completion support, from Frederic Weisbecker
* Allow building without libelf, from Namhyung Kim.
* Support DWARF CFI based unwind to have callchains when %bp
based unwinding is not possible, from Jiri Olsa.
* Symbol resolution fixes, while fixing support PPC64 files with an .opt ELF
section was the end goal, several fixes for code that handles all
architectures and cleanups are included, from Cody Schafer.
* Add a description for the JIT interface, from Andi Kleen.
* Assorted fixes for Documentation and build in 32 bit, from Robert Richter
* Add support for non-tracepoint events in perf script python, from Feng Tang
* Cache the libtraceevent event_format associated to each evsel early, so that we
avoid relookups, i.e. calling pevent_find_event repeatedly when processing
tracepoint events.
[ This is to reduce the surface contact with libtraceevents and make clear what
is that the perf tools needs from that lib: so far parsing the common and per
event fields. ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
syscall_get_nr can return -1 in the case that the task is not executing
a system call.
This patch fixes perf_syscall_{enter,exit} to check that the syscall
number is valid before using it as an index into a bitmap.
Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com
Cc: Jason Baron <jbaron@redhat.com>
Cc: Wade Farnsworth <wade_farnsworth@mentor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add missing initialization for ret variable. Its initialization
is based on the re_cnt variable, which is being set deep down
in the ftrace_function_filter_re function.
I'm not sure compilers would be smart enough to see this in near
future, so killing the warning this way.
Link: http://lkml.kernel.org/r/1340120894-9465-2-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The warkeup_rt self test used msleep() calls to wait for real time
tasks to wake up and run. On bare-metal hardware, this was enough as
the scheduler should let the RT task run way before the non-RT task
wakes up from the msleep(). If it did not, then that would mean the
scheduler was broken.
But when dealing with virtual machines, this is a different story.
If the RT task wakes up on a VCPU, it's up to the host to decide when
that task gets to schedule, which can be far behind the time that the
non-RT task wakes up. In this case, the test would fail incorrectly.
As we are not testing the scheduler, but instead the wake up tracing,
we can use completions to wait and not depend on scheduler timings
to see if events happen on time.
Link: http://lkml.kernel.org/r/1343663105.3847.7.camel@fedora
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf fixes from Ingo Molnar:
"Fix merge window fallout and fix sleep profiling (this was always
broken, so it's not a fix for the merge window - we can skip this one
from the head of the tree)."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/trace: Add ability to set a target task for events
perf/x86: Fix USER/KERNEL tagging of samples properly
perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bit
A few events are interesting not only for a current task.
For example, sched_stat_* events are interesting for a task
which wakes up. For this reason, it will be good if such
events will be delivered to a target task too.
Now a target task can be set by using __perf_task().
The original idea and a draft patch belongs to Peter Zijlstra.
I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Inspired-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a new filter update interface ftrace_set_filter_ip()
to set ftrace filter by ip address, not only glob pattern.
Link: http://lkml.kernel.org/r/20120605102808.27845.67952.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add selftests to test the save-regs functionality of ftrace.
If the arch supports saving regs, then it will make sure that regs is
at least not NULL in the callback.
If the arch does not support saving regs, it makes sure that the
registering of the ftrace_ops that requests saving regs fails.
It then tests the registering of the ftrace_ops succeeds if the
'IF_SUPPORTED' flag is set. Then it makes sure that the regs passed to
the function is NULL.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add selftests to test the function tracing recursion protection actually
does work. It also tests if a ftrace_ops states it will perform its own
protection. Although, even if the ftrace_ops states it will protect itself,
the ftrace infrastructure may still provide protection if the arch does
not support all features or another ftrace_ops is registered.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As more users of the function tracer utility are being added, they do
not always add the necessary recursion protection. To protect from
function recursion due to tracing, if the callback ftrace_ops does not
specifically specify that it protects against recursion (by setting
the FTRACE_OPS_FL_RECURSION_SAFE flag), the list operation will be
called by the mcount trampoline which adds recursion protection.
If the flag is set, then the function will be called directly with no
extra protection.
Note, the list operation is called if more than one function callback
is registered, or if the arch does not support all of the function
tracer features.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Here's the big staging tree merge for the 3.6-rc1 merge window.
There are some patches in here outside of drivers/staging/, notibly the iio
code (which is still stradeling the staging / not staging boundry), the pstore
code, and the tracing code. All of these have gotten ackes from the various
subsystem maintainers to be included in this tree. The pstore and tracing
patches are related, and are coming here as they replace one of the android
staging drivers.
Otherwise, the normal staging mess. Lots of cleanups and a few new drivers
(some iio drivers, and the large csr wireless driver abomination.)
Note, you will get a merge issue with the following files:
drivers/staging/comedi/drivers/s626.h
drivers/staging/gdm72xx/netlink_k.c
both of which should be trivial for you to handle.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iEYEABECAAYFAlAQiD8ACgkQMUfUDdst+ykxhgCeMUjvc+1RTtSprzvkzpejgoUU
6A4AnAleWMnkaCD8vruGnRdGl/Qtz51+
=mN6M
-----END PGP SIGNATURE-----
Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging tree patches from Greg Kroah-Hartman:
"Here's the big staging tree merge for the 3.6-rc1 merge window.
There are some patches in here outside of drivers/staging/, notibly
the iio code (which is still stradeling the staging / not staging
boundry), the pstore code, and the tracing code. All of these have
gotten acks from the various subsystem maintainers to be included in
this tree. The pstore and tracing patches are related, and are coming
here as they replace one of the android staging drivers.
Otherwise, the normal staging mess. Lots of cleanups and a few new
drivers (some iio drivers, and the large csr wireless driver
abomination.)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fixed up trivial conflicts in drivers/staging/comedi/drivers/s626.h and
drivers/staging/gdm72xx/netlink_k.c
* tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1108 commits)
staging: csr: delete a bunch of unused library functions
staging: csr: remove csr_utf16.c
staging: csr: remove csr_pmem.h
staging: csr: remove CsrPmemAlloc
staging: csr: remove CsrPmemFree()
staging: csr: remove CsrMemAllocDma()
staging: csr: remove CsrMemCalloc()
staging: csr: remove CsrMemAlloc()
staging: csr: remove CsrMemFree() and CsrMemFreeDma()
staging: csr: remove csr_util.h
staging: csr: remove CsrOffSetOf()
stating: csr: remove unneeded #includes in csr_util.c
staging: csr: make CsrUInt16ToHex static
staging: csr: remove CsrMemCpy()
staging: csr: remove CsrStrLen()
staging: csr: remove CsrVsnprintf()
staging: csr: remove CsrStrDup
staging: csr: remove CsrStrChr()
staging: csr: remove CsrStrNCmp
staging: csr: remove CsrStrCmp
...
Add a way to have different functions calling different trampolines.
If a ftrace_ops wants regs saved on the return, then have only the
functions with ops registered to save regs. Functions registered by
other ops would not be affected, unless the functions overlap.
If one ftrace_ops registered functions A, B and C and another ops
registered fucntions to save regs on A, and D, then only functions
A and D would be saving regs. Function B and C would work as normal.
Although A is registered by both ops: normal and saves regs; this is fine
as saving the regs is needed to satisfy one of the ops that calls it
but the regs are ignored by the other ops function.
x86_64 implements the full regs saving, and i386 just passes a NULL
for regs to satisfy the ftrace_ops passing. Where an arch must supply
both regs and ftrace_ops parameters, even if regs is just NULL.
It is OK for an arch to pass NULL regs. All function trace users that
require regs passing must add the flag FTRACE_OPS_FL_SAVE_REGS when
registering the ftrace_ops. If the arch does not support saving regs
then the ftrace_ops will fail to register. The flag
FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED may be set that will prevent the
ftrace_ops from failing to register. In this case, the handler may
either check if regs is not NULL or check if ARCH_SUPPORTS_FTRACE_SAVE_REGS.
If the arch supports passing regs it will set this macro and pass regs
for ops that request them. All other archs will just pass NULL.
Link: Link: http://lkml.kernel.org/r/20120711195745.107705970@goodmis.org
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Return as the 4th paramater to the function tracer callback the pt_regs.
Later patches that implement regs passing for the architectures will require
having the ftrace_ops set the SAVE_REGS flag, which will tell the arch
to take the time to pass a full set of pt_regs to the ftrace_ops callback
function. If the arch does not support it then it should pass NULL.
If an arch can pass full regs, then it should define:
ARCH_SUPPORTS_FTRACE_SAVE_REGS to 1
Link: http://lkml.kernel.org/r/20120702201821.019966811@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the function tracer starts to get more features, the support for
theses features will spread out throughout the different architectures
over time. These features boil down to what each arch does in the
mcount trampoline (the ftrace_caller).
Currently there's two features that are not the same throughout the
archs.
1) Support to stop function tracing before the callback
2) passing of the ftrace ops
Both of these require placing an indirect function to support the
features if the mcount trampoline does not.
On a side note, for all architectures, when more than one callback
is registered to the function tracer, an intermediate 'list' function
is called by the mcount trampoline to iterate through the callbacks
that are registered.
Instead of making a separate function for each of these features,
and requiring several indirect calls, just use the single 'list' function
as the intermediate, to handle all cases. If an arch does not support
the 'stop function tracing' or the passing of ftrace ops, just force
it to use the list function that will handle the features required.
This makes the code cleaner and simpler and removes a lot of
#ifdefs in the code.
Link: http://lkml.kernel.org/r/20120612225424.495625483@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the function trace callback receives only the ip and parent_ip
of the function that it traced. It would be more powerful to also return
the ops that registered the function as well. This allows the same function
to act differently depending on what ftrace_ops registered it.
Link: http://lkml.kernel.org/r/20120612225424.267254552@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the function accepts just one bit, we can use the switch
construction instead of if/else if/...
Just a cosmetic change, there should be no functional changes.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch introduces 'func_ptrace' option, now available in
/sys/kernel/debug/tracing/options when function tracer
is selected.
The patch also adds some tiny code that calls back to pstore
to record the trace. The callback is no-op when PSTORE=n.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If tracer->init() fails, current code will leave current_tracer pointing
to an unusable tracer, which at best makes 'current_tracer' report
inaccurate value.
Fix the issue by pointing current_tracer to nop tracer, and only update
current_tracer with the new one after all the initialization succeeds.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull RCU, perf, and scheduler fixes from Ingo Molnar.
The RCU fix is a revert for an optimization that could cause deadlocks.
One of the scheduler commits (164c33c6ad "sched: Fix fork() error path
to not crash") is correct but not complete (some architectures like Tile
are not covered yet) - the resulting additional fixes are still WIP and
Ingo did not want to delay these pending fixes. See this thread on
lkml:
[PATCH] fork: fix error handling in dup_task()
The perf fixes are just trivial oneliners.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf kvm: Fix segfault with report and mixed guestmount use
perf kvm: Fix regression with guest machine creation
perf script: Fix format regression due to libtraceevent merge
ring-buffer: Fix accounting of entries when removing pages
ring-buffer: Fix crash due to uninitialized new_pages list head
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS/sched: Update scheduler file pattern
sched/nohz: Rewrite and fix load-avg computation -- again
sched: Fix fork() error path to not crash
Clean up and return -ENOMEM on if the kzalloc() fails.
This also prevents a potential crash, as the pointer that failed to
allocate would be later used.
Link: http://lkml.kernel.org/r/20120711063507.GF11812@elgon.mountain
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull block bits from Jens Axboe:
"As vacation is coming up, thought I'd better get rid of my pending
changes in my for-linus branch for this iteration. It contains:
- Two patches for mtip32xx. Killing a non-compliant sysfs interface
and moving it to debugfs, where it belongs.
- A few patches from Asias. Two legit bug fixes, and one killing an
interface that is no longer in use.
- A patch from Jan, making the annoying partition ioctl warning a bit
less annoying, by restricting it to !CAP_SYS_RAWIO only.
- Three bug fixes for drbd from Lars Ellenberg.
- A fix for an old regression for umem, it hasn't really worked since
the plugging scheme was changed in 3.0.
- A few fixes from Tejun.
- A splice fix from Eric Dumazet, fixing an issue with pipe
resizing."
* 'for-linus' of git://git.kernel.dk/linux-block:
scsi: Silence unnecessary warnings about ioctl to partition
block: Drop dead function blk_abort_queue()
block: Mitigate lock unbalance caused by lock switching
block: Avoid missed wakeup in request waitqueue
umem: fix up unplugging
splice: fix racy pipe->buffers uses
drbd: fix null pointer dereference with on-congestion policy when diskless
drbd: fix list corruption by failing but already aborted reads
drbd: fix access of unallocated pages and kernel panic
xen/blkfront: Add WARN to deal with misbehaving backends.
blkcg: drop local variable @q from blkg_destroy()
mtip32xx: Create debugfs entries for troubleshooting
mtip32xx: Remove 'registers' and 'flags' from sysfs
blkcg: fix blkg_alloc() failure path
block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED
block: fix return value on cfq_init() failure
mtip32xx: Remove version.h header file inclusion
xen/blkback: Copy id field when doing BLKIF_DISCARD.
When removing pages from the ring buffer, its state is not reset. This
means that the counters need to be correctly updated to account for the
pages removed.
Update the overrun counter to reflect the removed events from the pages.
Link: http://lkml.kernel.org/r/1340998301-1715-1-git-send-email-vnagarnaik@google.com
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The new_pages list head in the cpu_buffer is not initialized. When
adding pages to the ring buffer, if the memory allocation fails in
ring_buffer_resize, the clean up handler tries to free up the allocated
pages from all the cpu buffers. The panic is caused by referencing the
uninitialized new_pages list head.
Initializing the new_pages list head in rb_allocate_cpu_buffer fixes
this.
Link: http://lkml.kernel.org/r/1340391005-10880-1-git-send-email-vnagarnaik@google.com
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ring buffer reader page is used to swap a page from the writable
ring buffer. If the writer happens to be on that page, it ends up on the
reader page, but will simply move off of it, back into the writable ring
buffer as writes are added.
The time stamp passed back to the readers is stored in the cpu_buffer per
CPU descriptor. This stamp is updated when a swap of the reader page takes
place, and it reads the current stamp from the page taken from the writable
ring buffer. Everytime a writer goes to a new page, it updates the time stamp
of that page.
The problem happens if a reader reads a page from an empty per CPU ring buffer.
If the buffer is empty, the swap still takes place, placing the writer at the
start of the reader page. If at a later time, a write happens, it updates the
page's time stamp and continues. But the problem is that the read_stamp does
not get updated, because the page was already swapped.
The solution to this was to not swap the page if the ring buffer happens to
be empty. This also removes the side effect that the writes on the reader
page will not get updated because the writer never gets back on the reader
page without a swap. That is, if a read happens on an empty buffer, but then
no reads happen for a while. If a swap took place, and the writer were to start
writing a lot of data (function tracer), it will start overflowing the ring buffer
and overwrite the older data. But because the writer never goes back onto the
reader page, the data left on the reader page never gets overwritten. This
causes the reader to see really old data, followed by a jump to newer data.
Link: http://lkml.kernel.org/r/1340060577-9112-1-git-send-email-dhsharp@google.com
Google-Bug-Id: 6410455
Reported-by: David Sharp <dhsharp@google.com>
tested-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Replace the NR_CPUS array of buffer_iter from the trace_iterator
with an allocated array. This will just create an array of
possible CPUS instead of the max number specified.
The use of NR_CPUS in that array caused allocation failures for
machines that were tight on memory. This did not cause any failures
to the system itself (no crashes), but caused unnecessary failures
for reading the trace files.
Added a helper function called 'trace_buffer_iter()' that returns
the buffer_iter item or NULL if it is not defined or the array was
not allocated. Some routines do not require the array
(tracing_open_pipe() for one).
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a WARN_ON() output on test failures so that they are easier to detect
in automated tests. Although, the WARN_ON() will not print if the test
causes the system to crash, obviously.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All trace events including ftrace internel events (like trace_printk
and function tracing), register functions that describe how to print
their output. The events may be recorded as soon as the ring buffer
is allocated, but they are just raw binary in the buffer. The mapping
of event ids to how to print them are held within a structure that
is registered on system boot.
If a crash happens in boot up before these functions are registered
then their output (via ftrace_dump_on_oops) will be useless:
Dumping ftrace buffer:
---------------------------------
<...>-1 0.... 319705us : Unknown type 6
---------------------------------
This can be quite frustrating for a kernel developer trying to see
what is going wrong.
There's no reason to register them so late in the boot up process.
They can be registered by early_initcall().
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
register_ftrace_function() checks ftrace_disabled and calls
__register_ftrace_function which does it again.
Drop the first check and add the unlikely hint to the second one. Also,
drop the label as John correctly notices.
No functional change.
Link: http://lkml.kernel.org/r/20120329171140.GE6409@aftab
Cc: Borislav Petkov <bp@amd64.org>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe()
commit 35f3d14dbb (pipe: add support for shrinking and growing pipes)
added capability to adjust pipe->buffers.
Problem is some paths don't hold pipe mutex and assume pipe->buffers
doesn't change for their duration.
Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
use it in place of pipe->buffers where appropriate.
splice_shrink_spd() loses its struct pipe_inode_info argument.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tom Herbert <therbert@google.com>
Cc: stable <stable@vger.kernel.org> # 2.6.35
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A recent update to have tracing_on/off() only affect the ftrace ring
buffers instead of all ring buffers had a cut and paste error.
The tracing_off() did the exact same thing as tracing_on() and
would not actually turn off tracing. Unfortunately, tracing_off()
is more important to be working than tracing_on() as this is a key
development tool, as it lets the developer turn off tracing as soon
as a problem is discovered. It is also used by panic and oops code.
This bug also breaks the 'echo func:traceoff > set_ftrace_filter'
Cc: <stable@vger.kernel.org> # 3.4
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull perf updates from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
perf ui browser: Stop using 'self'
perf annotate browser: Read perf config file for settings
perf config: Allow '_' in config file variable names
perf annotate browser: Make feature toggles global
perf annotate browser: The idx_asm field should be used in asm only view
perf tools: Convert critical messages to ui__error()
perf ui: Make --stdio default when TUI is not supported
tools lib traceevent: Silence compiler warning on 32bit build
perf record: Fix branch_stack type in perf_record_opts
perf tools: Reconstruct event with modifiers from perf_event_attr
perf top: Fix counter name fixup when fallbacking to cpu-clock
perf tools: fix thread_map__new_by_pid_str() memory leak in error path
perf tools: Do not use _FORTIFY_SOURCE when DEBUG=1 is specified
tools lib traceevent: Fix signature of create_arg_item()
tools lib traceevent: Use proper function parameter type
tools lib traceevent: Fix freeing arg on process_dynamic_array()
tools lib traceevent: Fix a possibly wrong memory dereference
tools lib traceevent: Fix a possible memory leak
tools lib traceevent: Allow expressions in __print_symbolic() fields
perf evlist: Explicititely initialize input_name
...
Pull user-space probe instrumentation from Ingo Molnar:
"The uprobes code originates from SystemTap and has been used for years
in Fedora and RHEL kernels. This version is much rewritten, reviews
from PeterZ, Oleg and myself shaped the end result.
This tree includes uprobes support in 'perf probe' - but SystemTap
(and other tools) can take advantage of user probe points as well.
Sample usage of uprobes via perf, for example to profile malloc()
calls without modifying user-space binaries.
First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.
If you don't know which function you want to probe you can pick one
from 'perf top' or can get a list all functions that can be probed
within libc (binaries can be specified as well):
$ perf probe -F -x /lib/libc.so.6
To probe libc's malloc():
$ perf probe -x /lib64/libc.so.6 malloc
Added new event:
probe_libc:malloc (on 0x7eac0)
You can now use it in all perf tools, such as:
perf record -e probe_libc:malloc -aR sleep 1
Make use of it to create a call graph (as the flat profile is going to
look very boring):
$ perf record -e probe_libc:malloc -gR make
[ perf record: Woken up 173 times to write data ]
[ perf record: Captured and wrote 44.190 MB perf.data (~1930712
$ perf report | less
32.03% git libc-2.15.so [.] malloc
|
--- malloc
29.49% cc1 libc-2.15.so [.] malloc
|
--- malloc
|
|--0.95%-- 0x208eb1000000000
|
|--0.63%-- htab_traverse_noresize
11.04% as libc-2.15.so [.] malloc
|
--- malloc
|
7.15% ld libc-2.15.so [.] malloc
|
--- malloc
|
5.07% sh libc-2.15.so [.] malloc
|
--- malloc
|
4.99% python-config libc-2.15.so [.] malloc
|
--- malloc
|
4.54% make libc-2.15.so [.] malloc
|
--- malloc
|
|--7.34%-- glob
| |
| |--93.18%-- 0x41588f
| |
| --6.82%-- glob
| 0x41588f
...
Or:
$ perf report -g flat | less
# Overhead Command Shared Object Symbol
# ........ ............. ............. ..........
#
32.03% git libc-2.15.so [.] malloc
27.19%
malloc
29.49% cc1 libc-2.15.so [.] malloc
24.77%
malloc
11.04% as libc-2.15.so [.] malloc
11.02%
malloc
7.15% ld libc-2.15.so [.] malloc
6.57%
malloc
...
The core uprobes design is fairly straightforward: uprobes probe
points register themselves at (inode:offset) addresses of
libraries/binaries, after which all existing (or new) vmas that map
that address will have a software breakpoint injected at that address.
vmas are COW-ed to preserve original content. The probe points are
kept in an rbtree.
If user-space executes the probed inode:offset instruction address
then an event is generated which can be recovered from the regular
perf event channels and mmap-ed ring-buffer.
Multiple probes at the same address are supported, they create a
dynamic callback list of event consumers.
The basic model is further complicated by the XOL speedup: the
original instruction that is probed is copied (in an architecture
specific fashion) and executed out of line when the probe triggers.
The XOL area is a single vma per process, with a fixed number of
entries (which limits probe execution parallelism).
The API: uprobes are installed/removed via
/sys/kernel/debug/tracing/uprobe_events, the API is integrated to
align with the kprobes interface as much as possible, but is separate
to it.
Injecting a probe point is privileged operation, which can be relaxed
by setting perf_paranoid to -1.
You can use multiple probes as well and mix them with kprobes and
regular PMU events or tracepoints, when instrumenting a task."
Fix up trivial conflicts in mm/memory.c due to previous cleanup of
unmap_single_vma().
* 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
perf probe: Detect probe target when m/x options are absent
perf probe: Provide perf interface for uprobes
tracing: Fix kconfig warning due to a typo
tracing: Provide trace events interface for uprobes
tracing: Extract out common code for kprobes/uprobes trace events
tracing: Modify is_delete, is_return from int to bool
uprobes/core: Decrement uprobe count before the pages are unmapped
uprobes/core: Make background page replacement logic account for rss_stat counters
uprobes/core: Optimize probe hits with the help of a counter
uprobes/core: Allocate XOL slots for uprobes use
uprobes/core: Handle breakpoint and singlestep exceptions
uprobes/core: Rename bkpt to swbp
uprobes/core: Make order of function parameters consistent across functions
uprobes/core: Make macro names consistent
uprobes: Update copyright notices
uprobes/core: Move insn to arch specific structure
uprobes/core: Remove uprobe_opcode_sz
uprobes/core: Make instruction tables volatile
uprobes: Move to kernel/events/
uprobes/core: Clean up, refactor and improve the code
...
Pull an ftrace ring-buffer fix from Steve Rostedt:
* fix kernel crash when changing the size of the ring-buffer on
boxes where possible_cpus != online_cpus.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On some machines the number of possible CPUS is not the same as the
number of CPUs that is on the machine. Ftrace uses possible_cpus to
update the tracing structures but the ring buffer only allocates
per cpu buffers for online CPUs when they come up.
When the wakeup tracer was enabled in such a case, the ftrace code
enabled all possible cpu buffers, but the code in ring_buffer_resize()
did not check to see if the buffer in question was allocated. Since
boot up CPUs did not match possible CPUs it caused the following
crash:
BUG: unable to handle kernel NULL pointer dereference at 00000020
IP: [<c1097851>] ring_buffer_resize+0x16a/0x28d
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: [last unloaded: scsi_wait_scan]
Pid: 1387, comm: bash Not tainted 3.4.0-test+ #13 /DG965MQ
EIP: 0060:[<c1097851>] EFLAGS: 00010217 CPU: 0
EIP is at ring_buffer_resize+0x16a/0x28d
EAX: f5a14340 EBX: f6026b80 ECX: 00000ff4 EDX: 00000ff3
ESI: 00000000 EDI: 00000002 EBP: f4275ecc ESP: f4275eb0
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 80050033 CR2: 00000020 CR3: 34396000 CR4: 000007d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process bash (pid: 1387, ti=f4274000 task=f4380cb0 task.ti=f4274000)
Stack:
c109cf9a f6026b98 00000162 00160f68 00000006 00160f68 00000002 f4275ef0
c109d013 f4275ee8 c123b72a c1c0bf00 c1cc81dc 00000005 f4275f98 00000007
f4275f70 c109d0c7 7700000e 75656b61 00000070 f5e90900 f5c4e198 00000301
Call Trace:
[<c109cf9a>] ? tracing_set_tracer+0x115/0x1e9
[<c109d013>] tracing_set_tracer+0x18e/0x1e9
[<c123b72a>] ? _copy_from_user+0x30/0x46
[<c109d0c7>] tracing_set_trace_write+0x59/0x7f
[<c10ec01e>] ? fput+0x18/0x1c6
[<c11f8732>] ? security_file_permission+0x27/0x2b
[<c10eaacd>] ? rw_verify_area+0xcf/0xf2
[<c10ec01e>] ? fput+0x18/0x1c6
[<c109d06e>] ? tracing_set_tracer+0x1e9/0x1e9
[<c10ead77>] vfs_write+0x8b/0xe3
[<c10ebead>] ? fget_light+0x30/0x81
[<c10eaf54>] sys_write+0x42/0x63
[<c1834fbf>] sysenter_do_call+0x12/0x28
This happens with the latency tracer as the ftrace code updates the
saved max buffer via its cpumask and not with a global setting.
Adding a check in ring_buffer_resize() to make sure the buffer being resized
exists, fixes the problem.
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull trivial updates from Jiri Kosina:
"As usual, it's mostly typo fixes, redundant code elimination and some
documentation updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
edac, mips: don't change code that has been removed in edac/mips tree
xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
lib: Change mail address of Oskar Schirmer
net: Change mail address of Oskar Schirmer
arm/m68k: Change mail address of Sebastian Hess
i2c: Change mail address of Oskar Schirmer
net: Fix tcp_build_and_update_options comment in struct tcp_sock
atomic64_32.h: fix parameter naming mismatch
Kconfig: replace "--- help ---" with "---help---"
c2port: fix bogus Kconfig "default no"
edac: Fix spelling errors.
qla1280: Remove redundant NULL check before release_firmware() call
remoteproc: remove redundant NULL check before release_firmware()
qla2xxx: Remove redundant NULL check before release_firmware() call.
aic94xx: Get rid of redundant NULL check before release_firmware() call
tehuti: delete redundant NULL check before release_firmware()
qlogic: get rid of a redundant test for NULL before call to release_firmware()
bna: remove redundant NULL test before release_firmware()
tg3: remove redundant NULL test before release_firmware() call
typhoon: get rid of redundant conditional before all to release_firmware()
...
Pull perf changes from Ingo Molnar:
"Lots of changes:
- (much) improved assembly annotation support in perf report, with
jump visualization, searching, navigation, visual output
improvements and more.
- kernel support for AMD IBS PMU hardware features. Notably 'perf
record -e cycles:p' and 'perf top -e cycles:p' should work without
skid now, like PEBS does on the Intel side, because it takes
advantage of IBS transparently.
- the libtracevents library: it is the first step towards unifying
tracing tooling and perf, and it also gives a tracing library for
external tools like powertop to rely on.
- infrastructure: various improvements and refactoring of the UI
modules and related code
- infrastructure: cleanup and simplification of the profiling
targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.)
- tons of robustness fixes all around
- various ftrace updates: speedups, cleanups, robustness
improvements.
- typing 'make' in tools/ will now give you a menu of projects to
build and a short help text to explain what each does.
- ... and lots of other changes I forgot to list.
The perf record make bzImage + perf report regression you reported
should be fixed."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits)
tracing: Remove kernel_lock annotations
tracing: Fix initial buffer_size_kb state
ring-buffer: Merge separate resize loops
perf evsel: Create events initially disabled -- again
perf tools: Split term type into value type and term type
perf hists: Fix callchain ip printf format
perf target: Add uses_mmap field
ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
ftrace: Make ftrace_modify_all_code() global for archs to use
ftrace: Return record ip addr for ftrace_location()
ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
ftrace: Speed up search by skipping pages by address
ftrace: Remove extra helper functions
ftrace: Sort all function addresses, not just per page
tracing: change CPU ring buffer state from tracing_cpumask
tracing: Check return value of tracing_dentry_percpu()
ring-buffer: Reset head page before running self test
ring-buffer: Add integrity check at end of iter read
ring-buffer: Make addition of pages in ring buffer atomic
...
Pull workqueue changes from Tejun Heo:
"Nothing exciting. Most are updates to debug stuff and related fixes.
Two not-too-critical bugs are fixed - WARN_ON() triggering spurious
during cpu offlining and unlikely lockdep related oops."
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
lockdep: fix oops in processing workqueue
workqueue: skip nr_running sanity check in worker_enter_idle() if trustee is active
workqueue: Catch more locking problems with flush_work()
workqueue: change BUG_ON() to WARN_ON()
trace: Remove unused workqueue tracer
Fixes for perf/core:
- Rename some perf_target methods to avoid double negation, from Namhyung Kim.
- Revert change to use per task events with inheritance, from Namhyung Kim.
- Events should start disabled till children starts running, from David Ahern.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make sure that the state of buffer_size_kb is initialized correctly and
returns actual size of the ring buffer.
Link: http://lkml.kernel.org/r/1336066834-1673-1-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are 2 separate loops to resize cpu buffers that are online and
offline. Merge them to make the code look better.
Also change the name from update_completion to update_done to allow
shorter lines.
Link: http://lkml.kernel.org/r/1337372991-14783-1-git-send-email-vnagarnaik@google.com
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge reason: We are going to queue up a dependent patch:
"perf tools: Move parse event automated tests to separated object"
That depends on:
commit e7c72d8
perf tools: Add 'G' and 'H' modifiers to event parsing
Conflicts:
tools/perf/builtin-stat.c
Conflicted with the recent 'perf_target' patches when checking the
result of perf_evsel open routines to see if a retry is needed to cope
with older kernels where the exclude guest/host perf_event_attr bits
were not used.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The function tracer will enable the -pg option with gcc, which requires
that frame pointers. When FRAME_POINTER is defined in the kernel config
it adds the gcc option -fno-omit-frame-pointer which causes some problems
on some architectures. For those architectures, the FRAME_POINTER select
was not set.
When FUNCTION_TRACER was selected on these architectures that can not have
-fno-omit-frame-pointer, the -pg option is still set. But when
FRAME_POINTER is not selected, the kernel config would add the gcc option
-fomit-frame-pointer. Adding this option is incompatible with -pg
even on archs that do not need frame pointers with -pg.
The answer to this was to just not add either -fno-omit-frame-pointer
or -fomit-frame-pointer on these archs that want function tracing
but do not set FRAME_POINTER.
As it turns out, for archs that require frame pointers for function
tracing, the same can be used. If gcc requires frame pointers with
-pg, it will simply add it. The best thing to do is not select FRAME_POINTER
when function tracing is selected, and let gcc add it if needed.
Only add the -fno-omit-frame-pointer when something else selects
FRAME_POINTER, but do not add -fomit-frame-pointer if function tracing
is selected.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To remove duplicate code, have the ftrace arch_ftrace_update_code()
use the generic ftrace_modify_all_code(). This requires that the
default ftrace_replace_code() becomes a weak function so that an
arch may override it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Rename __ftrace_modify_code() to ftrace_modify_all_code() and make
it global for all archs to use. This will remove the duplication
of code, as archs that can modify code without stop_machine()
can use it directly outside of the stop_machine() call.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_location() is passed an addr, and returns 1 if the addr is
on a ftrace nop (or caller to ftrace_caller), and 0 otherwise.
To let kprobes know if it should move a breakpoint or not, it
must return the actual addr that is the start of the ftrace nop.
This way a kprobe placed on the location of a ftrace nop, can
instead be placed on the instruction after the nop. Even if the
probe addr is on the second or later byte of the nop, it can
simply be moved forward.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both ftrace_location() and ftrace_text_reserved() do basically the same thing.
They search to see if an address is in the ftace table (contains an address
that may change from nop to call ftrace_caller). The difference is
that ftrace_location() searches a single address, but ftrace_text_reserved()
searches a range.
This also makes the ftrace_text_reserved() faster as it now uses a bsearch()
instead of linearly searching all the addresses within a page.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As all records in a page of the ftrace table are sorted, we can
speed up the search algorithm by checking if the address to look for
falls in between the first and last record ip on the page.
This speeds up both the ftrace_location() and ftrace_text_reserved()
algorithms, as it can skip full pages when the search address is
not in them.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_record_ip() and ftrace_alloc_dyn_node() were from the
time of the ftrace daemon. Although they were still used, they
still make things a bit more complex than necessary.
Move the code into the one function that uses it, and remove the
helper functions.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of just sorting the ip's of the functions per ftrace page,
sort the entire list before adding them to the ftrace pages.
This will allow the bsearch algorithm to be sped up as it can
also sort by pages, not just records within a page.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to Documentation/trace/ftrace.txt:
tracing_cpumask:
This is a mask that lets the user only trace
on specified CPUS. The format is a hex string
representing the CPUS.
The tracing_cpumask currently doesn't affect the tracing state of
per-CPU ring buffers.
This patch enables/disables CPU recording as its corresponding bit in
tracing_cpumask is set/unset.
Link: http://lkml.kernel.org/r/1336096792-25373-3-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If tracing_dentry_percpu() failed, tracing_init_debugfs_percpu()
will try to create each cpu directories on debugfs' root directory
as d_percpu is NULL.
Link: http://lkml.kernel.org/r/1335143517-2285-1-git-send-email-namhyung.kim@lge.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the ring buffer does its consistency test on itself, it
removes the head page, runs the tests, and then adds it back
to what the "head_page" pointer was. But because the head_page
pointer may lack behind the real head page (held by the link
list pointer). The reset may be incorrect.
Instead, if the head_page exists (it does not on first allocation)
reset it back to the real head page before running the consistency
tests. Then it will be put back to its original location after
the tests are complete.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There use to be ring buffer integrity checks after updating the
size of the ring buffer. But now that the ring buffer can modify
the size while the system is running, the integrity checks were
removed, as they require the ring buffer to be disabed to perform
the check.
Move the integrity check to the reading of the ring buffer via the
iterator reads (the "trace" file). As reading via an iterator requires
disabling the ring buffer, it is a perfect place to have it.
If the ring buffer happens to be disabled when updating the size,
we still perform the integrity check.
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds the capability to add new pages to a ring buffer
atomically while write operations are going on. This makes it possible
to expand the ring buffer size without reinitializing the ring buffer.
The new pages are attached between the head page and its previous page.
Link: http://lkml.kernel.org/r/1336096792-25373-2-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds the capability to remove pages from a ring buffer
without destroying any existing data in it.
This is done by removing the pages after the tail page. This makes sure
that first all the empty pages in the ring buffer are removed. If the
head page is one in the list of pages to be removed, then the page after
the removed ones is made the head page. This removes the oldest data
from the ring buffer and keeps the latest data around to be read.
To do this in a non-racey manner, tracing is stopped for a very short
time while the pages to be removed are identified and unlinked from the
ring buffer. The pages are freed after the tracing is restarted to
minimize the time needed to stop tracing.
The context in which the pages from the per-cpu ring buffer are removed
runs on the respective CPU. This minimizes the events not traced to only
NMI trace contexts.
Link: http://lkml.kernel.org/r/1336096792-25373-1-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Justin Teravest <teravest@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On gcc 4.5 the function tracing_mark_write() would give a warning
of page2 being uninitialized. This is due to a bug in gcc because
the logic prevents page2 from being used uninitialized, and
gcc 4.6+ does not complain (correctly).
Instead of adding a "unitialized" around page2, which could show
a bug later on, I combined page1 and page2 into an array map_pages[].
This binds the two and the two are modified according to nr_pages
(what gcc 4.5 seems to ignore). This no longer gives a warning with
gcc 4.5 nor with gcc 4.6.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the adding of function tracing event to perf, it caused a
side effect that produces the following warning when enabling all
events in ftrace:
# echo 1 > /sys/kernel/debug/tracing/events/enable
[console]
event trace: Could not enable event function
This is because when enabling all events via the debugfs system
it ignores events that do not have a ->reg() function assigned.
This was to skip over the ftrace internal events (as they are
not TRACE_EVENTs). But as the ftrace function event now has
a ->reg() function attached to it for use with perf, it is no
longer ignored.
Worse yet, this ->reg() function is being called when it should
not be. It returns an error and causes the above warning to
be printed.
By adding a new event_call flag (TRACE_EVENT_FL_IGNORE_ENABLE)
and have all ftrace internel event structures have it set,
setting the events/enable will no longe try to incorrectly enable
the function event and does not warn.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ftrace_disable_cpu() and ftrace_enable_cpu() functions were
needed back before the ring buffer was lockless. Now that the
ring buffer is lockless (and has been for some time), these functions
serve no purpose, and unnecessarily slow down operations of the tracer.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It's appropriate to use __seq_open_private interface to open
some of trace seq files, because it covers all steps we are
duplicating in tracing code - zallocating the iterator and
setting it as seq_file's private.
Using this for following files:
trace
available_filter_functions
enabled_functions
Link: http://lkml.kernel.org/r/1335342219-2782-5-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
[
Fixed warnings for:
kernel/trace/trace.c: In function '__tracing_open':
kernel/trace/trace.c:2418:11: warning: unused variable 'ret' [-Wunused-variable]
kernel/trace/trace.c:2417:19: warning: unused variable 'm' [-Wunused-variable]
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Implements trace_event support for uprobes. In its current form
it can be used to put probes at a specified offset in a file and
dump the required registers when the code flow reaches the
probed address.
The following example shows how to dump the instruction pointer
and %ax a register at the probed text address. Here we are
trying to probe zfree in /bin/zsh:
# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base
zfree # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120411103043.GB29437@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move parts of trace_kprobe.c that can be shared with upcoming
trace_uprobe.c. Common code to kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c. There are no functional changes.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120409091144.8343.76218.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
is_delete and is_return can take utmost 2 values and are better
of being a boolean than a int. There are no functional changes.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120409091133.8343.65289.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a debugfs entry under per_cpu/ folder for each cpu called
buffer_size_kb to control the ring buffer size for each CPU
independently.
If the global file buffer_size_kb is used to set size, the individual
ring buffers will be adjusted to the given size. The buffer_size_kb will
report the common size to maintain backward compatibility.
If the buffer_size_kb file under the per_cpu/ directory is used to
change buffer size for a specific CPU, only the size of the respective
ring buffer is updated. When tracing/buffer_size_kb is read, it reports
'X' to indicate that sizes of per_cpu ring buffers are not equivalent.
Link: http://lkml.kernel.org/r/1328212844-11889-1-git-send-email-vnagarnaik@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Justin Teravest <teravest@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
memcpy() returns a pointer to "bug". Hopefully, it's not NULL here or
we would already have Oopsed.
Link: http://lkml.kernel.org/r/20120420063145.GA22649@elgon.mountain
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, trace_printk() uses a single buffer to write into
to calculate the size and format needed to save the trace. To
do this safely in an SMP environment, a spin_lock() is taken
to only allow one writer at a time to the buffer. But this could
also affect what is being traced, and add synchronization that
would not be there otherwise.
Ideally, using percpu buffers would be useful, but since trace_printk()
is only used in development, having per cpu buffers for something
never used is a waste of space. Thus, the use of the trace_bprintk()
format section is changed to be used for static fmts as well as dynamic ones.
Then at boot up, we can check if the section that holds the trace_printk
formats is non-empty, and if it does contain something, then we
know a trace_printk() has been added to the kernel. At this time
the trace_printk per cpu buffers are allocated. A check is also
done at module load time in case a module is added that contains a
trace_printk().
Once the buffers are allocated, they are never freed. If you use
a trace_printk() then you should know what you are doing.
A buffer is made for each type of context:
normal
softirq
irq
nmi
The context is checked and the appropriate buffer is used.
This allows for totally lockless usage of trace_printk(),
and they no longer even disable interrupts.
Requested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While debugging a latency with someone on IRC (mirage335) on #linux-rt (OFTC),
we discovered that the stacktrace output of the latency tracers
(preemptirqsoff) was empty.
This bug was caused by the creation of the dynamic length stack trace
again (like commit 12b5da3 "tracing: Fix ent_size in trace output" was).
This bug is caused by the latency tracers requiring the next event
to determine the time between the current event and the next. But by
grabbing the next event, the iter->ent_size is set to the next event
instead of the current one. As the stacktrace event is the last event,
this makes the ent_size zero and causes nothing to be printed for
the stack trace. The dynamic stacktrace uses the ent_size to determine
how much of the stack can be printed. The ent_size of zero means
no stack.
The simple fix is to save the iter->ent_size before finding the next event.
Note, mirage335 asked to remain anonymous from LKML and git, so I will
not add the Reported-by and Tested-by tags, even though he did report
the issue and tested the fix.
Cc: stable@vger.kernel.org # 3.1+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The change to make tracing_on affect only the ftrace ring buffer, caused
a bug where it wont affect any ring buffer. The problem was that the buffer
of the trace_array was passed to the write function and not the trace array
itself.
The trace_array can change the buffer when running a latency tracer. If this
happens, then the buffer being disabled may not be the buffer currently used
by ftrace. This will cause the tracing_on file to become useless.
The simple fix is to pass the trace_array to the write function instead of
the buffer. Then the actual buffer may be changed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Today's -next fails to link for me:
kernel/built-in.o:(.data+0x178e50): undefined reference to `perf_ftrace_event_register'
It looks like multiple fixes have been merged for the issue fixed by
commit fa73dc9 (tracing: Fix build breakage without CONFIG_PERF_EVENTS)
though I can't identify the other changes that have gone in at the
minute, it's possible that the changes which caused the breakage fixed
by the previous commit got dropped but the fix made it in.
Link: http://lkml.kernel.org/r/1334307179-21255-1-git-send-email-broonie@opensource.wolfsonmicro.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This tracer was temporarily removed in 6416669 (workqueue:
temporarily remove workqueue tracing, 2010-06-29) but never
reinstated after concurrency managed workqueues were completed.
For almost two years it hasn't been compilable so it seems nobody
is using it. Delete it.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge batch of fixes from Andrew Morton:
"The simple_open() cleanup was held back while I wanted for laggards to
merge things.
I still need to send a few checkpoint/restore patches. I've been
wobbly about merging them because I'm wobbly about the overall
prospects for success of the project. But after speaking with Pavel
at the LSF conference, it sounds like they're further toward
completion than I feared - apparently davem is at the "has stopped
complaining" stage regarding the net changes. So I need to go back
and re-review those patchs and their (lengthy) discussion."
* emailed from Andrew Morton <akpm@linux-foundation.org>: (16 patches)
memcg swap: use mem_cgroup_uncharge_swap fix
backlight: add driver for DA9052/53 PMIC v1
C6X: use set_current_blocked() and block_sigmask()
MAINTAINERS: add entry for sparse checker
MAINTAINERS: fix REMOTEPROC F: typo
alpha: use set_current_blocked() and block_sigmask()
simple_open: automatically convert to simple_open()
scripts/coccinelle/api/simple_open.cocci: semantic patch for simple_open()
libfs: add simple_open()
hugetlbfs: remove unregister_filesystem() when initializing module
drivers/rtc/rtc-88pm860x.c: fix rtc irq enable callback
fs/xattr.c:setxattr(): improve handling of allocation failures
fs/xattr.c:listxattr(): fall back to vmalloc() if kmalloc() failed
fs/xattr.c: suppress page allocation failure warnings from sys_listxattr()
sysrq: use SEND_SIG_FORCED instead of force_sig()
proc: fix mount -t proc -o AAA
Many users of debugfs copy the implementation of default_open() when
they want to support a custom read/write function op. This leads to a
proliferation of the default_open() implementation across the entire
tree.
Now that the common implementation has been consolidated into libfs we
can replace all the users of this function with simple_open().
This replacement was done with the following semantic patch:
<smpl>
@ open @
identifier open_f != simple_open;
identifier i, f;
@@
-int open_f(struct inode *i, struct file *f)
-{
(
-if (i->i_private)
-f->private_data = i->i_private;
|
-f->private_data = i->i_private;
)
-return 0;
-}
@ has_open depends on open @
identifier fops;
identifier open.open_f;
@@
struct file_operations fops = {
...
-.open = open_f,
+.open = simple_open,
...
};
</smpl>
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading the trace file, the records of each of the per_cpu buffers
are examined to find the next event to print out. At the point of looking
at the event, the size of the event is recorded. But if the first event is
chosen, the other events in the other CPU buffers will reset the event size
that is stored in the iterator descriptor, causing the event size passed to
the output functions to be incorrect.
In most cases this is not a problem, but for the case of stack traces, it
is. With the change to the stack tracing to record a dynamic number of
back traces, the output depends on the size of the entry instead of the
fixed 8 back traces. When the entry size is not correct, the back traces
would not be fully printed.
Note, reading from the per-cpu trace files were not affected.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...
Today's -next fails to build for me:
CC kernel/trace/trace_export.o
In file included from kernel/trace/trace_export.c:197: kernel/trace/trace_entries.h:58: error: 'perf_ftrace_event_register' undeclared here (not in a function)
make[2]: *** [kernel/trace/trace_export.o] Error 1
make[1]: *** [kernel/trace] Error 2
make: *** [kernel] Error 2
because as of ced390 (ftrace, perf: Add support to use function
tracepoint in perf) perf_trace_event_register() is declared in trace.h
only if CONFIG_PERF_EVENTS is enabled but I don't have that set.
Ensure that we always have a definition of perf_trace_event_register()
by making the definition unconditional.
Link: http://lkml.kernel.org/r/1330426967-17067-1-git-send-email-broonie@opensource.wolfsonmicro.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When CONFIG_DYNAMIC_FTRACE is not set, some archs (ARM) test
the variable function_trace_function to determine if it should
call the function tracer. If it is not set to ftrace_stub, then
it will call the function and return, and not call the function
graph tracer.
But some of these archs (ARM) do not have the assembly code
to test if function tracing is enabled or not (quick stop of tracing)
and it calls the helper routine ftrace_test_stop_func() instead.
If function tracer is enabled and then disabled, the variable
ftrace_trace_function is still set to the helper routine
ftrace_test_stop_func(), and not to ftrace_stub. This will
prevent the function graph tracer from ever running.
Output before patch
/debug/tracing # echo function > current_tracer
/debug/tracing # echo function_graph > current_tracer
/debug/tracing # cat trace
Output after patch
/debug/tracing # echo function > current_tracer
/debug/tracing # echo function_graph > current_tracer
/debug/tracing # cat trace
0) ! 253.375 us | } /* irq_enter */
0) | generic_handle_irq() {
0) | handle_fasteoi_irq() {
0) 9.208 us | _raw_spin_lock();
0) | handle_irq_event() {
0) | handle_irq_event_percpu() {
Signed-off-by: Rajesh Bhagat <rajesh.lnx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_dump() (called by ftrace_dump_on_oops) disables interrupts
as it dumps its output to the console, it can keep interrupts disabled
for long periods of time. This is likely to trigger the NMI watchdog,
and it can disrupt the output of critical data.
Add a touch_nmi_watchdog() to each event that is written to the screen
to keep the NMI watchdog from affecting the output.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On PowerPC, FUNCTION_TRACER selects FRAME_POINTER, even
though the architecture does not support it.
This causes the following warning:
warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP && FUNCTION_TRACER && KMEMCHECK) selects FRAME_POINTER which has unmet direct dependencies (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 || SUPERH || BLACKFIN || MN10300) || ARCH_WANT_FRAME_POINTERS)
So remove the warning by adding the extra condition
"if !PPC" to FUNCTION_TRACER for FRAME_POINTER selection
Link: http://lkml.kernel.org/r/1330330101-8618-1-git-send-email-gerlando.falauto@keymile.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the ring-buffer code is being used by other facilities in the
kernel, having tracing_on file disable *all* buffers is not a desired
affect. It should only disable the ftrace buffers that are being used.
Move the code into the trace.c file and use the buffer disabling
for tracing_on() and tracing_off(). This way only the ftrace buffers
will be affected by them and other kernel utilities will not be
confused to why their output suddenly stopped.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding support to filter function trace event via perf
interface. It is now possible to use filter interface
in the perf tool like:
perf record -e ftrace:function --filter="(ip == mm_*)" ls
The filter syntax is restricted to the the 'ip' field only,
and following operators are accepted '==' '!=' '||', ending
up with the filter strings like:
ip == f1[, ]f2 ... || ip != f3[, ]f4 ...
with comma ',' or space ' ' as a function separator. If the
space ' ' is used as a separator, the right side of the
assignment needs to be enclosed in double quotes '"', e.g.:
perf record -e ftrace:function --filter '(ip == do_execve,sys_*,ext*)' ls
perf record -e ftrace:function --filter '(ip == "do_execve,sys_*,ext*")' ls
perf record -e ftrace:function --filter '(ip == "do_execve sys_* ext*")' ls
The '==' operator adds trace filter with same effect as would
be added via set_ftrace_filter file.
The '!=' operator adds trace filter with same effect as would
be added via set_ftrace_notrace file.
The right side of the '!=', '==' operators is list of functions
or regexp. to be added to filter separated by space.
The '||' operator is used for connecting multiple filter definitions
together. It is possible to have more than one '==' and '!='
operators within one filter string.
Link: http://lkml.kernel.org/r/1329317514-8131-8-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding FILTER_TRACE_FN event field type for function tracepoint
event, so it can be properly recognized within filtering code.
Currently all fields of ftrace subsystem events share the common
field type FILTER_OTHER. Since the function trace fields need
special care within the filtering code we need to recognize it
properly, hence adding the FILTER_TRACE_FN event type.
Adding filter parameter to the FTRACE_ENTRY macro, to specify the
filter field type for the event.
Link: http://lkml.kernel.org/r/1329317514-8131-7-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding perf registration support for the ftrace function event,
so it is now possible to register it via perf interface.
The perf_event struct statically contains ftrace_ops as a handle
for function tracer. The function tracer is registered/unregistered
in open/close actions.
To be efficient, we enable/disable ftrace_ops each time the traced
process is scheduled in/out (via TRACE_REG_PERF_(ADD|DELL) handlers).
This way tracing is enabled only when the process is running.
Intentionally using this way instead of the event's hw state
PERF_HES_STOPPED, which would not disable the ftrace_ops.
It is now possible to use function trace within perf commands
like:
perf record -e ftrace:function ls
perf stat -e ftrace:function ls
Allowed only for root.
Link: http://lkml.kernel.org/r/1329317514-8131-6-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding FTRACE_ENTRY_REG macro so particular ftrace entries
could specify registration function and thus become accesible
via perf.
This will be used in upcomming patch for function trace.
Link: http://lkml.kernel.org/r/1329317514-8131-5-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding TRACE_REG_PERF_ADD and TRACE_REG_PERF_DEL to handle
perf event schedule in/out actions.
The add action is invoked for when the perf event is scheduled in,
while the del action is invoked when the event is scheduled out.
Link: http://lkml.kernel.org/r/1329317514-8131-4-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding TRACE_REG_PERF_OPEN and TRACE_REG_PERF_CLOSE to differentiate
register/unregister from open/close actions.
The register/unregister actions are invoked for the first/last
tracepoint user when opening/closing the event.
The open/close actions are invoked for each tracepoint user when
opening/closing the event.
Link: http://lkml.kernel.org/r/1329317514-8131-3-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding a way to temporarily enable/disable ftrace_ops. The change
follows the same way as 'global' ftrace_ops are done.
Introducing 2 global ftrace_ops - control_ops and ftrace_control_list
which take over all ftrace_ops registered with FTRACE_OPS_FL_CONTROL
flag. In addition new per cpu flag called 'disabled' is also added to
ftrace_ops to provide the control information for each cpu.
When ftrace_ops with FTRACE_OPS_FL_CONTROL is registered, it is
set as disabled for all cpus.
The ftrace_control_list contains all the registered 'control' ftrace_ops.
The control_ops provides function which iterates ftrace_control_list
and does the check for 'disabled' flag on current cpu.
Adding 3 inline functions:
ftrace_function_local_disable/ftrace_function_local_enable
- enable/disable the ftrace_ops on current cpu
ftrace_function_local_disabled
- get disabled ftrace_ops::disabled value for current cpu
Link: http://lkml.kernel.org/r/1329317514-8131-2-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If more than one __print_*() function is used in a tracepoint
(__print_flags(), __print_symbols(), etc), then the temp seq buffer will
not be zero on entry. Using the temp seq buffer's length to know if
data has been printed or not in the current function is incorrect and
may produce incorrect results.
Currently, no in-tree tracepoint causes this bug, but new ones may
be created.
Cc: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If __print_flags() is used after another __print_*() function, the
temp seq_file buffer will not be empty on entry, and the delimiter will
be printed even though there's just one field. We get something like:
|S
instead of just:
S
This is because the length of the temp seq buffer is used to determine
if the delimiter is printed or not. But this algorithm fails when
the seq buffer is not empty on entry, and the delimiter will be printed
because it thinks that a previous field was already printed.
Link: http://lkml.kernel.org/r/1329650167-480655-1-git-send-email-avagin@openvz.org
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The advantage of kcalloc is, that will prevent integer overflows which could
result from the multiplication of number of elements and size and it is also
a bit nicer to read.
The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107
Link: http://lkml.kernel.org/r/1322600880.1534.347.camel@localhost.localdomain
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Actually, sched_switch function tracer is merged into wakeup/wakeup_rt
Update 'mini-HOWTO' for ftrace(Kernel function tracer).
If we want to trace "sched:sched_switch" to trace sched_switch func,
We may utilize event option.(e.g: trace-cmd list -e | grep sched)
This patch is based on Linux-3.3.rc2-SMP-PREEMPT
Link: http://lkml.kernel.org/r/1328695537-15081-1-git-send-email-geunsik.lim@gmail.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the ftrace_set_filter and ftrace_set_notrace functions
do not return any return code. So there's no way for ftrace_ops
user to tell wether the filter was correctly applied.
The set_ftrace_filter interface returns error in case the filter
did not match:
# echo krava > set_ftrace_filter
bash: echo: write error: Invalid argument
Changing both ftrace_set_filter and ftrace_set_notrace functions
to return zero if the filter was applied correctly or -E* values
in case of error.
Link: http://lkml.kernel.org/r/1325495060-6402-2-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
perf tools: Fix compile error on x86_64 Ubuntu
perf report: Fix --stdio output alignment when --showcpuutilization used
perf annotate: Get rid of field_sep check
perf annotate: Fix usage string
perf kmem: Fix a memory leak
perf kmem: Add missing closedir() calls
perf top: Add error message for EMFILE
perf test: Change type of '-v' option to INCR
perf script: Add missing closedir() calls
tracing: Fix compile error when static ftrace is enabled
recordmcount: Fix handling of elf64 big-endian objects.
perf tools: Add const.h to MANIFEST to make perf-tar-src-pkg work again
perf tools: Add support for guest/host-only profiling
perf kvm: Do guest-only counting by default
perf top: Don't update total_period on process_sample
perf hists: Stop using 'self' for struct hist_entry
perf hists: Rename total_session to total_period
x86: Add counter when debug stack is used with interrupts enabled
x86: Allow NMIs to hit breakpoints in i386
x86: Keep current stack in NMI breakpoints
...
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (106 commits)
perf kvm: Fix copy & paste error in description
perf script: Kill script_spec__delete
perf top: Fix a memory leak
perf stat: Introduce get_ratio_color() helper
perf session: Remove impossible condition check
perf tools: Fix feature-bits rework fallout, remove unused variable
perf script: Add generic perl handler to process events
perf tools: Use for_each_set_bit() to iterate over feature flags
perf tools: Unify handling of features when writing feature section
perf report: Accept fifos as input file
perf tools: Moving code in some files
perf tools: Fix out-of-bound access to struct perf_session
perf tools: Continue processing header on unknown features
perf tools: Improve macros for struct feature_ops
perf: builtin-record: Document and check that mmap_pages must be a power of two.
perf: builtin-record: Provide advice if mmap'ing fails with EPERM.
perf tools: Fix truncated annotation
perf script: look up thread using tid instead of pid
perf tools: Look up thread names for system wide profiling
perf tools: Fix comm for processes with named threads
...
There are four places where new filter for a given filter string is
created, which involves several different steps. This patch factors
those steps into create_[system_]filter() functions which in turn make
use of create_filter_{start|finish}() for common parts.
The only functional change is that if replace_filter_string() is
requested and fails, creation fails without any side effect instead of
being ignored.
Note that system filter is now installed after the processing is
complete which makes freeing before and then restoring filter string
on error unncessary.
-v2: Rebased to resolve conflict with 49aa29513e and updated both
create_filter() functions to always set *filterp instead of
requiring the caller to clear it to %NULL on entry.
Link: http://lkml.kernel.org/r/1323988305-1469-2-git-send-email-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add stacktrace_filter= to the kernel command line that lets
the user pick specific functions to check the stack on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Change set_ftrace_early_filter() to ftrace_set_early_filter()
and make it a global function. This will allow other subsystems
in the kernel to be able to enable function tracing at start
up and reuse the ftrace function parsing code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The stack_tracer is used to look at every function and check
if the current stack is bigger than the last recorded max stack size.
When a new max is found, then it saves that stack off.
Currently the stack tracer is limited by the global_ops of
the function tracer. As the stack tracer has nothing to do with
the ftrace function tracer, except that it uses it as its internal
engine, the stack tracer should have its own list.
A new file is added to the tracing debugfs directory called:
stack_trace_filter
that can be used to select which functions you want to check the stack
on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The set_ftrace_filter shows "hashed" functions, which are functions
that are added with operations to them (like traceon and traceoff).
As other subsystems may be able to show what functions they are
using for function tracing, the hash items should no longer
be shown just because the FILTER flag is set. As they have nothing
to do with other subsystems filters.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function tracer is set up to allow any other subsystem (like perf)
to use it. Ftrace already has a way to list what functions are enabled
by the global_ops. It would be very helpful to let other users of
the function tracer to be able to use the same code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are two types of hashes in the ftrace_ops; one type
is the filter_hash and the other is the notrace_hash. Either
one may be null, meaning it has no elements. But when elements
are added, the hash is allocated.
Throughout the code, a check needs to be made to see if a hash
exists or the hash has elements, but the check if the hash exists
is usually missing causing the possible "NULL pointer dereference bug".
Add a helper routine called "ftrace_hash_empty()" that returns
true if the hash doesn't exist or its count is zero. As they mean
the same thing.
Last-bug-reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When disabling the "notrace" records, that means we want to trace them.
If the notrace_hash is zero, it means that we want to trace all
records. But to disable a zero notrace_hash means nothing.
The check for the notrace_hash count was incorrect with:
if (hash && !hash->count)
return
With the correct comment above it that states that we do nothing
if the notrace_hash has zero count. But !hash also means that
the notrace hash has zero count. I think this was done to
protect against dereferencing NULL. But if !hash is true, then
we go through the following loop without doing a single thing.
Fix it to:
if (!hash || !hash->count)
return;
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that each set of pages in the function list are sorted by
ip, we can use bsearch to find a record within each set of pages.
This speeds up the ftrace_location() function by magnitudes.
For archs (like x86) that need to add a breakpoint at every function
that will be converted from a nop to a callback and vice versa,
the breakpoint callback needs to know if the breakpoint was for
ftrace or not. It requires finding the breakpoint ip within the
records. Doing a linear search is extremely inefficient. It is
a must to be able to do a fast binary search to find these locations.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Sort records by ip locations of the ftrace mcount calls on each of the
set of pages in the function list. This helps in localizing cache
usuage when updating the function locations, as well as gives us
the ability to quickly find an ip location in the list.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As new functions come in to be initalized from mcount to nop,
they are done by groups of pages. Whether it is the core kernel
or a module. There's no need to keep track of these on a per record
basis.
At startup, and as any module is loaded, the functions to be
traced are stored in a group of pages and added to the function
list at the end. We just need to keep a pointer to the first
page of the list that was added, and use that to know where to
start on the list for initializing functions.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allocate the mcount record pages as a group of pages as big
as can be allocated and waste no more than a single page.
Grouping the mcount pages as much as possible helps with cache
locality, as we do not need to redirect with descriptors as we
cross from page to page. It also allows us to do more with the
records later on (sort them with bigger benefits).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Records that are added to the function trace table are
permanently there, except for modules. By separating out the
modules to their own pages that can be freed in one shot
we can remove the "freed" flag and simplify some of the record
management.
Another benefit of doing this is that we can also move the
records around; sort them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The stop machine method to modify all functions in the kernel
(some 20,000 of them) is the safest way to do so across all archs.
But some archs may not need this big hammer approach to modify code
on SMP machines, and can simply just update the code it needs.
Adding a weak function arch_ftrace_update_code() that now does the
stop machine, will also let any arch override this method.
If the arch needs to check the system and then decide if it can
avoid stop machine, it can still call ftrace_run_stop_machine() to
use the old method.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Multiple users of the function tracer can register their functions
with the ftrace_ops structure. The accounting within ftrace will
update the counter on each function record that is being traced.
When the ftrace_ops filtering adds or removes functions, the
function records will be updated accordingly if the ftrace_ops is
still registered.
When a ftrace_ops is removed, the counter of the function records,
that the ftrace_ops traces, are decremented. When they reach zero
the functions that they represent are modified to stop calling the
mcount code.
When changes are made, the code is updated via stop_machine() with
a command passed to the function to tell it what to do. There is an
ENABLE and DISABLE command that tells the called function to enable
or disable the functions. But the ENABLE is really a misnomer as it
should just update the records, as records that have been enabled
and now have a count of zero should be disabled.
The DISABLE command is used to disable all functions regardless of
their counter values. This is the big off switch and is not the
complement of the ENABLE command.
To make matters worse, when a ftrace_ops is unregistered and there
is another ftrace_ops registered, neither the DISABLE nor the
ENABLE command are set when calling into the stop_machine() function
and the records will not be updated to match their counter. A command
is passed to that function that will update the mcount code to call
the registered callback directly if it is the only one left. This
means that the ftrace_ops that is still registered will have its callback
called by all functions that have been set for it as well as the ftrace_ops
that was just unregistered.
Here's a way to trigger this bug. Compile the kernel with
CONFIG_FUNCTION_PROFILER set and with CONFIG_FUNCTION_GRAPH not set:
CONFIG_FUNCTION_PROFILER=y
# CONFIG_FUNCTION_GRAPH is not set
This will force the function profiler to use the function tracer instead
of the function graph tracer.
# cd /sys/kernel/debug/tracing
# echo schedule > set_ftrace_filter
# echo function > current_tracer
# cat set_ftrace_filter
schedule
# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 692/68108025 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
kworker/0:2-909 [000] .... 531.235574: schedule <-worker_thread
<idle>-0 [001] .N.. 531.235575: schedule <-cpu_idle
kworker/0:2-909 [000] .... 531.235597: schedule <-worker_thread
sshd-2563 [001] .... 531.235647: schedule <-schedule_hrtimeout_range_clock
# echo 1 > function_profile_enabled
# echo 0 > function_porfile_enabled
# cat set_ftrace_filter
schedule
# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 159701/118821262 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [002] ...1 604.870655: local_touch_nmi <-cpu_idle
<idle>-0 [002] d..1 604.870655: enter_idle <-cpu_idle
<idle>-0 [002] d..1 604.870656: atomic_notifier_call_chain <-enter_idle
<idle>-0 [002] d..1 604.870656: __atomic_notifier_call_chain <-atomic_notifier_call_chain
The same problem could have happened with the trace_probe_ops,
but they are modified with the set_frace_filter file which does the
update at closure of the file.
The simple solution is to change ENABLE to UPDATE and call it every
time an ftrace_ops is unregistered.
Link: http://lkml.kernel.org/r/1323105776-26961-3-git-send-email-jolsa@redhat.com
Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add an EXPORT_SYMBOL_GPL() so that rcutorture can dump the trace buffer
upon detection of an RCU error.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
If the set_ftrace_filter is cleared by writing just whitespace to
it, then the filter hash refcounts will be decremented but not
updated. This causes two bugs:
1) No functions will be enabled for tracing when they all should be
2) If the users clears the set_ftrace_filter twice, it will crash ftrace:
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/kernel/trace/ftrace.c:1384 __ftrace_hash_rec_update.part.27+0x157/0x1a7()
Modules linked in:
Pid: 2330, comm: bash Not tainted 3.1.0-test+ #32
Call Trace:
[<ffffffff81051828>] warn_slowpath_common+0x83/0x9b
[<ffffffff8105185a>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810ba362>] __ftrace_hash_rec_update.part.27+0x157/0x1a7
[<ffffffff810ba6e8>] ? ftrace_regex_release+0xa7/0x10f
[<ffffffff8111bdfe>] ? kfree+0xe5/0x115
[<ffffffff810ba51e>] ftrace_hash_move+0x2e/0x151
[<ffffffff810ba6fb>] ftrace_regex_release+0xba/0x10f
[<ffffffff8112e49a>] fput+0xfd/0x1c2
[<ffffffff8112b54c>] filp_close+0x6d/0x78
[<ffffffff8113a92d>] sys_dup3+0x197/0x1c1
[<ffffffff8113a9a6>] sys_dup2+0x4f/0x54
[<ffffffff8150cac2>] system_call_fastpath+0x16/0x1b
---[ end trace 77a3a7ee73794a02 ]---
Link: http://lkml.kernel.org/r/20111101141420.GA4918@debian
Reported-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A forced undef of a config value was used for testing and was
accidently left in during the final commit. This causes x86 to
run slower than needed while running function tracing as well
as causes the function graph selftest to fail when DYNMAIC_FTRACE
is not set. This is because the code in MCOUNT expects the ftrace
code to be processed with the config value set that happened to
be forced not set.
The forced config option was left in by:
commit 6331c28c96
ftrace: Fix dynamic selftest failure on some archs
Link: http://lkml.kernel.org/r/20111102150255.GA6973@debian
Cc: stable@vger.kernel.org
Reported-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Though not all events have field 'prev_pid', it was allowed to do this:
# echo 'prev_pid == 100' > events/sched/filter
but commit 75b8e98263 (tracing/filter: Swap
entire filter of events) broke it without any reason.
Link: http://lkml.kernel.org/r/4EAF46CF.8040408@cn.fujitsu.com
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix a bug introduced by e9dbfae5, which prevents event_subsystem from
ever being released.
Ref_count was added to keep track of subsystem users, not for counting
events. Subsystem is created with ref_count = 1, so there is no need to
increment it for every event, we have nr_events for that. Fix this by
touching ref_count only when we actually have a new user -
subsystem_open().
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Link: http://lkml.kernel.org/r/1320052062-7846-1-git-send-email-idryomov@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_event_call->filter is sched RCU protected but didn't use
rcu_assign_pointer(). Use it.
TODO: Add proper __rcu annotation to call->filter and all its users.
-v2: Use RCU_INIT_POINTER() for %NULL clearing as suggested by Eric.
Link: http://lkml.kernel.org/r/20111123164949.GA29639@google.com
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: stable@kernel.org # (2.6.39+)
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Knowing the number of event entries in the ring buffer compared
to the total number that were written is useful information. The
latency format gives this information and there's no reason that the
default format does not.
This information is now added to the default header, along with the
number of online CPUs:
# tracer: nop
#
# entries-in-buffer/entries-written: 159836/64690869 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] ...2 49.442971: local_touch_nmi <-cpu_idle
<idle>-0 [000] d..2 49.442973: enter_idle <-cpu_idle
<idle>-0 [000] d..2 49.442974: atomic_notifier_call_chain <-enter_idle
<idle>-0 [000] d..2 49.442976: __atomic_notifier_call_chain <-atomic_notifier
The above shows that the trace contains 159836 entries, but
64690869 were written. One could figure out that there were
64531033 entries that were dropped.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
People keep asking how to get the preempt count, irq, and need resched info
and we keep telling them to enable the latency format. Some developers think
that traces without this info is completely useless, and for a lot of tasks
it is useless.
The first option was to enable the latency trace as the default format, but
the header for the latency format is pretty useless for most tracers and
it also does the timestamp in straight microseconds from the time the trace
started. This is sometimes more difficult to read as the default trace is
seconds from the start of boot up.
Latency format:
# tracer: nop
#
# nop latency trace v1.1.5 on 3.2.0-rc1-test+
# --------------------------------------------------------------------
# latency: 0 us, #159771/64234230, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# \ / ||||| \ | /
migratio-6 0...2 41778231us+: rcu_note_context_switch <-__schedule
migratio-6 0...2 41778233us : trace_rcu_utilization <-rcu_note_context_switch
migratio-6 0...2 41778235us+: rcu_sched_qs <-rcu_note_context_switch
migratio-6 0d..2 41778236us+: rcu_preempt_qs <-rcu_note_context_switch
migratio-6 0...2 41778238us : trace_rcu_utilization <-rcu_note_context_switch
migratio-6 0...2 41778239us+: debug_lockdep_rcu_enabled <-__schedule
default format:
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
migration/0-6 [000] 50.025810: rcu_note_context_switch <-__schedule
migration/0-6 [000] 50.025812: trace_rcu_utilization <-rcu_note_context_switch
migration/0-6 [000] 50.025813: rcu_sched_qs <-rcu_note_context_switch
migration/0-6 [000] 50.025815: rcu_preempt_qs <-rcu_note_context_switch
migration/0-6 [000] 50.025817: trace_rcu_utilization <-rcu_note_context_switch
migration/0-6 [000] 50.025818: debug_lockdep_rcu_enabled <-__schedule
migration/0-6 [000] 50.025820: debug_lockdep_rcu_enabled <-__schedule
The latency format header has latency information that is pretty meaningless
for most tracers. Although some of the header is useful, and we can add that
later to the default format as well.
What is really useful with the latency format is the irqs-off, need-resched
hard/softirq context and the preempt count.
This commit adds the option irq-info which is on by default that adds this
information:
# tracer: nop
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] d..2 49.309305: cpuidle_get_driver <-cpuidle_idle_call
<idle>-0 [000] d..2 49.309307: mwait_idle <-cpu_idle
<idle>-0 [000] d..2 49.309309: need_resched <-mwait_idle
<idle>-0 [000] d..2 49.309310: test_ti_thread_flag <-need_resched
<idle>-0 [000] d..2 49.309312: trace_power_start.constprop.13 <-mwait_idle
<idle>-0 [000] d..2 49.309313: trace_cpu_idle <-mwait_idle
<idle>-0 [000] d..2 49.309315: need_resched <-mwait_idle
If a user wants the old format, they can disable the 'irq-info' option:
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [000] 49.309305: cpuidle_get_driver <-cpuidle_idle_call
<idle>-0 [000] 49.309307: mwait_idle <-cpu_idle
<idle>-0 [000] 49.309309: need_resched <-mwait_idle
<idle>-0 [000] 49.309310: test_ti_thread_flag <-need_resched
<idle>-0 [000] 49.309312: trace_power_start.constprop.13 <-mwait_idle
<idle>-0 [000] 49.309313: trace_cpu_idle <-mwait_idle
<idle>-0 [000] 49.309315: need_resched <-mwait_idle
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If the set_ftrace_filter is cleared by writing just whitespace to
it, then the filter hash refcounts will be decremented but not
updated. This causes two bugs:
1) No functions will be enabled for tracing when they all should be
2) If the users clears the set_ftrace_filter twice, it will crash ftrace:
------------[ cut here ]------------
WARNING: at /home/rostedt/work/git/linux-trace.git/kernel/trace/ftrace.c:1384 __ftrace_hash_rec_update.part.27+0x157/0x1a7()
Modules linked in:
Pid: 2330, comm: bash Not tainted 3.1.0-test+ #32
Call Trace:
[<ffffffff81051828>] warn_slowpath_common+0x83/0x9b
[<ffffffff8105185a>] warn_slowpath_null+0x1a/0x1c
[<ffffffff810ba362>] __ftrace_hash_rec_update.part.27+0x157/0x1a7
[<ffffffff810ba6e8>] ? ftrace_regex_release+0xa7/0x10f
[<ffffffff8111bdfe>] ? kfree+0xe5/0x115
[<ffffffff810ba51e>] ftrace_hash_move+0x2e/0x151
[<ffffffff810ba6fb>] ftrace_regex_release+0xba/0x10f
[<ffffffff8112e49a>] fput+0xfd/0x1c2
[<ffffffff8112b54c>] filp_close+0x6d/0x78
[<ffffffff8113a92d>] sys_dup3+0x197/0x1c1
[<ffffffff8113a9a6>] sys_dup2+0x4f/0x54
[<ffffffff8150cac2>] system_call_fastpath+0x16/0x1b
---[ end trace 77a3a7ee73794a02 ]---
Link: http://lkml.kernel.org/r/20111101141420.GA4918@debian
Reported-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A forced undef of a config value was used for testing and was
accidently left in during the final commit. This causes x86 to
run slower than needed while running function tracing as well
as causes the function graph selftest to fail when DYNMAIC_FTRACE
is not set. This is because the code in MCOUNT expects the ftrace
code to be processed with the config value set that happened to
be forced not set.
The forced config option was left in by:
commit 6331c28c96
ftrace: Fix dynamic selftest failure on some archs
Link: http://lkml.kernel.org/r/20111102150255.GA6973@debian
Cc: stable@vger.kernel.org
Reported-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The system filter can be used to set multiple event filters that
exist within the system. But currently it displays the last filter
written that does not necessarily correspond to the filters within
the system. The system filter itself is not used to filter any events.
The system filter is just a means to set filters of the events within
it.
Because this causes an ambiguous state when the system filter reads
a filter string but the events within the system have different strings
it is best to just show a boiler plate:
### global filter ###
# Use this to set filters for multiple events.
# Only events with the given fields will be affected.
# If no events are modified, an error message will be displayed here.
If an error occurs while writing to the system filter, the system
filter will replace the boiler plate with the error message as it
currently does.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Though not all events have field 'prev_pid', it was allowed to do this:
# echo 'prev_pid == 100' > events/sched/filter
but commit 75b8e98263 (tracing/filter: Swap
entire filter of events) broke it without any reason.
Link: http://lkml.kernel.org/r/4EAF46CF.8040408@cn.fujitsu.com
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
These files were getting <linux/module.h> via an implicit non-obvious
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Fix a bug introduced by e9dbfae5, which prevents event_subsystem from
ever being released.
Ref_count was added to keep track of subsystem users, not for counting
events. Subsystem is created with ref_count = 1, so there is no need to
increment it for every event, we have nr_events for that. Fix this by
touching ref_count only when we actually have a new user -
subsystem_open().
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Link: http://lkml.kernel.org/r/1320052062-7846-1-git-send-email-idryomov@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
These files are doing things like module_put and try_module_get
so they need to call out the module.h for explicit inclusion,
rather than getting it via <linux/device.h> which we ideally want
to remove the module.h inclusion from.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (121 commits)
perf symbols: Increase symbol KSYM_NAME_LEN size
perf hists browser: Refuse 'a' hotkey on non symbolic views
perf ui browser: Use libslang to read keys
perf tools: Fix tracing info recording
perf hists browser: Elide DSO column when it is set to just one DSO, ditto for threads
perf hists: Don't consider filtered entries when calculating column widths
perf hists: Don't decay total_period for filtered entries
perf hists browser: Honour symbol_conf.show_{nr_samples,total_period}
perf hists browser: Do not exit on tab key with single event
perf annotate browser: Don't change selection line when returning from callq
perf tools: handle endianness of feature bitmap
perf tools: Add prelink suggestion to dso update message
perf script: Fix unknown feature comment
perf hists browser: Apply the dso and thread filters when merging new batches
perf hists: Move the dso and thread filters from hist_browser
perf ui browser: Honour the xterm colors
perf top tui: Give color hints just on the percentage, like on --stdio
perf ui browser: Make the colors configurable and change the defaults
perf tui: Remove unneeded call to newtCls on startup
perf hists: Don't format the percentage on hist_entry__snprintf
...
Fix up conflicts in arch/x86/kernel/kprobes.c manually.
Ingo's tree did the insane "add volatile to const array", which just
doesn't make sense ("volatile const"?). But we could remove the const
*and* make the array volatile to make doubly sure that gcc doesn't
optimize it away..
Also fix up kernel/trace/ring_buffer.c non-data-conflicts manually: the
reader_lock has been turned into a raw lock by the core locking merge,
and there was a new user of it introduced in this perf core merge. Make
sure that new use also uses the raw accessor functions.
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock()
lockdep: Comment all warnings
lib: atomic64: Change the type of local lock to raw_spinlock_t
locking, lib/atomic64: Annotate atomic64_lock::lock as raw
locking, x86, iommu: Annotate qi->q_lock as raw
locking, x86, iommu: Annotate irq_2_ir_lock as raw
locking, x86, iommu: Annotate iommu->register_lock as raw
locking, dma, ipu: Annotate bank_lock as raw
locking, ARM: Annotate low level hw locks as raw
locking, drivers/dca: Annotate dca_lock as raw
locking, powerpc: Annotate uic->lock as raw
locking, x86: mce: Annotate cmci_discover_lock as raw
locking, ACPI: Annotate c3_lock as raw
locking, oprofile: Annotate oprofilefs lock as raw
locking, video: Annotate vga console lock as raw
locking, latencytop: Annotate latency_lock as raw
locking, timer_stats: Annotate table_lock as raw
locking, rwsem: Annotate inner lock as raw
locking, semaphores: Annotate inner lock as raw
locking, sched: Annotate thread_group_cputimer as raw
...
Fix up conflicts in kernel/posix-cpu-timers.c manually: making
cputimer->cputime a raw lock conflicted with the ABBA fix in commit
bcd5cff721 ("cputimer: Cure lock inversion").
The trace_pipe_raw handler holds a cached page from the time the file
is opened to the time it is closed. The cached page is used to handle
the case of the user space buffer being smaller than what was read from
the ring buffer. The left over buffer is held in the cache so that the
next read will continue where the data left off.
After EOF is returned (no more data in the buffer), the index of
the cached page is set to zero. If a user app reads the page again
after EOF, the check in the buffer will see that the cached page
is less than page size and will return the cached page again. This
will cause reading the trace_pipe_raw again after EOF to return
duplicate data, making the output look like the time went backwards
but instead data is just repeated.
The fix is to not reset the index right after all data is read
from the cache, but to reset it after all data is read and more
data exists in the ring buffer.
Cc: stable <stable@kernel.org>
Reported-by: Jeremy Eder <jeder@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing_enabled option is deprecated.
To start/stop tracing, write to /sys/kernel/debug/tracing/tracing_on
without tracing_enabled. This patch is based on Linux 3.1.0-rc1
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Link: http://lkml.kernel.org/r/1313127022-23830-1-git-send-email-leemgs1@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When doing intense tracing, the kmalloc inside trace_marker can
introduce side effects to what is being traced.
As trace_marker() is used by userspace to inject data into the
kernel ring buffer, it needs to do so with the least amount
of intrusion to the operations of the kernel or the user space
application.
As the ring buffer is designed to write directly into the buffer
without the need to make a temporary buffer, and userspace already
went through the hassle of knowing how big the write will be,
we can simply pin the userspace pages and write the data directly
into the buffer. This improves the impact of tracing via trace_marker
tremendously!
Thanks to Peter Zijlstra and Thomas Gleixner for pointing out the
use of get_user_pages_fast() and kmap_atomic().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As the function tracer is very intrusive, lots of self checks are
performed on the tracer and if something is found to be strange
it will shut itself down keeping it from corrupting the rest of the
kernel. This shutdown may still allow functions to be traced, as the
tracing only stops new modifications from happening. Trying to stop
the function tracer itself can cause more harm as it requires code
modification.
Although a WARN_ON() is executed, a user may not notice it. To help
the user see that something isn't right with the tracing of the system
a big warning is added to the output of the tracer that lets the user
know that their data may be incomplete.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix kprobe-tracer not to delete a probe if the probe is in use.
In that case, delete operation will return -EBUSY.
This bug can cause a kernel panic if enabled probes are deleted
during perf record.
(Add some probes on functions)
sh-4.2# perf probe --del probe:\*
sh-4.2# exit
(kernel panic)
This is originally reported on the fedora bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=742383
I've also checked that this problem doesn't happen on
tracepoints when module removing because perf event
locks target module.
$ sudo ./perf record -e xfs:\* -aR sh
sh-4.2# rmmod xfs
ERROR: Module xfs is in use
sh-4.2# exit
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.203 MB perf.data (~8862 samples) ]
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/20111004104438.14591.6553.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* pm-runtime:
PM / Tracing: build rpm-traces.c only if CONFIG_PM_RUNTIME is set
PM / Runtime: Replace dev_dbg() with trace_rpm_*()
PM / Runtime: Introduce trace points for tracing rpm_* functions
PM / Runtime: Don't run callbacks under lock for power.irq_safe set
USB: Add wakeup info to debugging messages
PM / Runtime: pm_runtime_idle() can be called in atomic context
PM / Runtime: Add macro to test for runtime PM events
PM / Runtime: Add might_sleep() to runtime PM functions
Do not build kernel/trace/rpm-traces.c if CONFIG_PM_RUNTIME is not
set, which avoids a build failure.
[rjw: Added the changelog and modified the subject slightly.]
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
This patch introduces 3 trace points to prepare for tracing
rpm_idle/rpm_suspend/rpm_resume functions, so we can use these
trace points to replace the current dev_dbg().
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If irqs are disabled when preemption count reaches zero, the
preemptirqsoff tracer should not flag that as the end.
When interrupts are enabled and preemption count is not zero
the preemptirqsoff correctly continues its tracing.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When debugging tight race conditions, it can be helpful to have a
synchronized tracing method. Although in most cases the global clock
provides this functionality, if timings is not the issue, it is more
comforting to know that the order of events really happened in a precise
order.
Instead of using a clock, add a "counter" that is simply an incrementing
atomic 64bit counter that orders the events as they are perceived to
happen.
The trace_clock_counter() is added from the attempt by Peter Zijlstra
trying to convert the trace_clock_global() to it. I took Peter's counter
code and made trace_clock_counter() instead, and added it to the choice
of clocks. Just echo counter > /debug/tracing/trace_clock to activate
it.
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Requested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing locks can be taken in atomic context and therefore
cannot be preempted on -rt - annotate it.
In mainline this change documents the low level nature of
the lock - otherwise there's no functional difference. Lockdep
and Sparse checking will work as usual.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The stats file under per_cpu folder provides the number of entries,
overruns and other statistics about the CPU ring buffer. However, the
numbers do not provide any indication of how full the ring buffer is in
bytes compared to the overall size in bytes. Also, it is helpful to know
the rate at which the cpu buffer is filling up.
This patch adds an entry "bytes: " in printed stats for per_cpu ring
buffer which provides the actual bytes consumed in the ring buffer. This
field includes the number of bytes used by recorded events and the
padding bytes added when moving the tail pointer to next page.
It also adds the following time stamps:
"oldest event ts:" - the oldest timestamp in the ring buffer
"now ts:" - the timestamp at the time of reading
The field "now ts" provides a consistent time snapshot to the userspace
when being read. This is read from the same trace clock used by tracing
event timestamps.
Together, these values provide the rate at which the buffer is filling
up, from the formula:
bytes / (now_ts - oldest_event_ts)
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1313531179-9323-3-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The current file "buffer_size_kb" reports the size of per-cpu buffer and
not the overall memory allocated which could be misleading. A new file
"buffer_total_size_kb" adds up all the enabled CPU buffer sizes and
reports it. This is only a readonly entry.
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1313531179-9323-2-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The self testing for event filters does not really need preemption
disabled as there are no races at the time of testing, but the functions
it calls uses rcu_dereference_sched() which will complain if preemption
is not disabled.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding automated tests running as late_initcall. Tests are
compiled in with CONFIG_FTRACE_STARTUP_TEST option.
Adding test event "ftrace_test_filter" used to simulate
filter processing during event occurance.
String filters are compiled and tested against several
test events with different values.
Also testing that evaluation of explicit predicates is ommited
due to the lazy filter evaluation.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-11-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adding walk_pred_tree function to be used for walking throught
the filter predicates.
For each predicate the callback function is called, allowing
users to add their own functionality or customize their way
through the filter predicates.
Changing check_pred_tree function to use walk_pred_tree.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-6-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We dont need to perform lookup through the ftrace_events list,
instead we can use the 'tp_event' field.
Each perf_event contains tracepoint event field 'tp_event', which
got initialized during the tracepoint event initialization.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-5-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The field_name was used just for finding event's fields. This way we
don't need to care about field_name allocation/free.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-4-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Making the code cleaner by having one function to fully prepare
the predicate (create_pred), and another to add the predicate to
the filter (filter_add_pred).
As a benefit, this way the dry_run flag stays only inside the
replace_preds function and is not passed deeper.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-3-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Don't dynamically allocate filter_pred struct, use static memory.
This way we can get rid of the code managing the dynamic filter_pred
struct object.
The create_pred function integrates create_logical_pred function.
This way the static predicate memory is returned only from
one place.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-2-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
Revert "cfq: Remove special treatment for metadata rqs."
block: fix flush machinery for stacking drivers with differring flush flags
block: improve rq_affinity placement
blktrace: add FLUSH/FUA support
Move some REQ flags to the common bio/request area
allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
xen/blkback: Make description more obvious.
cfq-iosched: Add documentation about idling
block: Make rq_affinity = 1 work as expected
block: swim3: fix unterminated of_device_id table
block/genhd.c: remove useless cast in diskstats_show()
drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
bsg-lib: add module.h include
cfq-iosched: Reduce linked group count upon group destruction
blk-throttle: correctly determine sync bio
loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
loop: add management interface for on-demand device allocation
loop: replace linked list of allocated devices with an idr index
...
Add FLUSH/FUA support to blktrace. As FLUSH precedes WRITE and/or
FUA follows WRITE, use the same 'F' flag for both cases and
distinguish them by their (relative) position. The end results
look like (other flags might be shown also):
- WRITE: W
- WRITE_FLUSH: FW
- WRITE_FUA: WF
- WRITE_FLUSH_FUA: FWF
Note that we reuse TC_BARRIER due to lack of bit space of act_mask
so that the older versions of blktrace tools will report flush
requests as barriers from now on.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
gcc incorrectly states that the variable "fmt" is uninitialized when
CC_OPITMIZE_FOR_SIZE is set.
Instead of just blindly setting fmt to NULL, the code is cleaned up
a little to be a bit easier for humans to follow, as well as gcc
to know the variables are initialized.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
What was scheduled to be 2.6.41 is now going to be 3.1 .
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1107250929370.8080@swampdragon.chaosbits.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the address of a module-local variable can only be
solved after the target module is loaded, the symbol
fetch-argument should be updated when loading target
module.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20110627072703.6528.75042.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To support probing module init functions, kprobe-tracer allows
user to define a probe on non-existed function when it is given
with a module name. This also enables user to set a probe on
a function on a specific module, even if a same name (but different)
function is locally defined in another module.
The module name must be in the front of function name and separated
by a ':'. e.g. btrfs:btrfs_init_sysfs
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20110627072656.6528.89970.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Enabling function tracer to trace all functions, then load a module and
then disable function tracing will cause ftrace to fail.
This can also happen by enabling function tracing on the command line:
ftrace=function
and during boot up, modules are loaded, then you disable function tracing
with 'echo nop > current_tracer' you will trigger a bug in ftrace that
will shut itself down.
The reason is, the new ftrace code keeps ref counts of all ftrace_ops that
are registered for tracing. When one or more ftrace_ops are registered,
all the records that represent the functions that the ftrace_ops will
trace have a ref count incremented. If this ref count is not zero,
when the code modification runs, that function will be enabled for tracing.
If the ref count is zero, that function will be disabled from tracing.
To make sure the accounting was working, FTRACE_WARN_ON()s were added
to updating of the ref counts.
If the ref count hits its max (> 2^30 ftrace_ops added), or if
the ref count goes below zero, a FTRACE_WARN_ON() is triggered which
disables all modification of code.
Since it is common for ftrace_ops to trace all functions in the kernel,
instead of creating > 20,000 hash items for the ftrace_ops, the hash
count is just set to zero, and it represents that the ftrace_ops is
to trace all functions. This is where the issues arrise.
If you enable function tracing to trace all functions, and then add
a module, the modules function records do not get the ref count updated.
When the function tracer is disabled, all function records ref counts
are subtracted. Since the modules never had their ref counts incremented,
they go below zero and the FTRACE_WARN_ON() is triggered.
The solution to this is rather simple. When modules are loaded, and
their functions are added to the the ftrace pool, look to see if any
ftrace_ops are registered that trace all functions. And for those,
update the ref count for the module function records.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Rename probe_* to trace_probe_* for avoiding namespace
confliction. This also fixes improper names of find_probe_event()
and cleanup_all_probes() to find_trace_probe() and
release_all_trace_probes() respectively.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20110627072636.6528.60374.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the stack trace per event in ftace is only 8 frames.
This can be quite limiting and sometimes useless. Especially when
the "ignore frames" is wrong and we also use up stack frames for
the event processing itself.
Change this to be dynamic by adding a percpu buffer that we can
write a large stack frame into and then copy into the ring buffer.
For interrupts and NMIs that come in while another event is being
process, will only get to use the 8 frame stack. That should be enough
as the task that it interrupted will have the full stack frame anyway.
Requested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Archs that do not implement CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST, will
fail the dynamic ftrace selftest.
The function tracer has a quick 'off' variable that will prevent
the call back functions from being called. This variable is called
function_trace_stop. In x86, this is implemented directly in the mcount
assembly, but for other archs, an intermediate function is used called
ftrace_test_stop_func().
In dynamic ftrace, the function pointer variable ftrace_trace_function is
used to update the caller code in the mcount caller. But for archs that
do not have CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST set, it only calls
ftrace_test_stop_func() instead, which in turn calls __ftrace_trace_function.
When more than one ftrace_ops is registered, the function it calls is
ftrace_ops_list_func(), which will iterate over all registered ftrace_ops
and call the callbacks that have their hash matching.
The issue happens when two ftrace_ops are registered for different functions
and one is then unregistered. The __ftrace_trace_function is then pointed
to the remaining ftrace_ops callback function directly. This mean it will
be called for all functions that were registered to trace by both ftrace_ops
that were registered.
This is not an issue for archs with CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST,
because the update of ftrace_trace_function doesn't happen until after all
functions have been updated, and then the mcount caller is updated. But
for those archs that do use the ftrace_test_stop_func(), the update is
immediate.
The dynamic selftest fails because it hits this situation, and the
ftrace_ops that it registers fails to only trace what it was suppose to
and instead traces all other functions.
The solution is to delay the setting of __ftrace_trace_function until
after all the functions have been updated according to the registered
ftrace_ops. Also, function_trace_stop is set during the update to prevent
function tracing from calling code that is caused by the function tracer
itself.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, if set_ftrace_filter() is called when the ftrace_ops is
active, the function filters will not be updated. They will only be updated
when tracing is disabled and re-enabled.
Update the functions immediately during set_ftrace_filter().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Whenever the hash of the ftrace_ops is updated, the record counts
must be balance. This requires disabling the records that are set
in the original hash, and then enabling the records that are set
in the updated hash.
Moving the update into ftrace_hash_move() removes the bug where the
hash was updated but the records were not, which results in ftrace
triggering a warning and disabling itself because the ftrace_ops filter
is updated while the ftrace_ops was registered, and then the failure
happens when the ftrace_ops is unregistered.
The current code will not trigger this bug, but new code will.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When I mounted an NFS directory, it caused several modules to be loaded. At the
time I was running the preemptirqsoff tracer, and it showed the following
output:
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 2.6.33.9-rt30-mrg-test
# --------------------------------------------------------------------
# latency: 1177 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: modprobe-19370 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: ftrace_module_notify
# => ended at: ftrace_module_notify
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| /_--=> lock-depth
# |||||/ delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
modprobe-19370 3d.... 0us!: ftrace_process_locs <-ftrace_module_notify
modprobe-19370 3d.... 1176us : ftrace_process_locs <-ftrace_module_notify
modprobe-19370 3d.... 1178us : trace_hardirqs_on <-ftrace_module_notify
modprobe-19370 3d.... 1178us : <stack trace>
=> ftrace_process_locs
=> ftrace_module_notify
=> notifier_call_chain
=> __blocking_notifier_call_chain
=> blocking_notifier_call_chain
=> sys_init_module
=> system_call_fastpath
That's over 1ms that interrupts are disabled on a Real-Time kernel!
Looking at the cause (being the ftrace author helped), I found that the
interrupts are disabled before the code modification of mcounts into nops. The
interrupts only need to be disabled on start up around this code, not when
modules are being loaded.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If a function is set to be traced by the set_graph_function, but the
option funcgraph-irqs is zero, and the traced function happens to be
called from a interrupt, it will not be traced.
The point of funcgraph-irqs is to not trace interrupts when we are
preempted by an irq, not to not trace functions we want to trace that
happen to be *in* a irq.
Luckily the current->trace_recursion element is perfect to add a flag
to help us be able to trace functions within an interrupt even when
we are not tracing interrupts that preempt the trace.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "enable" file for the event system can be removed when a module
is unloaded and the event system only has events from that module.
As the event system nr_events count goes to zero, it may be freed
if its ref_count is also set to zero.
Like the "filter" file, the "enable" file may be opened by a task and
referenced later, after a module has been unloaded and the events for
that event system have been removed.
Although the "filter" file referenced the event system structure,
the "enable" file only references a pointer to the event system
name. Since the name is freed when the event system is removed,
it is possible that an access to the "enable" file may reference
a freed pointer.
Update the "enable" file to use the subsystem_open() routine that
the "filter" file uses, to keep a reference to the event system
structure while the "enable" file is opened.
Cc: <stable@kernel.org>
Reported-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The event system is freed when its nr_events is set to zero. This happens
when a module created an event system and then later the module is
removed. Modules may share systems, so the system is allocated when
it is created and freed when the modules are unloaded and all the
events under the system are removed (nr_events set to zero).
The problem arises when a task opened the "filter" file for the
system. If the module is unloaded and it removed the last event for
that system, the system structure is freed. If the task that opened
the filter file accesses the "filter" file after the system has
been freed, the system will access an invalid pointer.
By adding a ref_count, and using it to keep track of what
is using the event system, we can free it after all users
are finished with the event system.
Cc: <stable@kernel.org>
Reported-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix to support kernel stack trace correctly on kprobe-tracer.
Since the execution path of kprobe-based dynamic events is different
from other tracepoint-based events, normal ftrace_trace_stack() doesn't
work correctly. To fix that, this introduces ftrace_trace_stack_regs()
which traces stack via pt_regs instead of current stack register.
e.g.
# echo p schedule+4 > /sys/kernel/debug/tracing/kprobe_events
# echo 1 > /sys/kernel/debug/tracing/options/stacktrace
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
# head -n 20 /sys/kernel/debug/tracing/trace
bash-2968 [000] 10297.050245: p_schedule_4: (schedule+0x4/0x4ca)
bash-2968 [000] 10297.050247: <stack trace>
=> schedule_timeout
=> n_tty_read
=> tty_read
=> vfs_read
=> sys_read
=> system_call_fastpath
kworker/0:1-2940 [000] 10297.050265: p_schedule_4: (schedule+0x4/0x4ca)
kworker/0:1-2940 [000] 10297.050266: <stack trace>
=> worker_thread
=> kthread
=> kernel_thread_helper
sshd-1132 [000] 10297.050365: p_schedule_4: (schedule+0x4/0x4ca)
sshd-1132 [000] 10297.050365: <stack trace>
=> sysret_careful
Note: Even with this fix, the first entry will be skipped
if the probe is put on the function entry area before
the frame pointer is set up (usually, that is 4 bytes
(push %bp; mov %sp %bp) on x86), because stack unwinder
depends on the frame pointer.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Namhyung Kim <namhyung@gmail.com>
Link: http://lkml.kernel.org/r/20110608070934.17777.17116.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing ring buffer is allocated from kernel memory. While
allocating a large chunk of memory, OOM might happen which destabilizes
the system. Thus random processes might get killed during the
allocation.
This patch adds __GFP_NORETRY flag to the ring buffer allocation calls
to make it fail more gracefully if the system will not be able to
complete the allocation request.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1307491302-9236-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch replaces the code for getting an unsigned long from a
userspace buffer by a simple call to kstroul_from_user.
This makes it easier to read and less error prone.
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Link: http://lkml.kernel.org/r/1307476707-14762-1-git-send-email-peterhuewe@gmx.de
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function_graph tracer does not follow global context-info option.
Adding TRACE_ITER_CONTEXT_INFO trace_flags check to enable it.
With following commands:
# echo function_graph > ./current_tracer
# echo 0 > options/context-info
# cat trace
This is what it looked like before:
# tracer: function_graph
#
# TIME CPU DURATION FUNCTION CALLS
# | | | | | | | |
1) 0.079 us | } /* __vma_link_rb */
1) 0.056 us | copy_page_range();
1) | security_vm_enough_memory() {
...
This is what it looks like now:
# tracer: function_graph
#
} /* update_ts_time_stats */
timekeeping_max_deferment();
...
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1307113131-10045-6-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The header display of function tracer does not follow
the context-info option, so field names are displayed even
if this option is off.
Added check for TRACE_ITER_CONTEXT_INFO trace_flags.
With following commands:
# echo function > ./current_tracer
# echo 0 > options/context-info
# cat trace
This is what it looked like before:
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
add_preempt_count <-schedule
rcu_note_context_switch <-schedule
...
This is what it looks like now:
# tracer: function
#
_raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
...
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1307113131-10045-4-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Functions print_graph_overhead() and print_graph_duration() displays
data for one field - DURATION.
I merged them into single function print_graph_duration(),
and added a way to display the empty parts of the field.
This way the print_graph_irq() function can use this column to display
the IRQ signs if needed and the DURATION field details stays inside
the print_graph_duration() function.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1307113131-10045-3-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The display of absolute time and duration fields is based on the
latency field. This was added during the irqsoff/wakeup tracers
graph support changes.
It's causing confusion in what fields will be displayed for the
function_graph tracer itself. So I'm removing this depency, and
adding absolute time and duration fields to the preemptirqsoff
preemptoff irqsoff wakeup tracers.
With following commands:
# echo function_graph > ./current_tracer
# cat trace
This is what it looked like before:
# tracer: function_graph
#
# TIME CPU DURATION FUNCTION CALLS
# | | | | | | | |
0) 0.068 us | } /* page_add_file_rmap */
0) | _raw_spin_unlock() {
...
This is what it looks like now:
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) 0.068 us | } /* add_preempt_count */
0) 0.993 us | } /* vfsmount_lock_local_lock */
...
For preemptirqsoff preemptoff irqsoff wakeup tracers,
this is what it looked like before:
SNIP
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> lock-depth
# |||| /
# CPU TASK/PID ||||| DURATION FUNCTION CALLS
# | | | ||||| | | | | | |
1) <idle>-0 | d..1 0.000 us | acpi_idle_enter_simple();
...
This is what it looks like now:
SNIP
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| /
# TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
# | | | | |||| | | | | | |
19.847735 | 1) <idle>-0 | d..1 0.000 us | acpi_idle_enter_simple();
...
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1307113131-10045-2-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a trace option to disable tracing on free. When this option is
set, a write into the free_buffer file will not only shrink the
ring buffer down to zero, but it will also disable tracing.
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The proc file entry buffer_size_kb is used to set the size of tracing
buffer. The memory to expand the buffer size is kernel memory. Consider
a use case where tracing is handled by a user space utility, which acts
as a gate keeper for tracing requests. In an OOM condition, tracing is
considered a low priority task and if the utility gets killed the ring
buffer memory cannot be released back to the kernel.
This patch adds a proc file called "free_buffer" whose purpose is to
stop tracing and free up the ring buffer when it is closed.
The user space process can then set the desired size in buffer_size_kb
file and open the fd to the "free_buffer" file. Under OOM condition, if
the process gets killed, the kernel closes the file descriptor. The
release handler stops the tracing and releases the kernel memory
automatically.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Link: http://lkml.kernel.org/r/1308012717-11148-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing ring buffer is a group of per-cpu ring buffers where
allocation and logging is done on a per-cpu basis. The events that are
generated on a particular CPU are logged in the corresponding buffer.
This is to provide wait-free writes between CPUs and good NUMA node
locality while accessing the ring buffer.
However, the allocation routines consider NUMA locality only for buffer
page metadata and not for the actual buffer page. This causes the pages
to be allocated on the NUMA node local to the CPU where the allocation
routine is running at the time.
This patch fixes the problem by using a NUMA node specific allocation
routine so that the pages are allocated from a NUMA node local to the
logging CPU.
I tested with the getuid_microbench from autotest. It is a simple binary
that calls getuid() in a loop and measures the average time for the
syscall to complete. The following command was used to test:
$ getuid_microbench 1000000
Compared the numbers found on kernel with and without this patch and
found that logging latency decreases by 30-50 ns/call.
tracing with non-NUMA allocation - 569 ns/call
tracing with NUMA allocation - 512 ns/call
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1304470602-20366-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In using syscall tracing by concurrent processes, the wakeup() that is
called in the event commit function causes contention on the spin lock
of the waitqueue. I enabled sys_enter_getuid and sys_exit_getuid
tracepoints, and by running getuid_microbench from autotest in parallel
I found that the contention causes exponential latency increase in the
tracing path.
The autotest binary getuid_microbench calls getuid() in a tight loop for
the given number of iterations and measures the average time required to
complete a single invocation of syscall.
The patch schedules a delayed work after 2 ms once an event commit calls
to wake up the trace wait_queue. This removes the delay caused by
contention on spin lock in wakeup() and amortizes the wakeup() calls
scheduled over the 2 ms period.
In the following example, the script enables the sys_enter_getuid and
sys_exit_getuid tracepoints and runs the getuid_microbench in parallel
with the given number of processes. The output clearly shows the latency
increase caused by contentions.
$ ~/getuid.sh 1
1000000 calls in 0.720974253 s (720.974253 ns/call)
$ ~/getuid.sh 2
1000000 calls in 1.166457554 s (1166.457554 ns/call)
1000000 calls in 1.168933765 s (1168.933765 ns/call)
$ ~/getuid.sh 3
1000000 calls in 1.783827516 s (1783.827516 ns/call)
1000000 calls in 1.795553270 s (1795.553270 ns/call)
1000000 calls in 1.796493376 s (1796.493376 ns/call)
$ ~/getuid.sh 4
1000000 calls in 4.483041796 s (4483.041796 ns/call)
1000000 calls in 4.484165388 s (4484.165388 ns/call)
1000000 calls in 4.484850762 s (4484.850762 ns/call)
1000000 calls in 4.485643576 s (4485.643576 ns/call)
$ ~/getuid.sh 5
1000000 calls in 6.497521653 s (6497.521653 ns/call)
1000000 calls in 6.502000236 s (6502.000236 ns/call)
1000000 calls in 6.501709115 s (6501.709115 ns/call)
1000000 calls in 6.502124100 s (6502.124100 ns/call)
1000000 calls in 6.502936358 s (6502.936358 ns/call)
After the patch, the latencies scale better.
1000000 calls in 0.728720455 s (728.720455 ns/call)
1000000 calls in 0.842782857 s (842.782857 ns/call)
1000000 calls in 0.883803135 s (883.803135 ns/call)
1000000 calls in 0.902077764 s (902.077764 ns/call)
1000000 calls in 0.902838202 s (902.838202 ns/call)
1000000 calls in 0.908896885 s (908.896885 ns/call)
1000000 calls in 0.932523515 s (932.523515 ns/call)
1000000 calls in 0.958009672 s (958.009672 ns/call)
1000000 calls in 0.986188020 s (986.188020 ns/call)
1000000 calls in 0.989771102 s (989.771102 ns/call)
1000000 calls in 0.933518391 s (933.518391 ns/call)
1000000 calls in 0.958897947 s (958.897947 ns/call)
1000000 calls in 1.031038897 s (1031.038897 ns/call)
1000000 calls in 1.089516025 s (1089.516025 ns/call)
1000000 calls in 1.141998347 s (1141.998347 ns/call)
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1305059241-7629-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The fix to fix the printk_formats of modules broke the
printk_formats of trace_printks in the kernel.
The update of what to show via the seq_file was only updated
if the passed in fmt was NULL, which happens only on the first
iteration. The result was showing the first format every time
instead of iterating through the available formats.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Revert the commit that removed the disabling of interrupts around
the initial modifying of mcount callers to nops, and update the comment.
The original comment was outdated and stated that the interrupts were
being disabled to prevent kstop machine, which was required with the
old ftrace daemon, but was no longer the case.
What the comment failed to mention was that interrupts needed to be
disabled to keep interrupts from preempting the modifying of the code
and then executing the code that was partially modified.
Revert the commit and update the comment.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With gcc 4.6, the self test kprobe function:
kprobe_trace_selftest_target()
is optimized such that kallsyms does not list it. The kprobes
test uses this function to insert a probe and test it. But
it will fail the test if the function is not listed in kallsyms.
Adding a __used annotation keeps the symbol in the kallsyms table.
Suggested-by: David Daney <ddaney@caviumnetworks.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kernel/trace/ftrace.c: In function 'ftrace_regex_write.clone.15':
kernel/trace/ftrace.c:2743:6: warning: 'ret' may be used uninitialized in this
function
Signed-off-by: GuoWen Li <guowen.li.linux@gmail.com>
Link: http://lkml.kernel.org/r/201106011918.47939.guowen.li.linux@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Witold reported a reboot caused by the selftests of the dynamic function
tracer. He sent me a config and I used ktest to do a config_bisect on it
(as my config did not cause the crash). It pointed out that the problem
config was CONFIG_PROVE_RCU.
What happened was that if multiple callbacks are attached to the
function tracer, we iterate a list of callbacks. Because the list is
managed by synchronize_sched() and preempt_disable, the access to the
pointers uses rcu_dereference_raw().
When PROVE_RCU is enabled, the rcu_dereference_raw() calls some
debugging functions, which happen to be traced. The tracing of the debug
function would then call rcu_dereference_raw() which would then call the
debug function and then... well you get the idea.
I first wrote two different patches to solve this bug.
1) add a __rcu_dereference_raw() that would not do any checks.
2) add notrace to the offending debug functions.
Both of these patches worked.
Talking with Paul McKenney on IRC, he suggested to add recursion
detection instead. This seemed to be a better solution, so I decided to
implement it. As the task_struct already has a trace_recursion to detect
recursion in the ring buffer, and that has a very small number it
allows, I decided to use that same variable to add flags that can detect
the recursion inside the infrastructure of the function tracer.
I plan to change it so that the task struct bit can be checked in
mcount, but as that requires changes to all archs, I will hold that off
to the next merge window.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1306348063.1465.116.camel@gandalf.stny.rr.com
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Filesystem, like Btrfs, has some "ULL" macros, and when these macros are passed
to tracepoints'__print_symbolic(), there will be 64->32 truncate WARNINGS during
compiling on 32bit box.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/4DACE6E0.7000507@cn.fujitsu.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When dynamic ftrace is not configured, the ops->flags still needs
to have its FTRACE_OPS_FL_ENABLED bit set in ftrace_startup().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The self tests for event tracer does not check if the function
tracing was successfully activated. It needs to before it continues
the tests, otherwise the wrong errors may be reported.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The register_ftrace_function() returns an error code on failure
except if the call to ftrace_startup() fails. Add a error return to
ftrace_startup() if it fails to start, allowing register_ftrace_funtion()
to return a proper error value.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (60 commits)
sched: Fix and optimise calculation of the weight-inverse
sched: Avoid going ahead if ->cpus_allowed is not changed
sched, rt: Update rq clock when unthrottling of an otherwise idle CPU
sched: Remove unused parameters from sched_fork() and wake_up_new_task()
sched: Shorten the construction of the span cpu mask of sched domain
sched: Wrap the 'cfs_rq->nr_spread_over' field with CONFIG_SCHED_DEBUG
sched: Remove unused 'this_best_prio arg' from balance_tasks()
sched: Remove noop in alloc_rt_sched_group()
sched: Get rid of lock_depth
sched: Remove obsolete comment from scheduler_tick()
sched: Fix sched_domain iterations vs. RCU
sched: Next buddy hint on sleep and preempt path
sched: Make set_*_buddy() work on non-task entities
sched: Remove need_migrate_task()
sched: Move the second half of ttwu() to the remote cpu
sched: Restructure ttwu() some more
sched: Rename ttwu_post_activation() to ttwu_do_wakeup()
sched: Remove rq argument from ttwu_stat()
sched: Remove rq->lock from the first half of ttwu()
sched: Drop rq->lock from sched_exec()
...
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix rt_rq runtime leakage bug
Since users of the function tracer can now pick and choose which
functions they want to trace agnostically from other users of the
function tracer, we need to pass the ops struct to the ftrace_set_filter()
functions.
The functions ftrace_set_global_filter() and ftrace_set_global_notrace()
is added to keep the old filter functions which are used to modify
the generic function tracers.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that functions may be selected individually, it only makes sense
that we should allow dynamically allocated trace structures to
be traced. This will allow perf to allocate a ftrace_ops structure
at runtime and use it to pick and choose which functions that
structure will trace.
Note, a dynamically allocated ftrace_ops will always be called
indirectly instead of being called directly from the mcount in
entry.S. This is because there's no safe way to prevent mcount
from being preempted before calling the function, unless we
modify every entry.S to do so (not likely). Thus, dynamically allocated
functions will now be called by the ftrace_ops_list_func() that
loops through the ops that are allocated if there are more than
one op allocated at a time. This loop is protected with a
preempt_disable.
To determine if an ftrace_ops structure is allocated or not, a new
util function was added to the kernel/extable.c called
core_kernel_data(), which returns 1 if the address is between
_sdata and _edata.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ftrace_ops that are registered to trace functions can now be
agnostic to each other in respect to what functions they trace.
Each ops has their own hash of the functions they want to trace
and a hash to what they do not want to trace. A empty hash for
the functions they want to trace denotes all functions should
be traced that are not in the notrace hash.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a hash is modified and might be in use, we need to perform
a schedule RCU operation on it, as the hashes will soon be used
directly in the function tracer callback.
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This is a step towards each ops structure defining its own set
of functions to trace. As the current code with pid's and such
are specific to the global_ops, it is restructured to be used
with the global ops.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to allow different ops to enable different functions,
the ftrace_startup() and ftrace_shutdown() functions need the
ops parameter passed to them.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add the enabled_functions file that is used to show all the
functions that have been enabled for tracing as well as their
ref counts. This helps seeing if any function has been registered
and what functions are being traced.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every function has its own record that stores the instruction
pointer and flags for the function to be traced. There are only
two flags: enabled and free. The enabled flag states that tracing
for the function has been enabled (actively traced), and the free
flag states that the record no longer points to a function and can
be used by new functions (loaded modules).
These flags are now moved to the MSB of the flags (actually just
the top 32bits). The rest of the bits (30 bits) are now used as
a ref counter. Everytime a tracer register functions to trace,
those functions will have its counter incremented.
When tracing is enabled, to determine if a function should be traced,
the counter is examined, and if it is non-zero it is set to trace.
When a ftrace_ops is registered to trace functions, its hashes
are examined. If the ftrace_ops filter_hash count is zero, then
all functions are set to be traced, otherwise only the functions
in the hash are to be traced. The exception to this is if a function
is also in the ftrace_ops notrace_hash. Then that function's counter
is not incremented for this ftrace_ops.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When filtering, allocate a hash to insert the function records.
After the filtering is complete, assign it to the ftrace_ops structure.
This allows the ftrace_ops structure to have a much smaller array of
hash buckets instead of wasting a lot of memory.
A read only empty_hash is created to be the minimum size that any ftrace_ops
can point to.
When a new hash is created, it has the following steps:
o Allocate a default hash.
o Walk the function records assigning the filtered records to the hash
o Allocate a new hash with the appropriate size buckets
o Move the entries from the default hash to the new hash.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Combine the filter and notrace hashes to be accessed by a single entity,
the global_ops. The global_ops is a ftrace_ops structure that is passed
to different functions that can read or modify the filtering of the
function tracer.
The ftrace_ops structure was modified to hold a filter and notrace
hashes so that later patches may allow each ftrace_ops to have its own
set of rules to what functions may be filtered.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When multiple users are allowed to have their own set of functions
to trace, having the FTRACE_FL_FILTER flag will not be enough to
handle the accounting of those users. Each user will need their own
set of functions.
Replace the FTRACE_FL_FILTER with a filter_hash instead. This is
temporary until the rest of the function filtering accounting
gets in.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To prepare for the accounting system that will allow multiple users of
the function tracer, having the FTRACE_FL_NOTRACE as a flag in the
dyn_trace record does not make sense.
All ftrace_ops will soon have a hash of functions they should trace
and not trace. By making a global hash of functions not to trace makes
this easier for the transition.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This partially reverts commit e6e1e25935.
That commit changed the structure layout of the trace structure, which
in turn broke PowerTOP (1.9x generation) quite badly.
I appreciate not wanting to expose the variable in question, and
PowerTOP was not using it, so I've replaced the variable with just a
padding field - that way if in the future a new field is needed it can
just use this padding field.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code used for matching functions is almost identical between normal
selecting of functions and using the :mod: feature of set_ftrace_notrace.
Consolidate the two users into one function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are three locations that perform almost identical functions in order
to update the ftrace_trace_function (the ftrace function variable that gets
called by mcount).
Consolidate these into a single function called update_ftrace_function().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The updating of a function record is moved to a single function. This will allow
us to add specific changes in one location for both modules and kernel
functions.
Later patches will determine if the function record itself needs to be updated
(which enables the mcount caller), or just the ftrace_ops needs the update.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since we disable all function tracer processing if we detect
that a modification of a instruction had failed, we do not need
to track that the record has failed. No more ftrace processing
is allowed, and the FTRACE_FL_CONVERTED flag is pointless.
The FTRACE_FL_CONVERTED flag was used to denote records that were
successfully converted from mcount calls into nops. But if a single
record fails, all of ftrace is disabled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since we disable all function tracer processing if we detect
that a modification of a instruction had failed, we do not need
to track that the record has failed. No more ftrace processing
is allowed, and the FTRACE_FL_FAILED flag is pointless.
Removing this flag simplifies some of the code, but some ftrace_disabled
checks needed to be added or move around a little.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The failures file in the debugfs tracing directory would list the
functions that failed to convert when the old dead ftrace daemon
tried to update code but failed. Since this code is now dead along
with the daemon the failures file is useless. Remove it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The disabling of interrupts around ftrace_update_code() was used
to protect against the evil ftrace daemon from years past. But that
daemon has long been killed. It is safe to keep interrupts enabled
while updating the initial mcount into nops.
The ftrace_mutex is also held which keeps other users at bay.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Let FTRACE_WARN_ON() be used as a stand alone statement or
inside a conditional: if (FTRACE_WARN_ON(x))
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If function tracing is enabled, a read of the filter files will
cause the call to stop_machine to update the function trace sites.
It should only call stop_machine on write.
Cc: stable@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Conflicts:
include/linux/perf_event.h
Merge reason: pick up the latest jump-label enhancements, they are cooked ready.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Neil Brown pointed out that lock_depth somehow escaped the BKL
removal work. Let's get rid of it now.
Note that the perf scripting utilities still have a bunch of
code for dealing with common_lock_depth in tracepoints; I have
left that in place in case anybody wants to use that code with
older kernels.
Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20110422111910.456c0e84@bike.lwn.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's a pretty close match to what we had before - the timer triggering
would mean that nobody unplugged the plug in due time, in the new
scheme this matches very closely what the schedule() unplug now is.
It's essentially the difference between an explicit unplug (IO unplug)
or an implicit unplug (timer unplug, we scheduled with pending IO
queued).
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It was removed with the on-stack plugging, readd it and track the
depth of requests added when flushing the plug.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
running following commands:
# enable the binary option
echo 1 > ./options/bin
# disable context info option
echo 0 > ./options/context-info
# tracing only events
echo 1 > ./events/enable
cat trace_pipe
plus forcing system to generate many tracing events,
is causing lockup (in NON preemptive kernels) inside
tracing_read_pipe function.
The issue is also easily reproduced by running ltp stress test.
(ftrace_stress_test.sh)
The reasons are:
- bin/hex/raw output functions for events are set to
trace_nop_print function, which prints nothing and
returns TRACE_TYPE_HANDLED value
- LOST EVENT trace do not handle trace_seq overflow
These reasons force the while loop in tracing_read_pipe
function never to break.
The attached patch fixies handling of lost event trace, and
changes trace_nop_print to print minimal info, which is needed
for the correct tracing_read_pipe processing.
v2 changes:
- omit the cond_resched changes by trace_nop_print changes
- WARN changed to WARN_ONCE and added info to be able
to find out the culprit
v3 changes:
- make more accurate patch comment
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20110325110518.GC1922@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The file debugfs/tracing/printk_formats maps the addresses
to the formats that are used by trace_bprintk() so that userspace
tools can read the buffer and be able to decode trace_bprintk events
to get the format saved when reading the ring buffer directly.
This is because trace_bprintk() does not store the format into the
buffer, but just the address of the format, which is hidden in
the kernel memory.
But currently it only exports trace_bprintk()s from the kernel core
and not for modules. The modules need their formats exported
as well.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace_printk() formats for modules do not show up in the
debugfs/tracing/printk_formats file. Only the formats that are
for trace_printk()s that are in the kernel core.
To facilitate the change to add trace_printk() formats from modules
into that file as well, we need to convert the structure that
holds the formats from char fmt[], into const char *fmt,
and allocate them separately.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, x86: Complain louder about BIOSen corrupting CPU/PMU state and continue
perf, x86: P4 PMU - Read proper MSR register to catch unflagged overflows
perf symbols: Look at .dynsym again if .symtab not found
perf build-id: Add quirk to deal with perf.data file format breakage
perf session: Pass evsel in event_ops->sample()
perf: Better fit max unprivileged mlock pages for tools needs
perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()
perf top: Fix uninitialized 'counter' variable
tracing: Fix set_ftrace_filter probe function display
perf, x86: Fix Intel fixed counters base initialization
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
If one or more function probes (like traceon) are enabled,
and there's no other function filter, the first probe
func is skipped (which one depends on the position in the hash).
$ echo sys_open:traceon sys_close:traceon > ./set_ftrace_filter
$ cat set_ftrace_filter
#### all functions enabled ####
sys_close:traceon:unlimited
$
The reason was, that in the case of no other function filter,
the func_pos was not properly updated before calling t_hash_start.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1297874134-7008-1-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
trace, filters: Initialize the match variable in process_ops() properly
trace, documentation: Fix branch profiling location in debugfs
oprofile, s390: Cleanups
oprofile, s390: Remove hwsampler_files.c and merge it into init.c
perf: Fix tear-down of inherited group events
perf: Reorder & optimize perf_event_context to remove alignment padding on 64 bit builds
perf: Handle stopped state with tracepoints
perf: Fix the software events state check
perf, powerpc: Handle events that raise an exception without overflowing
perf, x86: Use INTEL_*_CONSTRAINT() for all PEBS event constraints
perf, x86: Clean up SandyBridge PEBS events
perf lock: Fix sorting by wait_min
perf tools: Version incorrect with some versions of grep
perf evlist: New command to list the names of events present in a perf.data file
perf script: Add support for H/W and S/W events
perf script: Add support for dumping symbols
perf script: Support custom field selection for output
perf script: Move printing of 'common' data from print_event and rename
perf tracing: Remove print_graph_cpu and print_graph_proc from trace-event-parse
perf script: Change process_event prototype
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
Make sure the 'match' variable always has a value.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The debugfs interface for branch profiling is through
/sys/kernel/debug/tracing/trace_stat/branch_annotated
/sys/kernel/debug/tracing/trace_stat/branch_all
so update the Kconfig accordingly.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <alpine.DEB.2.00.1103161716320.11407@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In blk_add_trace_rq, we only chose the minor 2 bits from
request's cmd_flags and did some check for discard.
so most of other flags(e.g, REQ_SYNC) are missing.
For example, with a sync write after blkparse we get:
8,16 1 1 0.001776503 7509 A WS 1349632 + 1024 <- (8,17) 1347584
8,16 1 2 0.001776813 7509 Q WS 1349632 + 1024 [dd]
8,16 1 3 0.001780395 7509 G WS 1349632 + 1024 [dd]
8,16 1 5 0.001783186 7509 I W 1349632 + 1024 [dd]
8,16 1 11 0.001816987 7509 D W 1349632 + 1024 [dd]
8,16 0 2 0.006218192 0 C W 1349632 + 1024 [0]
Since now we have integrated the flags of both bio and request,
it is safe to pass rq->cmd_flags directly to __blk_add_trace.
With this patch, after a sync write we get:
8,16 1 1 0.001776900 5425 A WS 1189888 + 1024 <- (8,17) 1187840
8,16 1 2 0.001777179 5425 Q WS 1189888 + 1024 [dd]
8,16 1 3 0.001780797 5425 G WS 1189888 + 1024 [dd]
8,16 1 5 0.001783402 5425 I WS 1189888 + 1024 [dd]
8,16 1 11 0.001817468 5425 D WS 1189888 + 1024 [dd]
8,16 0 2 0.005640709 0 C WS 1189888 + 1024 [0]
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If the kernel command line declares a tracer "ftrace=sometracer" and
that tracer is either not defined or is enabled after irqsoff,
then the irqs off selftest will fail with the following error:
Testing tracer irqsoff:
------------[ cut here ]------------
WARNING: at /home/rostedt/work/autotest/nobackup/linux-test.git/kernel/trace/tra
ce.c:713 update_max_tr_single+0xfa/0x11b()
Hardware name:
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.38-rc8-test #1
Call Trace:
[<c0441d9d>] ? warn_slowpath_common+0x65/0x7a
[<c049adb2>] ? update_max_tr_single+0xfa/0x11b
[<c0441dc1>] ? warn_slowpath_null+0xf/0x13
[<c049adb2>] ? update_max_tr_single+0xfa/0x11b
[<c049e454>] ? stop_critical_timing+0x154/0x204
[<c049b54b>] ? trace_selftest_startup_irqsoff+0x5b/0xc1
[<c049b54b>] ? trace_selftest_startup_irqsoff+0x5b/0xc1
[<c049b54b>] ? trace_selftest_startup_irqsoff+0x5b/0xc1
[<c049e529>] ? time_hardirqs_on+0x25/0x28
[<c0468bca>] ? trace_hardirqs_on_caller+0x18/0x12f
[<c0468cec>] ? trace_hardirqs_on+0xb/0xd
[<c049b54b>] ? trace_selftest_startup_irqsoff+0x5b/0xc1
[<c049b6b8>] ? register_tracer+0xf8/0x1a3
[<c14e93fe>] ? init_irqsoff_tracer+0xd/0x11
[<c040115e>] ? do_one_initcall+0x71/0x121
[<c14e93f1>] ? init_irqsoff_tracer+0x0/0x11
[<c14ce3a9>] ? kernel_init+0x13a/0x1b6
[<c14ce26f>] ? kernel_init+0x0/0x1b6
[<c0403842>] ? kernel_thread_helper+0x6/0x10
---[ end trace e93713a9d40cd06c ]---
.. no entries found ..FAILED!
What happens is the "ftrace=..." will expand the ring buffer to its
default size (from its minimum size) but it will not expand the
max ring buffer (the ring buffer to store maximum latencies).
When the irqsoff test runs, it will call the ring buffer swap routine
that checks if the max ring buffer is the same size as the normal
ring buffer, and will fail if it is not. This causes the test to fail.
The solution is to expand the max ring buffer before running the self
test if the max ring buffer is used by that tracer and the normal ring
buffer is expanded. The max ring buffer should be shrunk again after
the test is done to save space.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Trace events belonging to a module only exists when the module is
loaded. Well, we can use trace_set_clr_event funtion to enable some
trace event at the module init routine, so that we will not miss
something while loading then module.
So, Export the trace_set_clr_event function so that module can use it.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
LKML-Reference: <1289196312-25323-1-git-send-email-yuanhan.liu@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The "Delta way too big" warning might appear on a system with a
unstable shed clock right after the system is resumed and tracing
was enabled at time of suspend.
Since it's not realy a bug, and the unstable sched clock is working
fast and reliable otherwise, Steven suggested to keep using the
sched clock in any case and just to make note in the warning itself.
v2 changes:
- added #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20110218145219.GD2604@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Formatting change only to improve code readability. No code changes except to
introduce intermediate variables.
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291421609-14665-13-git-send-email-dhsharp@google.com>
[ Keep variable declarations and assignment separate ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291421609-14665-6-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The lock_depth field in the event headers was added as a temporary
data point for help in removing the BKL. Now that the BKL is pretty
much been removed, we can remove this field.
This in turn changes the header from 12 bytes to 8 bytes,
removing the 4 byte buffer that gcc would insert if the first field
in the data load was 8 bytes in size.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291421609-14665-3-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add an "overwrite" trace_option for ftrace to control whether the buffer should
be overwritten on overflow or not. The default remains to overwrite old events
when the buffer is full. This patch adds the option to instead discard newest
events when the buffer is full. This is useful to get a snapshot of traces just
after enabling traces. Dropping the current event is also a simpler code path.
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1291844807-15481-1-git-send-email-dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we enable trace events to trace block actions, We use
blk_fill_rwbs_rq to analyze the corresponding actions
in request's cmd_flags, but we only choose the minor 2 bits
from it, so most of other flags(e.g, REQ_SYNC) are missing.
For example, with a sync write we get:
write_test-2409 [001] 160.013869: block_rq_insert: 3,64 W 0 () 258135 + =
8 [write_test]
Since now we have integrated the flags of both bio and request,
it is safe to pass rq->cmd_flags directly to blk_fill_rwbs and
blk_fill_rwbs_rq isn't needed any more.
With this patch, after a sync write we get:
write_test-2417 [000] 226.603878: block_rq_insert: 3,64 WS 0 () 258135 +=
8 [write_test]
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This reverts commit 5e38ca8f3e.
Breaks the build of several !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
architectures.
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Message-ID: <20110217171823.GB17058@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add NULL check for avoiding NULL pointer deref.
This bug has been introduced by:
1ff511e35e: tracing/kprobes: Add bitfield type
which causes a null pointer dereference bug when kprobe-tracer
parses an argument without type.
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20110214054807.8919.69740.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
When the fuction graph tracer starts, it needs to make a special
stack for each task to save the real return values of the tasks.
All running tasks have this stack created, as well as any new
tasks.
On CPU hot plug, the new idle task will allocate a stack as well
when init_idle() is called. The problem is that cpu hotplug does
not create a new idle_task. Instead it uses the idle task that
existed when the cpu went down.
ftrace_graph_init_task() will add a new ret_stack to the task
that is given to it. Because a clone will make the task
have a stack of its parent it does not check if the task's
ret_stack is already NULL or not. When the CPU hotplug code
starts a CPU up again, it will allocate a new stack even
though one already existed for it.
The solution is to treat the idle_task specially. In fact, the
function_graph code already does, just not at init_idle().
Instead of using the ftrace_graph_init_task() for the idle task,
which that function expects the task to be a clone, have a
separate ftrace_graph_init_idle_task(). Also, we will create a
per_cpu ret_stack that is used by the idle task. When we call
ftrace_graph_init_idle_task() it will check if the idle task's
ret_stack is NULL, if it is, then it will assign it the per_cpu
ret_stack.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stable Tree <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cdrom: support devices that have check_events but not media_changed
cfq-iosched: Don't wait if queue already has requests.
blkio-throttle: Avoid calling blkiocg_lookup_group() for root group
cfq: rename a function to give it more appropriate name
cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily.
drivers/block/aoe/Makefile: replace the use of <module>-objs with <module>-y
loop: queue_lock NULL pointer derefence in blk_throtl_exit
drivers/block/Makefile: replace the use of <module>-objs with <module>-y
blktrace: Don't output messages if NOTIFY isn't set.
tracing_enabled should not be used, it is heavy weight and does not
do much in helping lower the overhead.
tracing_on should be used instead. Warn users to use tracing_on
when tracing_enabled is used as it will soon be removed from the
tracing directory.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The trace events sched_switch and sched_wakeup do the same thing
as the stand alone sched_switch tracer does. It is no longer needed.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The warning "Delta way too big" warning might appear on a system with
unstable shed clock right after the system is resumed and tracing
was enabled during the suspend.
Since it's not realy bug, and the unstable sched clock is working
fast and reliable otherwise, Steven suggested to keep using the
sched clock in any case and just to make note in the warning itself.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1296649698-6003-1-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Many system calls are unimplemented and mapped to sys_ni_syscall, but at
boot ftrace would still search through every syscall metadata entry for
a match which wouldn't be there.
This patch adds causes the search to terminate early if the system call
is not mapped.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-7-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some architectures have unusual symbol names and the generic code to
match the symbol name with the function name for the syscall metadata
will fail. For example, symbols on PPC64 start with a period and the
generic code will fail to match them.
This patch moves the match logic out into a separate function which an
arch can override by defining ARCH_HAS_SYSCALL_MATCH_SYM_NAME in
asm/ftrace.h and implementing arch_syscall_match_sym_name.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-5-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Some architectures use non-trivial system call tables and will not work
with the generic arch_syscall_addr code. For example, PowerPC64 uses a
table of twin long longs.
This patch makes the generic arch_syscall_addr weak to allow
architectures with non-trivial system call tables to override it.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-4-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the ftrace events now checking if the syscall_nr is valid upon
initialisation it should no longer be possible to register or unregister
a syscall event without a valid syscall_nr since they should not be
created. This adds a WARN_ON_ONCE in the register and unregister
functions to locate potential regressions in the future.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-3-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
FTRACE_SYSCALLS would create events for each and every system call, even
if it had failed to map the system call's name with it's number. This
resulted in a number of events being created that would not behave as
expected.
This could happen, for example, on architectures who's symbol names are
unusual and will not match the system call name. It could also happen
with system calls which were mapped to sys_ni_syscall.
This patch changes the default system call number in the metadata to -1.
If the system call name from the metadata is not successfully mapped to
a system call number during boot, than the event initialisation routine
will now return an error, preventing the event from being created.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
LKML-Reference: <1296703645-18718-2-git-send-email-imunsie@au1.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Because the filters are processed first and then activated
(added to the call), we no longer need to worry about the preds
of the filter in __alloc_preds() being used. As the filter that
is allocating preds is not activated yet.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When creating a new filter, instead of allocating the filter to the
event call first and then processing the filter, it is easier to
process a temporary filter and then just swap it with the call filter.
By doing this, it simplifies the code.
A filter is allocated and processed, when it is done, it is
swapped with the call filter, synchronize_sched() is called to make
sure all callers are done with the old filter (filters are called
with premption disabled), and then the old filter is freed.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that the filter logic does not require to save the pred results
on the stack, we can increase the max number of preds we allow.
As the preds are index by a short value, and we use the MSBs as flags
we can increase the max preds to 2^14 (16384) which should be way
more than enough.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The MAX_FILTER_PRED is only needed by the kernel/trace/*.c files.
Move it to kernel/trace/trace.h.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are many cases that a filter will contain multiple ORs or
ANDs together near the leafs. Walking up and down the tree to get
to the next compare can be a waste.
If there are several ORs or ANDs together, fold them into a single
pred and allocate an array of the conditions that they check.
This will speed up the filter by linearly walking an array
and can still break out if a short circuit condition is met.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the filter walks a tree to determine if a match is made or not,
if the tree was incorrectly created, it could cause an infinite loop.
Add a check to walk the entire tree before assigning it as a filter
to make sure the tree is correct.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The test if we should break out early for OR and AND operations
can be optimized by comparing the current result with
(pred->op == OP_OR)
That is if the result is true and the op is an OP_OR, or
if the result is false and the op is not an OP_OR (thus an OP_AND)
we can break out early in either case. Otherwise we continue
processing.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the filter_match_preds() requires a stack to push
and pop the preds to determine if the filter matches the record or not.
This has two drawbacks:
1) It requires a stack to store state information. As this is done
in fast paths we can't allocate the storage for this stack, and
we can't use a global as it must be re-entrant. The stack is stored
on the kernel stack and this greatly limits how many preds we
may allow.
2) All conditions are calculated even when a short circuit exists.
a || b will always calculate a and b even though a was determined
to be true.
Using a tree we can walk a constant structure that will save
the state as we go. The algorithm is simply:
pred = root;
do {
switch (move) {
case MOVE_DOWN:
if (OR or AND) {
pred = left;
continue;
}
if (pred == root)
break;
match = pred->fn();
pred = pred->parent;
move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT;
continue;
case MOVE_UP_FROM_LEFT:
/* Only OR or AND can be a parent */
if (match && OR || !match && AND) {
/* short circuit */
if (pred == root)
break;
pred = pred->parent;
move = left child ?
MOVE_UP_FROM_LEFT :
MOVE_UP_FROM_RIGHT;
continue;
}
pred = pred->right;
move = MOVE_DOWN;
continue;
case MOVE_UP_FROM_RIGHT:
if (pred == root)
break;
pred = pred->parent;
move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT;
continue;
}
done = 1;
} while (!done);
This way there's no strict limit to how many preds we allow
and it also will short circuit the logical operations when possible.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently we allocate an array of pointers to filter_preds, and then
allocate a separate filter_pred for each item in the array.
This adds slight overhead in the filters as it needs to derefernce
twice to get to the op condition.
Allocating the preds themselves in a single array removes a dereference
as well as helps on the cache footprint.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
By separating out the reseting of the filter->n_preds to zero from
the reallocation of preds for the filter, we can reset groups of
filters first, call synchronize_sched() just once, and then reallocate
each of the filters in the system group.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
For every filter that is made, we create predicates to hold every
operation within the filter. We have a max of 32 predicates that we
can hold. Currently, we allocate all 32 even if we only need to
use one.
Part of the reason we do this is that the filter can be used at
any moment by any event. Fortunately, the filter is only used
with preemption disabled. By reseting the count of preds used "n_preds"
to zero, then performing a synchronize_sched(), we can safely
free and reallocate a new array of preds.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The ops OR and AND act different from the other ops, as they
are the only ones to take other ops as their arguements.
These ops als change the logic of the filter_match_preds.
By removing the OR and AND fn's we can also remove the val1 and val2
that is passed to all other fn's and are unused.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The n_preds field of a file can change at anytime, and even can become
zero, just as the filter is about to be processed by an event.
In the case that is zero on entering the filter, return 1, telling
the caller the event matchs and should be trace.
Also use a variable and assign it with ACCESS_ONCE() such that the
count stays consistent within the function.
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add bitfield type for tracing arguments on kprobe-tracer. The syntax of
a bitfield type is:
b<bit-size>@<bit-offset>/<container-size>
e.g.
Accessing 2 bits-width field with 4 bits-offset in 32 bits-width data at
4 bytes offseted from the address pointed by AX register:
+4(%ax):b2@4/32
Since the width of container data depends on the arch, so I just added
the container-size at the end.
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20110204125205.9507.11363.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Since strict_strtol() accepts minus digits started with '-', it doesn't
need to invert after converting.
Cc: 2nddept-manager@sdl.hitachi.co.jp
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20110204125153.9507.49335.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently the syscall_meta structures for the syscall tracepoints are
placed in the __syscall_metadata section, and at link time, the linker
makes one large array of all these syscall metadata structures. On boot
up, this array is read (much like the initcall sections) and the syscall
data is processed.
The problem is that there is no guarantee that gcc will place complex
structures nicely together in an array format. Two structures in the
same file may be placed awkwardly, because gcc has no clue that they
are suppose to be in an array.
A hack was used previous to force the alignment to 4, to pack the
structures together. But this caused alignment issues with other
architectures (sparc).
Instead of packing the structures into an array, the structures' addresses
are now put into the __syscall_metadata section. As pointers are always the
natural alignment, gcc should always pack them tightly together
(otherwise initcall, extable, etc would also fail).
By having the pointers to the structures in the section, we can still
iterate the trace_events without causing unnecessary alignment problems
with other architectures, or depending on the current behaviour of
gcc that will likely change in the future just to tick us kernel developers
off a little more.
The __syscall_metadata section is also moved into the .init.data section
as it is now only needed at boot up.
Suggested-by: David Miller <davem@davemloft.net>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the trace_event structures are placed in the _ftrace_events
section, and at link time, the linker makes one large array of all
the trace_event structures. On boot up, this array is read (much like
the initcall sections) and the events are processed.
The problem is that there is no guarantee that gcc will place complex
structures nicely together in an array format. Two structures in the
same file may be placed awkwardly, because gcc has no clue that they
are suppose to be in an array.
A hack was used previous to force the alignment to 4, to pack the
structures together. But this caused alignment issues with other
architectures (sparc).
Instead of packing the structures into an array, the structures' addresses
are now put into the _ftrace_event section. As pointers are always the
natural alignment, gcc should always pack them tightly together
(otherwise initcall, extable, etc would also fail).
By having the pointers to the structures in the section, we can still
iterate the trace_events without causing unnecessary alignment problems
with other architectures, or depending on the current behaviour of
gcc that will likely change in the future just to tick us kernel developers
off a little more.
The _ftrace_event section is also moved into the .init.data section
as it is now only needed at boot up.
Suggested-by: David Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
During early boot, local IRQ is disabled until IRQ subsystem is
properly initialized. During this time, no one should enable
local IRQ and some operations which usually are not allowed with
IRQ disabled, e.g. operations which might sleep or require
communications with other processors, are allowed.
lockdep tracked this with early_boot_irqs_off/on() callbacks.
As other subsystems need this information too, move it to
init/main.c and make it generally available. While at it,
toggle the boolean to early_boot_irqs_disabled instead of
enabled so that it can be initialized with %false and %true
indicates the exceptional condition.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <20110120110635.GB6036@htj.dyndns.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now if we enable blktrace, cfq has too many messages output to the
trace buffer. It is fine if we don't specify any action mask.
But if I do like this:
blktrace /dev/sdb -a issue -a complete -o - | blkparse -i -
I only want to see 'D' and 'C', while with the following command
dd if=/mnt/ocfs2/test of=/dev/null bs=4k count=1 iflag=direct
I will get(with a 2.6.37 vanilla kernel):
8,16 0 0 0.000000000 0 m N cfq3805 alloced
8,16 0 0 0.000004126 0 m N cfq3805 insert_request
8,16 0 0 0.000004884 0 m N cfq3805 add_to_rr
8,16 0 0 0.000008417 0 m N cfq workload slice:300
8,16 0 0 0.000009557 0 m N cfq3805 set_active wl_prio:0 wl_type:2
8,16 0 0 0.000010640 0 m N cfq3805 fifo= (null)
8,16 0 0 0.000011193 0 m N cfq3805 dispatch_insert
8,16 0 0 0.000012221 0 m N cfq3805 dispatched a request
8,16 0 0 0.000012802 0 m N cfq3805 activate rq, drv=1
8,16 0 1 0.000013181 3805 D R 114759 + 8 [dd]
8,16 0 2 0.000164244 0 C R 114759 + 8 [0]
8,16 0 0 0.000167997 0 m N cfq3805 complete rqnoidle 0
8,16 0 0 0.000168782 0 m N cfq3805 set_slice=100
8,16 0 0 0.000169874 0 m N cfq3805 arm_idle: 8 group_idle: 0
8,16 0 0 0.000170189 0 m N cfq schedule dispatch
8,16 0 0 0.000397938 0 m N cfq3805 slice expired t=0
8,16 0 0 0.000399763 0 m N cfq3805 sl_used=1 disp=1 charge=1 iops=0 sect=8
8,16 0 0 0.000400227 0 m N cfq3805 del_from_rr
8,16 0 0 0.000400882 0 m N cfq3805 put_queue
See, there are 19 lines while I only need 2. I don't think it is
appropriate for a user.
So this patch will disable any messages if the BLK_TC_NOTIFY isn't set.
Now the output for the same command will look like:
8,16 0 1 0.000000000 4908 D R 114759 + 8 [dd]
8,16 0 2 0.000146827 0 C R 114759 + 8 [0]
Yes, it is what I want to see.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Fix a bunch of
warning: ‘inline’ is not at beginning of declaration
messages when building a 'make allyesconfig' kernel with -Wextra.
These warnings are trivial to kill, yet rather annoying when building with
-Wextra.
The more we can cut down on pointless crap like this the better (IMHO).
A previous patch to do this for a 'allnoconfig' build has already been
merged. This just takes the cleanup a little further.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: avoid pointless blocked-task warnings
rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status
rtmutex: Fix comment about why new_owner can be NULL in wake_futex_pi()
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, olpc: Add missing Kconfig dependencies
x86, mrst: Set correct APB timer IRQ affinity for secondary cpu
x86: tsc: Fix calibration refinement conditionals to avoid divide by zero
x86, ia64, acpi: Clean up x86-ism in drivers/acpi/numa.c
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timekeeping: Make local variables static
time: Rename misnamed minsec argument of clocks_calc_mult_shift()
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Remove syscall_exit_fields
tracing: Only process module tracepoints once
perf record: Add "nodelay" mode, disabled by default
perf sched: Fix list of events, dropping unsupported ':r' modifier
Revert "perf tools: Emit clearer message for sys_perf_event_open ENOENT return"
perf top: Fix annotate segv
perf evsel: Fix order of event list deletion
There is no need for syscall_exit_fields as the syscall
exit event class can already host the fields in its structure,
like most other trace events do by default. Use that
default behavior instead.
Following this scheme, we don't need anymore to override the
get_fields() callback of the syscall exit event class either.
Hence both syscall_exit_fields and syscall_get_exit_fields() can
be removed.
Also changed some indentation to keep the following under 80
characters:
".fields = LIST_HEAD_INIT(event_class_syscall_exit.fields),"
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4D301C0E.8090408@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
We normally just use the BIO_UPTODATE flag to signal 0/-EIO. If
we have more information available, we should pass that along to
the trace output.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
DEFINE_TRACE should also exist when CONFIG_EVENT_TRACING=n. Otherwise, setting
only TRACEPOINTS=y is broken.
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20101028153117.GA4051@Krystal>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While running my ftrace stress test, this showed up:
BUG: sleeping function called from invalid context at mm/mmap.c:233
...
note: cat[3293] exited with preempt_count 1
The bug was introduced by commit 91e86e560d
("tracing: Fix recursive user stack trace")
Cc: <stable@kernel.org>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4D0089AC.1020802@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Function-scope statics are discouraged because they are
easily overlooked and can cause subtle bugs/races due to
their global (non-SMP safe) nature.
Linus noticed that we did this for sched_param - at minimum
make the const.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: Message-ID: <AANLkTinotRxScOHEb0HgFgSpGPkq_6jKTv5CfvnQM=ee@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
blktrace.c block bio complete callback needs to gain a new argument to reflect
the newly added "error" tracepoint argument. This is needed to match the new
block_bio_complete TRACE_EVENT as of
commit de983a7bfcb7c020901ca6e2314cf55a4207ab5a.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Jeff Moyer <jmoyer@redhat.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
sched: Change wait_for_completion_*_timeout() to return a signed long
sched, autogroup: Fix reference leak
sched, autogroup: Fix potential access to freed memory
sched: Remove redundant CONFIG_CGROUP_SCHED ifdef
sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight
sched: Move periodic share updates to entity_tick()
printk: Use this_cpu_{read|write} api on printk_pending
sched: Make pushable_tasks CONFIG_SMP dependant
sched: Add 'autogroup' scheduling feature: automated per session task groups
sched: Fix unregister_fair_sched_group()
sched: Remove unused argument dest_cpu to migrate_task()
mutexes, sched: Introduce arch_mutex_cpu_relax()
sched: Add some clock info to sched_debug
cpu: Remove incorrect BUG_ON
cpu: Remove unused variable
sched: Fix UP build breakage
sched: Make task dump print all 15 chars of proc comm
sched: Update tg->shares after cpu.shares write
sched: Allow update_cfs_load() to update global load
sched: Implement demand based update_cfs_load()
...
Add these new power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend
The old C-state/idle accounting events:
power:power_start
power:power_end
Have now a replacement (but we are still keeping the old
tracepoints for compatibility):
power:cpu_idle
and
power:power_frequency
is replaced with:
power:cpu_frequency
power:machine_suspend is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.
the type= field got removed from both, it was never
used and the type is differed by the event type itself.
perf timechart userspace tool gets adjusted in a separate patch.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rjw@sisk.pl
LKML-Reference: <1294073445-14812-3-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
power_frequency moved to drivers/cpufreq/cpufreq.c which has
to be compiled in, no need to export it.
intel_idle can a be module though...
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Jean Pihet <j-pihet@ti.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: rjw@sisk.pl
LKML-Reference: <1294073445-14812-2-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
Fix two related problems in the event-copying loop of
ring_buffer_read_page.
The loop condition for copying events is off-by-one.
"len" is the remaining space in the caller-supplied page.
"size" is the size of the next event (or two events).
If len == size, then there is just enough space for the next event.
size was set to rb_event_ts_length, which may include the size of two
events if the first event is a time-extend, in order to assure time-
extends are kept together with the event after it. However,
rb_advance_reader always advances by one event. This would result in the
event after any time-extend being duplicated. Instead, get the size of
a single event for the memcpy, but use rb_event_ts_length for the loop
condition.
Signed-off-by: David Sharp <dhsharp@google.com>
LKML-Reference: <1293064704-8101-1-git-send-email-dhsharp@google.com>
LKML-Reference: <AANLkTin7nLrRPc9qGjdjHbeVDDWiJjAiYyb-L=gH85bx@mail.gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c
Needed to update to apply fixes for which the old branch was too
outdated.
The file_ops struct for the "trace" special file defined llseek as seq_lseek().
However, if the file was opened for writing only, seq_open() was not called,
and the seek would dereference a null pointer, file->private_data.
This patch introduces a new wrapper for seq_lseek() which checks if the file
descriptor is opened for reading first. If not, it does nothing.
Cc: <stable@kernel.org>
Signed-off-by: Slava Pestov <slavapestov@google.com>
LKML-Reference: <1290640396-24179-1-git-send-email-slavapestov@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf symbols: Remove incorrect open-coded container_of()
perf record: Handle restrictive permissions in /proc/{kallsyms,modules}
x86/kprobes: Prevent kprobes to probe on save_args()
irq_work: Drop cmpxchg() result
perf: Fix owner-list vs exit
x86, hw_nmi: Move backtrace_mask declaration under ARCH_HAS_NMI_WATCHDOG
tracing: Fix recursive user stack trace
perf,hw_breakpoint: Initialize hardware api earlier
x86: Ignore trap bits on single step exceptions
tracing: Force arch_local_irq_* notrace for paravirt
tracing: Fix module use of trace_bprintk()
Currently we have in something like the sched_switch event:
field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
When a userspace tool such as perf tries to parse this, the
TASK_COMM_LEN is meaningless. This is done because the TRACE_EVENT() macro
simply uses a #len to show the string of the length. When the length is
an enum, we get a string that means nothing for tools.
By adding a static buffer and a mutex to protect it, we can store the
string into that buffer with snprintf and show the actual number.
Now we get:
field:char prev_comm[16]; offset:12; size:16; signed:1;
Something much more useful.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This adds a new trace event internal flag that allows them to be
used in perf by non privileged users in case of task bound tracing.
This is desired for syscalls tracepoint because they don't leak
global system informations, like some other tracepoints.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jason Baron <jbaron@redhat.com>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] kprobes: Fix the return address of multiple kretprobes
[S390] kprobes: disable interrupts throughout
[S390] ftrace: build without frame pointers on s390
[S390] mm: add devmem_is_allowed() for STRICT_DEVMEM checking
[S390] vmlogrdr: purge after recording is switched off
[S390] cio: fix incorrect ccw_device_init_count
[S390] tape: add medium state notifications
[S390] fix get_user_pages_fast
The user stack trace can fault when examining the trace. Which
would call the do_page_fault handler, which would trace again,
which would do the user stack trace, which would fault and call
do_page_fault again ...
Thus this is causing a recursive bug. We need to have a recursion
detector here.
[ Resubmitted by Jiri Olsa ]
[ Eric Dumazet recommended using __this_cpu_* instead of __get_cpu_* ]
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1289390172-9730-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
at this point is:
- various checks inside the block layer.
- sanity checks in bio based drivers.
- now unused bio_empty_barrier helper.
- Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
but Xen really needs to sort out it's barrier situaton.
- setting of ordered tags in uas - dead code copied from old scsi
drivers.
- scsi different retry for barriers - it's dead and should have been
removed when flushes were converted to FS requests.
- blktrace handling of barriers - removed. Someone who knows blktrace
better should add support for REQ_FLUSH and REQ_FUA, though.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
s390 doesn't need FRAME_POINTERS in order to have a working function tracer.
We don't need frame pointers in order to get strack traces since we always
have valid backchains by using the -mkernel-backchain gcc option.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Andrew Morton pointed out almost all sched_setscheduler() callers are
using fixed parameters and can be converted to static. It reduces runtime
memory use a little.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that include/linux/kdb.h properly exports all the functions
required to dynamically add a kdb shell command, the reference to the
private kdb header can be removed.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek
* 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
BKL: introduce CONFIG_BKL.
dabusb: remove the BKL
sunrpc: remove the big kernel lock
init/main.c: remove BKL notations
blktrace: remove the big kernel lock
rtmutex-tester: make it build without BKL
dvb-core: kill the big kernel lock
dvb/bt8xx: kill the big kernel lock
tlclk: remove big kernel lock
fix rawctl compat ioctls breakage on amd64 and itanic
uml: kill big kernel lock
parisc: remove big kernel lock
cris: autoconvert trivial BKL users
alpha: kill big kernel lock
isapnp: BKL removal
s390/block: kill the big kernel lock
hpet: kill BKL, add compat_ioctl
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (278 commits)
arm: remove machine_desc.io_pg_offst and .phys_io
arm: use addruart macro to establish debug mappings
arm: return both physical and virtual addresses from addruart
arm/debug: consolidate addruart macros for CONFIG_DEBUG_ICEDCC
ARM: make struct machine_desc definition coherent with its comment
eukrea_mbimxsd-baseboard: Pass the correct GPIO to gpio_free
cpuimx27: fix compile when ULPI is selected
mach-pcm037_eet: fix compile errors
Fixing ethernet driver compilation error for i.MX31 ADS board
cpuimx51: update board support
mx5: add cpuimx51sd module and its baseboard
iomux-mx51: fix GPIO_1_xx 's IOMUX configuration
imx-esdhc: update devices registration
mx51: add resources for SD/MMC on i.MX51
iomux-mx51: fix SD1 and SD2's iomux configuration
clock-mx51: rename CLOCK1 to CLOCK_CCGR for better readability
clock-mx51: factorize clk_set_parent and clk_get_rate
eukrea_mbimxsd: add support for DVI displays
cpuimx25 & cpuimx35: fix OTG port registration in host mode
i.MX31 and i.MX35 : fix errate TLSbo65953 and ENGcm09472
...
The tracing per_cpu buffers were limited to 999 CPUs for a mear
savings in stack space of a char array. Up the array to 30 characters
which is more than enough to hold a 64 bit number.
Reported-by: Robin Holt <holt@sgi.com>
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the binding of time extends to events we no longer need to use
the macro RB_TIMESTAMPS_PER_PAGE. Remove it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's a condition to check if we should add a time extend or
not in the fast path. But this condition is racey (in the sense
that we can add a unnecessary time extend, but nothing that
can break anything). We later check if the time or event time
delta should be zero or have real data in it (not racey), making
this first check redundant.
This check may help save space once in a while, but really is
not worth the hassle to try to save some space that happens at
most 134 ms at a time.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the time between two timestamps is greater than
2^27 nanosecs (~134 ms) a time extend event is added that extends
the time difference to 59 bits (~18 years). This is due to
events only having a 27 bit field to store time.
Currently this time extend is a separate event. We add it just before
the event data that is being written to the buffer. But before
the event data is committed, the event data can also be discarded (as
with the case of filters). But because the time extend has already been
committed, it will stay in the buffer.
If lots of events are being filtered and no event is being
written, then every 134ms a time extend can be added to the buffer
without any data attached. To keep from filling the entire buffer
with time extends, a time extend will never be the first event
in a page because the page timestamp can be used. Time extends can
only fill the rest of a page with some data at the beginning.
This patch binds the time extend with the data. The difference here
is that the time extend is not committed before the data is added.
Instead, when a time extend is needed, the space reserved on
the ring buffer is the time extend + the data event size. The
time extend is added to the first part of the reserved block and
the data is added to the second. The time extend event is passed
back to the reserver, but since the reserver also uses a function
to find the data portion of the reserved block, no changes to the
ring buffer interface need to be made.
When a commit is discarded, we now remove both the time extend and
the event. With this approach no more than one time extend can
be in the buffer in a row. Data must always follow a time extend.
Thanks to Mathieu Desnoyers for suggesting this idea.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The delta between events is passed to the timestamp code by reference
and the timestamp code will reset the value. But it can be reset
from the caller. No need to pass it in by reference.
By changing the call to pass by value, lets gcc optimize the code
a bit more where it can store the delta in a register and not
worry about updating the reference.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The original code for the ring buffer had locations that modified
the timestamp and that change was used by the callers. Now,
the timestamp is not reused by the callers and there is no reason
to pass it by reference.
By changing the call to pass by value, lets gcc optimize the code
a bit more where it can store the timestamp in a register and not
worry about updating the reference.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Gcc inlines the slow path of the ring buffer write which can
hurt performance. This patch simply forces the slow path function
rb_move_tail() to always be a function.
The ring_buffer_benchmark module with reader_disabled=1 shows that
this patch changes the time to record an event from 135 ns to
132 ns. (3 ns or 2.22% improvement)
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function start_func_tracer() was incorrectly added in the
#ifdef CONFIG_FUNCTION_TRACER condition, but is still used even
when function tracing is not enabled.
The calls to register_ftrace_function() and register_ftrace_graph()
become nops (and their arguments are even ignored), thus there is
no reason to hide start_func_tracer() when function tracing is
not enabled.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
According to Jens, this code does not need the BKL at all,
it is sufficiently serialized by bd_mutex.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Even though the parent is recorded with the normal function tracing
of the latency tracers (irqsoff and wakeup), the function graph
recording is bogus.
This is due to the function graph messing with the return stack.
The latency tracers pass in as the parent CALLER_ADDR0, which
works fine for plain function tracing. But this causes bogus output
with the graph tracer:
3) <idle>-0 | d.s3. 0.000 us | return_to_handler();
3) <idle>-0 | d.s3. 0.000 us | _raw_spin_unlock_irqrestore();
3) <idle>-0 | d.s3. 0.000 us | return_to_handler();
3) <idle>-0 | d.s3. 0.000 us | trace_hardirqs_on();
The "return_to_handle()" call is the trampoline of the
function graph tracer, and is meaningless in this context.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The preempt and irqsoff tracers have three types of function tracers.
Normal function tracer, function graph entry, and function graph return.
Each of these use a complex dance to prevent recursion and whether
to trace the data or not (depending if interrupts are enabled or not).
This patch moves the duplicate code into a single routine, to
prevent future mistakes with modifying duplicate complex code.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The wakeup tracer has three types of function tracers. Normal
function tracer, function graph entry, and function graph return.
Each of these use a complex dance to prevent recursion and whether
to trace the data or not (depending on the wake_task variable).
This patch moves the duplicate code into a single routine, to
prevent future mistakes with modifying duplicate complex code.
Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add function graph support for wakeup latency tracer.
The graph output is enabled by setting the 'display-graph'
trace option.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1285243253-7372-4-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Move trace_graph_function() and print_graph_headers_flags() functions
to the trace_function_graph.c to be globaly available.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1285243253-7372-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The check_irq_entry and check_irq_return could be called
from graph event context. In such case there's no graph
private data allocated. Adding checks to handle this case.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20100924154102.GB1818@jolsa.brq.redhat.com>
[ Fixed some grammar in the comments ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
The config option used by archs to let the build system know that
the C version of the recordmcount works for said arch is currently
called HAVE_C_MCOUNT_RECORD which enables BUILD_C_RECORDMCOUNT. To
be more consistent with the name that all archs may use, it has been
renamed to HAVE_C_RECORDMCOUNT. This will be less confusing since
we are building a C recordmcount and not a mcount_record.
Suggested-by: Ingo Molnar <mingo@elte.hu>
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: linux-kbuild@vger.kernel.org
Cc: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch adds the support for the C version of recordmcount and
compile times show ~ 12% improvement.
After verifying this works, other archs can add:
HAVE_C_MCOUNT_RECORD
in its Kconfig and it will use the C version of recordmcount
instead of the perl version.
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: linux-kbuild@vger.kernel.org
Cc: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix
kernel/trace/trace_functions_graph.c: In function ‘trace_print_graph_duration’:
kernel/trace/trace_functions_graph.c:652: warning: comparison of distinct pointer types lacks a cast
when building 36-rc6 on a 32-bit due to the strict type check failing
in the min() macro.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Chase Douglas <chase.douglas@canonical.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20100929080823.GA13595@liondog.tnic>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Time stamps for the ring buffer are created by the difference between
two events. Each page of the ring buffer holds a full 64 bit timestamp.
Each event has a 27 bit delta stamp from the last event. The unit of time
is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events
happen more than 134 milliseconds apart, a time extend is inserted
to add more bits for the delta. The time extend has 59 bits, which
is good for ~18 years.
Currently the time extend is committed separately from the event.
If an event is discarded before it is committed, due to filtering,
the time extend still exists. If all events are being filtered, then
after ~134 milliseconds a new time extend will be added to the buffer.
This can only happen till the end of the page. Since each page holds
a full timestamp, there is no reason to add a time extend to the
beginning of a page. Time extends can only fill a page that has actual
data at the beginning, so there is no fear that time extends will fill
more than a page without any data.
When reading an event, a loop is made to skip over time extends
since they are only used to maintain the time stamp and are never
given to the caller. As a paranoid check to prevent the loop running
forever, with the knowledge that time extends may only fill a page,
a check is made that tests the iteration of the loop, and if the
iteration is more than the number of time extends that can fit in a page
a warning is printed and the ring buffer is disabled (all of ftrace
is also disabled with it).
There is another event type that is called a TIMESTAMP which can
hold 64 bits of data in the theoretical case that two events happen
18 years apart. This code has not been implemented, but the name
of this event exists, as well as the structure for it. The
size of a TIMESTAMP is 16 bytes, where as a time extend is only
8 bytes. The macro used to calculate how many time extends can fit on
a page used the TIMESTAMP size instead of the time extend size
cutting the amount in half.
The following test case can easily trigger the warning since we only
need to have half the page filled with time extends to trigger the
warning:
# cd /sys/kernel/debug/tracing/
# echo function > current_tracer
# echo 'common_pid < 0' > events/ftrace/function/filter
# echo > trace
# echo 1 > trace_marker
# sleep 120
# cat trace
Enabling the function tracer and then setting the filter to only trace
functions where the process id is negative (no events), then clearing
the trace buffer to ensure that we have nothing in the buffer,
then write to trace_marker to add an event to the beginning of a page,
sleep for 2 minutes (only 35 seconds is probably needed, but this
guarantees the bug), and then finally reading the trace which will
trigger the bug.
This patch fixes the typo and prevents the false positive of that warning.
Reported-by: Hans J. Koch <hjk@linutronix.de>
Tested-by: Hans J. Koch <hjk@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stable Kernel <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Initialize the workqueue data structures *before* they are registered
so that they are ready for callbacks.
Signed-off-by: Jason Baron <jbaron@redhat.com>
LKML-Reference: <e3a3383fc370ac7086625bebe89d9480d7caf372.1284733808.git.jbaron@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enums for FTRACE_ENABLE_MCOUNT and FTRACE_DISABLE_MCOUNT were
used as commands to ftrace_run_update_code(). But these commands
were used by the old nasty ftrace daemon that has long been slain.
This is a clean up patch to remove the references to these enums
and simplify the code a little.
Reported-by: Wu Zhangjin <wuzhangjin@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the function graph tracer funcgraph-irq option is zero, disable
tracing in IRQs. This makes the option have two effects.
1) When reading the trace file, do not display the functions that
happen in interrupt context (when detected)
2) [*new*] When recording a trace, skip those that are detected
to be in interrupt by the 'in_irq()' function
Note, in_irq() is updated at irq_enter() and irq_exit(). There are
still functions that are recorded by the function graph tracer that
is in interrupt context but outside the irq_enter/exit() routines.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It's handy to be able to disable the irq related output
and not to have to jump over each irq related code, when
you have no interrest in it.
The option is by default enabled, so there's no change to
current behaviour. It affects only the final output, so all
the irq related data stay in the ring buffer.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20100907145344.GC1912@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If we do:
# cd /sys/kernel/debug
# echo 'do_IRQ:traceon schedule:traceon sys_write:traceon' > \
set_ftrace_filter
# cat set_ftrace_filter
We get the following output:
#### all functions enabled ####
sys_write:traceon:unlimited
schedule:traceon:unlimited
do_IRQ:traceon:unlimited
This outputs two lists. One is the fact that all functions are
currently enabled for function tracing, the other has three probed
functions, which happen to have 'traceon' as their commands.
Currently, when reading the first list (functions enabled) the
seq_file code will receive a "NULL" from the t_next() function
causing it to exit early. This makes "read()" from userspace stop
reading the code at this boarder. Although read is allowed to do this,
some (broken) applications might consider this an end of file and
stop early.
This patch adds the start of the second list to t_next() when it
finishes the first list. It is a simple change and gives the
set_ftrace_filter file nicer reading ability.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch keeps track of the index within the elements of
set_ftrace_filter and if the position goes backwards, it nicely
resets and starts from the beginning again.
This allows for lseek and pread to work properly now.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The set_ftrace_filter uses seq_file and reads from two lists. The
pointer returned by t_next() can either be of type struct dyn_ftrace
or struct ftrace_func_probe. If there is a bug (there was one)
the wrong pointer may be used and the reference can cause an oops.
This patch makes t_next() and friends only return the iterator structure
which now has a pointer of type struct dyn_ftrace and struct
ftrace_func_probe. The t_show() can now test if the pointer is NULL or
not and if the pointer exists, it is guaranteed to be of the correct type.
Now if there's a bug, only wrong data will be shown but not an oops.
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
After the filtered functions are read, the probed functions are read
from the hash in set_ftrace_filter. When the hashed probed functions
are read, the *pos passed in is reset. Instead of modifying the pos
given to the read function, just record the pos where the filtered
functions ended and subtract from that.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: t_start: reset FTRACE_ITER_HASH in case of seek/pread
perf symbols: Fix multiple initialization of symbol system
perf: Fix CPU hotplug
perf, trace: Fix module leak
tracing/kprobe: Fix handling of C-unlike argument names
tracing/kprobes: Fix handling of argument names
perf probe: Fix handling of arguments names
perf probe: Fix return probe support
tracing/kprobe: Fix a memory leak in error case
tracing: Do not allow llseek to set_ftrace_filter
Be sure to avoid entering t_show() with FTRACE_ITER_HASH set without
having properly started the iterator to iterate the hash. This case is
degenerate and, as discovered by Robert Swiecki, can cause t_hash_show()
to misuse a pointer. This causes a NULL ptr deref with possible security
implications. Tracked as CVE-2010-3079.
Cc: Robert Swiecki <swiecki@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Replace pmu::{enable,disable,start,stop,unthrottle} with
pmu::{add,del,start,stop}, all of which take a flags argument.
The new interface extends the capability to stop a counter while
keeping it scheduled on the PMU. We replace the throttled state with
the generic stopped state.
This also allows us to efficiently stop/start counters over certain
code paths (like IRQ handlers).
It also allows scheduling a counter without it starting, allowing for
a generic frozen state (useful for rotating stopped counters).
The stopped state is implemented in two different ways, depending on
how the architecture implemented the throttled state:
1) We disable the counter:
a) the pmu has per-counter enable bits, we flip that
b) we program a NOP event, preserving the counter state
2) We store the counter state and ignore all read/overflow events
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus <paulus@samba.org>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Yanmin <yanmin_zhang@linux.intel.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Michael Cree <mcree@orcon.net.nz>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
gcc-4.6: kernel/*: Fix unused but set warnings
mutex: Fix annotations to include it in kernel-locking docbook
pid: make setpgid() system call use RCU read-side critical section
MAINTAINERS: Add RCU's public git tree
Check the argument name whether it is invalid (not C-like symbol name). This
makes event format simple.
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113912.22882.62313.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Set "argN" name for each argument automatically if it has no specified name.
Since dynamic trace event(kprobe_events) accepts special characters for its
argument, its format can show those special characters (e.g. '$', '%', '+').
However, perf can't parse those format because of the character (especially
'%') mess up the format. This sets "argX" name for those arguments if user
omitted the argument names.
E.g.
# echo 'p do_fork %ax IP=%ip $stack' > tracing/kprobe_events
# cat tracing/kprobe_events
p:kprobes/p_do_fork_0 do_fork arg1=%ax IP=%ip arg3=$stack
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113906.22882.59312.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Fix a memory leak which happens when a field name conflicts with others. In
error case, free_trace_probe() will free all arguments until nr_args, so this
increments nr_args the begining of the loop instead of the end.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113846.22882.12670.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reading the file set_ftrace_filter does three things.
1) shows whether or not filters are set for the function tracer
2) shows what functions are set for the function tracer
3) shows what triggers are set on any functions
3 is independent from 1 and 2.
The way this file currently works is that it is a state machine,
and as you read it, it may change state. But this assumption breaks
when you use lseek() on the file. The state machine gets out of sync
and the t_show() may use the wrong pointer and cause a kernel oops.
Luckily, this will only kill the app that does the lseek, but the app
dies while holding a mutex. This prevents anyone else from using the
set_ftrace_filter file (or any other function tracing file for that matter).
A real fix for this is to rewrite the code, but that is too much for
a -rc release or stable. This patch simply disables llseek on the
set_ftrace_filter() file for now, and we can do the proper fix for the
next major release.
Reported-by: Robert Swiecki <swiecki@google.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Cc: vendor-sec@lst.de
Cc: <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
No real bugs I believe, just some dead code.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: andi@firstfloor.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With a new enough GCC, ARM function tracing can be supported without the
need for frame pointers. This is essential for Thumb-2 support, since
frame pointers aren't available then.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
While discussing the strictness of the 80 character limit on the
Kernel Summit Discussion mailing list, I showed examples that I
broke that limit slightly with some algorithms. In discussing with
John Linville, what looked better, I realized that two of the
80 char breaking culprits were an identical expression.
As a clean up, this patch moves the identical expression into its
own helper function and that is used instead. As a side effect,
the offending code is now under the 80 character limit. :-)
This clean up code also changes the expression from
(A - B) - C to A - (B + C)
This makes the code look a little nicer too.
Cc: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
While we are reading trace_stat/functionX and someone just
disabled function_profile at that time, we can trigger this:
divide error: 0000 [#1] PREEMPT SMP
...
EIP is at function_stat_show+0x90/0x230
...
This fix just takes the ftrace_profile_lock and checks if
rec->counter is 0. If it's 0, we know the profile buffer
has been reset.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: stable@kernel.org
LKML-Reference: <4C723644.4040708@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Instead of hardcoding the number of contexts for the recursions
barriers, define a cpp constant to make the code more
self-explanatory.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Remove the nasty hack that marks a pointer's LSB to distinguish common
fields from event fields. Replace it with a more sane approach.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4C6A23C2.9020606@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Two new events were added that broke the current format output.
Both from the SCSI system: scsi_dispatch_cmd_done and scsi_dispatch_cmd_timeout
The reason is that their print_fmt exceeded a page size. Since the output
of the format used simple_read_from_buffer and trace_seq, it was limited
to a page size in output.
This patch converts the printing of the format of an event into seq_file,
which allows greater than a page size to be shown.
I diffed all event formats comparing the output with and without this
patch. All matched except for the above two, which showed just:
FORMAT TOO BIG
without this patch, but now properly displays the output with this patch.
v2: Remove updating *pos in seq start function.
[ Thanks to Li Zefan for pointing that out ]
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Cc: James Bottomley <James.Bottomley@suse.de>
Cc: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Secure discard is the same as discard except that all copies of the
discarded sectors (perhaps created by garbage collection) must also be
erased.
Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Kyungmin Park <kmpark@infradead.org>
Cc: Madhusudhan Chikkature <madhu.cr@ti.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Gardiner <bengardiner@nanometrics.ca>
Cc: <linux-mmc@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...
Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
* 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
do_coredump: Do not take BKL
init: Remove the BKL from startup code
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (55 commits)
workqueue: mark init_workqueues() as early_initcall()
workqueue: explain for_each_*cwq_cpu() iterators
fscache: fix build on !CONFIG_SYSCTL
slow-work: kill it
gfs2: use workqueue instead of slow-work
drm: use workqueue instead of slow-work
cifs: use workqueue instead of slow-work
fscache: drop references to slow-work
fscache: convert operation to use workqueue instead of slow-work
fscache: convert object to use workqueue instead of slow-work
workqueue: fix how cpu number is stored in work->data
workqueue: fix mayday_mask handling on UP
workqueue: fix build problem on !CONFIG_SMP
workqueue: fix locking in retry path of maybe_create_worker()
async: use workqueue for worker pool
workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead
workqueue: implement unbound workqueue
workqueue: prepare for WQ_UNBOUND implementation
libata: take advantage of cmwq and remove concurrency limitations
workqueue: fix worker management invocation without pending works
...
Fixed up conflicts in fs/cifs/* as per Tejun. Other trivial conflicts in
include/linux/workqueue.h, kernel/trace/Kconfig and kernel/workqueue.c
The blktrace driver currently needs the BKL, but
we should not need to take that in the block layer,
so just push it down into the driver itself.
It is quite likely that the BKL is not actually
required in blktrace code and could be removed
in a follow-on patch.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove all the trivial wrappers for the cmd_type and cmd_flags fields in
struct requests. This allows much easier grepping for different request
types instead of unwinding through macros.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'timers-timekeeping-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
um: Fix read_persistent_clock fallout
kgdb: Do not access xtime directly
powerpc: Clean up obsolete code relating to decrementer and timebase
powerpc: Rework VDSO gettimeofday to prevent time going backwards
clocksource: Add __clocksource_updatefreq_hz/khz methods
x86: Convert common clocksources to use clocksource_register_hz/khz
timekeeping: Make xtime and wall_to_monotonic static
hrtimer: Cleanup direct access to wall_to_monotonic
um: Convert to use read_persistent_clock
timkeeping: Fix update_vsyscall to provide wall_to_monotonic offset
powerpc: Cleanup xtime usage
powerpc: Simplify update_vsyscall
time: Kill off CONFIG_GENERIC_TIME
time: Implement timespec_add
x86: Fix vtime/file timestamp inconsistencies
Trivial conflicts in Documentation/feature-removal-schedule.txt
Much less trivial conflicts in arch/powerpc/kernel/time.c resolved as
per Thomas' earlier merge commit 47916be4e2 ("Merge branch
'powerpc.cherry-picks' into timers/clocksource")
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug
sched: No need for bootmem special cases
sched: Revert nohz_ratelimit() for now
sched: Reduce update_group_power() calls
sched: Update rq->clock for nohz balanced cpus
sched: Fix spelling of sibling
sched, cpuset: Drop __cpuexit from cpu hotplug callbacks
sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check()
sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()
sched: thread_group_cputime: Simplify, document the "alive" check
sched: Remove the obsolete exit_state/signal hacks
sched: task_tick_rt: Remove the obsolete ->signal != NULL check
sched: __sched_setscheduler: Read the RLIMIT_RTPRIO value lockless
sched: Fix comments to make them DocBook happy
sched: Fix fix_small_capacity
powerpc: Exclude arch_sd_sibiling_asym_packing() on UP
powerpc: Enable asymmetric SMT scheduling on POWER7
sched: Add asymmetric group packing option for sibling domain
sched: Fix capacity calculations for SMT4
sched: Change nohz idle load balancing logic to push model
...
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (162 commits)
tracing/kprobes: unregister_trace_probe needs to be called under mutex
perf: expose event__process function
perf events: Fix mmap offset determination
perf, powerpc: fsl_emb: Restore setting perf_sample_data.period
perf, powerpc: Convert the FSL driver to use local64_t
perf tools: Don't keep unreferenced maps when unmaps are detected
perf session: Invalidate last_match when removing threads from rb_tree
perf session: Free the ref_reloc_sym memory at the right place
x86,mmiotrace: Add support for tracing STOS instruction
perf, sched migration: Librarize task states and event headers helpers
perf, sched migration: Librarize the GUI class
perf, sched migration: Make the GUI class client agnostic
perf, sched migration: Make it vertically scrollable
perf, sched migration: Parameterize cpu height and spacing
perf, sched migration: Fix key bindings
perf, sched migration: Ignore unhandled task states
perf, sched migration: Handle ignored migrate out events
perf: New migration tool overview
tracing: Drop cpparg() macro
perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call
...
Fix up trivial conflicts in Makefile and drivers/cpufreq/cpufreq.c
With CONFIG_DEBUG_PAGEALLOC, I observed an unallocated memory access in
function_graph trace. It appears we find a small size entry in ring buffer,
but we access it as a big size entry. The access overflows the page size
and touches an unallocated page.
Cc: <stable@kernel.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <1280217994.32400.76.camel@sli10-desk.sh.intel.com>
[ Added a comment to explain the problem - SDR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In systems with more than one processor it is desirable to look at the
per cpu trace buffers.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
Add in a helper function to allow the kdb shell to dump the ftrace
buffer.
Modify trace.c to expose the capability to iterate over the ftrace
buffer in a read only capacity.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
Comment in unregister_trace_probe() says probe_lock will be held when it
gets called. However there is a case where it might called without the
probe_lock being held. Also since we are traversing the probe_list and
deleting an element from the probe_list, probe_lock should be held.
This was first pointed in uprobes traceevent review by Frederic
Weisbecker here. (http://lkml.org/lkml/2010/5/12/106)
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100630084548.GA10325@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We use synchronize_sched() to ensure a tracepoint won't be called
while/after we release the perf buffers it references.
But the tracepoint API has its own API for that:
tracepoint_synchronize_unregister(). Use it instead as it's
self-explanatory and eases maintainance.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Now that all arches have been converted over to use generic time via
clocksources or arch_gettimeoffset(), we can remove the GENERIC_TIME
config option and simplify the generic code.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-4-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We need to add one to the strlen() return because of the NULL
character. The type->name here generally comes from the kernel and I
don't think any of them come close to being MAX_TRACER_SIZE (100)
characters long so this is basically a cleanup.
Signed-off-by: Dan Carpenter <error27@gmail.com>
LKML-Reference: <20100710100644.GV19184@bicker>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Documentation/trace/ftrace.txt says
buffer_size_kb:
This sets or displays the number of kilobytes each CPU
buffer can hold. The tracer buffers are the same size
for each CPU. The displayed number is the size of the
CPU buffer and not total size of all buffers. The
trace buffers are allocated in pages (blocks of memory
that the kernel uses for allocation, usually 4 KB in size).
If the last page allocated has room for more bytes
than requested, the rest of the page will be used,
making the actual allocation bigger than requested.
( Note, the size may not be a multiple of the page size
due to buffer management overhead. )
This can only be updated when the current_tracer
is set to "nop".
But it's incorrect. currently total memory consumption is
'buffer_size_kb x CPUs x 2'.
Why two times difference is there? because ftrace implicitly allocate
the buffer for max latency too.
That makes sad result when admin want to use large buffer. (If admin
want full logging and makes detail analysis). example, If admin
have 24 CPUs machine and write 200MB to buffer_size_kb, the system
consume ~10GB memory (200MB x 24 x 2). umm.. 5GB memory waste is
usually unacceptable.
Fortunatelly, almost all users don't use max latency feature.
The max latency buffer can be disabled easily.
This patch shrink buffer size of the max latency buffer if
unnecessary.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20100701104554.DA2D.A69D9226@jp.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
__print_flags() and __print_symbolic() use percpu trace_seq:
1) Its memory is allocated at compile time, it wastes memory if we don't use tracing.
2) It is percpu data and it wastes more memory for multi-cpus system.
3) It disables preemption when it executes its core routine
"trace_seq_printf(s, "%s: ", #call);" and introduces latency.
So we move this trace_seq to struct trace_iterator.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4C078350.7090106@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reorder structure to remove 8 bytes of padding on 64 bit builds.
This shrinks the size to 128 bytes so allowing allocation from a smaller
slab & needed one fewer cache lines.
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
LKML-Reference: <1269516456.2054.8.camel@localhost>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We found that even enabling a single trace event that will rarely be
triggered can add big overhead to context switch.
(lmbench context switch test)
-------------------------------------------------
2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
------ ------ ------ ------ ------ ------- -------
2.19 2.3 2.21 2.56 2.13 2.54 2.07
2.39 2.51 2.35 2.75 2.27 2.81 2.24
The overhead is 6% ~ 11%.
It's because when a trace event is enabled 3 tracepoints (sched_switch,
sched_wakeup, sched_wakeup_new) will be activated to map pid to cmdname.
We'd like to avoid this overhead, so add a trace option '(no)record-cmd'
to allow to disable cmdline recording.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4C2D57F4.2050204@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The default for llseek will change to no_llseek,
so the tracing debugfs files need to add explicit
.llseek assignments. Since we're dealing with regular
files from a VFS perspective, use generic_file_llseek.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: John Kacur <jkacur@redhat.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <1278538820-1392-10-git-send-email-arnd@arndb.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Special traces type was only used by sysprof. Lets remove it now
that sysprof ftrace plugin has been dropped.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Soeren Sandmann <sandmann@daimi.au.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
The sysprof ftrace plugin doesn't seem to be seriously used
somewhere. There is a branch in the sysprof tree that makes
an interface to it, but the real sysprof tool uses either its
own module or perf events.
Drop the sysprof ftrace plugin then, as it's mostly useless.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Soeren Sandmann <sandmann@daimi.au.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
The ksym (breakpoint) ftrace plugin has been superseded by perf
tools that are much more poweful to use the cpu breakpoints.
This tracer doesn't bring more feature. It has been deprecated
for a while now, lets remove it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
I have shown by code review that no driver takes
the BKL at init time any more, so whatever the
init code was locking against is no longer there
and it is now safe to remove the BKL there.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Steven Rostedt <rostedt@goodmis>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Support string type tracing and printing in kprobe-tracer.
This allows user to trace string data in kernel including __user data. Note
that sometimes __user data may not be accessed if it is paged-out (sorry, but
kprobes operation should be done in atomic, we can not wait for page-in).
Commiter note: Fixed up conflicts with b7e2ece.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100519195724.2885.18788.stgit@localhost6.localdomain6>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Strip tracing code from workqueue and remove workqueue tracing. This
is temporary measure till concurrency managed workqueue is complete.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Because kprobes and syscalls need special processing to register
events, the class->reg() method was created to handle the differences.
But instead of creating a default ->reg for perf and ftrace events,
the code was scattered with:
if (class->reg)
class->reg();
else
default_reg();
This is messy and can also lead to bugs.
This patch cleans up this code and creates a default reg() entry for
the events allowing for the code to directly call the class->reg()
without the condition.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The nsecs_str string is a local variable defined as:
char nsecs_str[5];
It is possible for the snprintf call to use a size value larger than the
size of the string. This should not cause a buffer overrun as it is
written now due to the value for the string format "%03lu" can not be
larger than 1000. However, this change makes it correct. By making the
size correct we guard against potential future changes that could actually
cause a buffer overrun.
Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1276619355-18116-1-git-send-email-chase.douglas@canonical.com>
[ added 'UL' to number 8 to fix gcc warning comparing it to sizeof() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Let trace_module_add_events() and event_trace_init() call
__trace_add_event_call().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA37E9.1020106@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
raw_init callback is optional.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA37D4.7070500@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every event (or event class) has it's define_fields callback,
so the test is redundant.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA37BC.8080707@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Every event has the same common fields, so it's a big waste of
memory to have a copy of those fields for every event.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA3759.30105@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
All syscall exit events have the same fields.
The kernel size drops 2.5K:
text data bss dec hex filename
7018612 2034376 7251132 16304120 f8c7f8 vmlinux.o.orig
7018612 2031888 7251132 16301632 f8be40 vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BFA3746.8070100@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With the addition of the code to shrink the kernel tracepoint
infrastructure, we lost kprobes being traced by perf. The reason
is that I tested if the "tp_event->class->perf_probe" existed before
enabling it. This prevents "ftrace only" events (like the function
trace events) from being enabled by perf.
Unfortunately, kprobe events do not use perf_probe. This causes
kprobes to be missed by perf. To fix this, we add the test to
see if "tp_event->class->reg" exists as well as perf_probe.
Normal trace events have only "perf_probe" but no "reg" function,
and kprobes and syscalls have the "reg" but no "perf_probe".
The ftrace unique events do not have either, so this is a valid
test. If a kprobe or syscall is not to be probed by perf, the
"reg" function is called anyway, and will return a failure and
prevent perf from probing it.
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We have been resisting new ftrace plugins and removing existing
ones, and kmemtrace has been superseded by kmem trace events
and perf-kmem, so we remove it.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
[ remove kmemtrace from the makefile, handle slob too ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
For people who otherwise get to write: cpu_clock(smp_processor_id()),
there is now: local_clock().
Also, as per suggestion from Andrew, provide some documentation on
the various clock interfaces, and minimize the unsigned long long vs
u64 mess.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jens Axboe <jaxboe@fusionio.com>
LKML-Reference: <1275052414.1645.52.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The boot tracer is useless. It simply logs the initcalls
but in fact these initcalls are also logged through printk
while using the initcall_debug kernel parameter.
Nobody seem to be using it so far. Then just remove it.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Chase Douglas <chase.douglas@canonical.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <20100526105753.GA5677@cr0.nay.redhat.com>
[ remove the hooks in main.c, and the headers ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Drop this argument now that we always want to rewind only to the
state of the first caller.
It means frame pointers are not necessary anymore to reliably get
the source of an event. But this also means we need this helper
to be a macro now, as an inline function is not an option since
we need to know when to provide a default implentation.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
arch/x86/include/asm/stacktrace.h and arch/x86/kernel/dumpstack.h
declare headers of objects that deal with the same topic.
Actually most of the files that include stacktrace.h also include
dumpstack.h
Although dumpstack.h seems more reserved for internals of stack
traces, those are quite often needed to define specialized stack
trace operations. And perf event arch headers are going to need
access to such low level operations anyway. So don't continue to
bother with dumpstack.h as it's not anymore about isolated deep
internals.
v2: fix struct stack_frame definition conflict in sysprof
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Soeren Sandmann <sandmann@daimi.au.dk>
The ftrace_preempt_disable/enable functions were to address a
recursive race caused by the function tracer. The function tracer
traces all functions which makes it easily susceptible to recursion.
One area was preempt_enable(). This would call the scheduler and
the schedulre would call the function tracer and loop.
(So was it thought).
The ftrace_preempt_disable/enable was made to protect against recursion
inside the scheduler by storing the NEED_RESCHED flag. If it was
set before the ftrace_preempt_disable() it would not call schedule
on ftrace_preempt_enable(), thinking that if it was set before then
it would have already scheduled unless it was already in the scheduler.
This worked fine except in the case of SMP, where another task would set
the NEED_RESCHED flag for a task on another CPU, and then kick off an
IPI to trigger it. This could cause the NEED_RESCHED to be saved at
ftrace_preempt_disable() but the IPI to arrive in the the preempt
disabled section. The ftrace_preempt_enable() would not call the scheduler
because the flag was already set before entring the section.
This bug would cause a missed preemption check and cause lower latencies.
Investigating further, I found that the recusion caused by the function
tracer was not due to schedule(), but due to preempt_schedule(). Now
that preempt_schedule is completely annotated with notrace, the recusion
no longer is an issue.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix blktrace.c kernel-doc warnings:
Warning(kernel/trace/blktrace.c:858): No description found for parameter 'ignore'
Warning(kernel/trace/blktrace.c:890): No description found for parameter 'ignore'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100529114507.c466fc1e.randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steve spotted I forgot to do the destroy under event_mutex.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1274451913.1674.1707.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tracepoint_probe_unregister() does not synchronize against the probe
callbacks, so do that explicitly. This properly serializes the callbacks
and the free of the data used therein.
Also, use this_cpu_ptr() where possible.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1274438476.1674.1702.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf tui: Fix last use_browser problem related to .perfconfig
perf symbols: Add the build id cache to the vmlinux path
perf tui: Reset use_browser if stdout is not a tty
ring-buffer: Move zeroing out excess in page to ring buffer code
ring-buffer: Reset "real_end" when page is filled
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (61 commits)
tracing: Add __used annotation to event variable
perf, trace: Fix !x86 build bug
perf report: Support multiple events on the TUI
perf annotate: Fix up usage of the build id cache
x86/mmiotrace: Remove redundant instruction prefix checks
perf annotate: Add TUI interface
perf tui: Remove annotate from popup menu after failure
perf report: Don't start the TUI if -D is used
perf: Fix getline undeclared
perf: Optimize perf_tp_event_match()
perf: Remove more code from the fastpath
perf: Optimize the !vmalloc backed buffer
perf: Optimize perf_output_copy()
perf: Fix wakeup storm for RO mmap()s
perf-record: Share per-cpu buffers
perf-record: Remove -M
perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers
perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events
perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction
perf tui: Allow disabling the TUI on a per command basis in ~/.perfconfig
...
Currently the trace splice code zeros out the excess bytes in the page before
sending it off to userspace.
This is to make sure userspace is not getting anything it should not be
when reading the pages, because the excess data was never initialized
to zero before writing (for perfomance reasons).
But the splice code has no business in doing this work, it should be
done by the ring buffer. With the latest changes for recording lost
events, the splice code gets it wrong anyway.
Move the zeroing out of excess bytes into the ring buffer code.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The code to store the "lost events" requires knowing the real end
of the page. Since the 'commit' includes the padding at the end of
a page a "real_end" variable was used to keep track of the end not
including the padding.
If events were lost, the reader can place the count of events in
the padded area if there is enough room.
The bug this patch fixes is that when we fill the page we do not
reset the real_end variable, and if the writer had wrapped a few
times, the real_end would be incorrect.
This patch simply resets the real_end if the page was filled.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Patch b7e2ecef92 (perf, trace: Optimize tracepoints by removing
IRQ-disable from perf/tracepoint interaction) made the
unfortunate mistake of assuming the world is x86 only, correct
this.
The problem was that perf_fetch_caller_regs() did
local_save_flags() into regs->flags, and I re-used that to
remove another local_save_flags(), forgetting !x86 doesn't have
regs->flags.
Do the reverse, remove the local_save_flags() from
perf_fetch_caller_regs() and let the ftrace site do the
local_save_flags() instead.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: acme@redhat.com
Cc: efault@gmx.de
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <1274778175.5882.623.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds F_GETPIPE_SZ and F_SETPIPE_SZ fcntl() actions for
growing and shrinking the size of a pipe and adjusts pipe.c and splice.c
(and relay and network splice) usage to work with these larger (or smaller)
pipes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (182 commits)
[SCSI] aacraid: add an ifdef'd device delete case instead of taking the device offline
[SCSI] aacraid: prohibit access to array container space
[SCSI] aacraid: add support for handling ATA pass-through commands.
[SCSI] aacraid: expose physical devices for models with newer firmware
[SCSI] aacraid: respond automatically to volumes added by config tool
[SCSI] fcoe: fix fcoe module ref counting
[SCSI] libfcoe: FIP Keep-Alive messages for VPorts are sent with incorrect port_id and wwn
[SCSI] libfcoe: Fix incorrect MAC address clearing
[SCSI] fcoe: fix a circular locking issue with rtnl and sysfs mutex
[SCSI] libfc: Move the port_id into lport
[SCSI] fcoe: move link speed checking into its own routine
[SCSI] libfc: Remove extra pointer check
[SCSI] libfc: Remove unused fc_get_host_port_type
[SCSI] fcoe: fixes wrong error exit in fcoe_create
[SCSI] libfc: set seq_id for incoming sequence
[SCSI] qla2xxx: Updates to ISP82xx support.
[SCSI] qla2xxx: Optionally disable target reset.
[SCSI] qla2xxx: ensure flash operation and host reset via sg_reset are mutually exclusive
[SCSI] qla2xxx: Silence bogus warning by gcc for wrap and did.
[SCSI] qla2xxx: T10 DIF support added.
...
Avoid the swevent hash-table by using per-tracepoint
hlists.
Also, avoid conditionals on the fast path by ordering
with probe unregister so that we should never get on
the callback path without the data being there.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <20100521090710.473188012@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>