SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(),
remove it from the generic code and into the SuperH code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200505134101.248881738@linutronix.de
Force inlining and prevent instrumentation of all sorts by marking the
functions which are invoked from low level entry code with 'noinstr'.
Split the irqflags tracking into two parts. One which does the heavy
lifting while RCU is watching and the final one which can be invoked after
RCU is turned off.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.484532537@linutronix.de
trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer
core itself is safe, but the resulting tracepoints can be utilized by
e.g. BPF which is unsafe.
Provide variants which do not contain the lockdep invocation so the lockdep
and tracer invocations can be split at the call site and placed
properly. This is required because lockdep needs to be aware of the state
before switching away from RCU idle and after switching to RCU idle because
these transitions can take locks.
As these code pathes are going to be non-instrumentable the tracer can be
invoked after RCU is turned on and before the switch to RCU idle. So for
these new variants there is no need to invoke the rcuidle aware tracer
functions.
Name them so they match the lockdep counterparts.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134100.270771162@linutronix.de
Currently informational messages within block trace do not have PID
information of the process reporting the message included. With BFQ it
is sometimes useful to have the information and there's no good reason
to omit the information from the trace. So just fill in pid information
when generating note message.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the bpf verifier trace check into the new switch statement in
HEAD.
Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix sk_psock reference count leak on receive, from Xiyu Yang.
2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.
3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this,
from Maciej Żenczykowski.
4) ipv4 route redirect backoff wasn't actually enforced, from Paolo
Abeni.
5) Fix netprio cgroup v2 leak, from Zefan Li.
6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.
7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.
8) Various bpf probe handling fixes, from Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
selftests: mptcp: pm: rm the right tmp file
dpaa2-eth: properly handle buffer size restrictions
bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
bpf: Restrict bpf_probe_read{, str}() only to archs where they work
MAINTAINERS: Mark networking drivers as Maintained.
ipmr: Add lockdep expression to ipmr_for_each_table macro
ipmr: Fix RCU list debugging warning
drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
tcp: fix error recovery in tcp_zerocopy_receive()
MAINTAINERS: Add Jakub to networking drivers.
MAINTAINERS: another add of Karsten Graul for S390 networking
drivers: ipa: fix typos for ipa_smp2p structure doc
pppoe: only process PADT targeted at local interfaces
selftests/bpf: Enforce returning 0 for fentry/fexit programs
bpf: Enforce returning 0 for fentry/fexit progs
net: stmmac: fix num_por initialization
security: Fix the default value of secid_to_secctx hook
libbpf: Fix register naming in PT_REGS s390 macros
...
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
env->allow_ptr_leaks = bpf_allow_ptr_leaks();
env->bypass_spec_v1 = bpf_bypass_spec_v1();
env->bypass_spec_v4 = bpf_bypass_spec_v4();
env->bpf_capable = bpf_capable();
The first three currently equivalent to perfmon_capable(), since leaking kernel
pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.
'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.
That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
Usage of plain %s conversion specifier in bpf_trace_printk() suffers from the
very same issue as bpf_probe_read{,str}() helpers, that is, it is broken on
archs with overlapping address ranges.
While the helpers have been addressed through work in 6ae08ae3de ("bpf: Add
probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need
an option for bpf_trace_printk() as well to fix it.
Similarly as with the helpers, force users to make an explicit choice by adding
%pks and %pus specifier to bpf_trace_printk() which will then pick the corresponding
strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS.
The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended
for other objects aside strings that are probed and printed under tracing, and
reused out of other facilities like bpf_seq_printf() or BTF based type printing.
Existing behavior of %s for current users is still kept working for archs where it
is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
For archs not having this property we fall-back to pick probing under KERNEL_DS as
a sensible default.
Fixes: 8d3b7dce86 ("bpf: add support for %s specifier to bpf_trace_printk()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
with overlapping address ranges, we should really take the next step to
disable them from BPF use there.
To generally fix the situation, we've recently added new helper variants
bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
For details on them, see 6ae08ae3de ("bpf: Add probe_read_{user, kernel}
and probe_read_{user,kernel}_str helpers").
Given bpf_probe_read{,str}() have been around for ~5 years by now, there
are plenty of users at least on x86 still relying on them today, so we
cannot remove them entirely w/o breaking the BPF tracing ecosystem.
However, their use should be restricted to archs with non-overlapping
address ranges where they are working in their current form. Therefore,
move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and
have x86, arm64, arm select it (other archs supporting it can follow-up
on it as well).
For the remaining archs, they can workaround easily by relying on the
feature probe from bpftool which spills out defines that can be used out
of BPF C code to implement the drop-in replacement for old/new kernels
via: bpftool feature probe macro
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
- Fix a crash when having function tracing and function stack tracing on
the command line. The ftrace trampolines are created as executable and
read only. But the stack tracer tries to modify them with text_poke()
which expects all kernel text to still be writable at boot.
Keep the trampolines writable at boot, and convert them to read-only
with the rest of the kernel.
- A selftest was triggering in the ring buffer iterator code, that
is no longer valid with the update of keeping the ring buffer
writable while a iterator is reading. Just bail after three failed
attempts to get an event and remove the warning and disabling of the
ring buffer.
- While modifying the ring buffer code, decided to remove all the
unnecessary BUG() calls.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXr1CDhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsXcAQCoL229SBrtHsn4DUO7eAQRppUT3hNw
RuKzvQ56+1GccQEAh8VGCeg89uMSK6imrTujEl6VmOUdbgrD5R96yiKoGQw=
=vi+k
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing fixes from Steven Rostedt:
"Various tracing fixes:
- Fix a crash when having function tracing and function stack tracing
on the command line.
The ftrace trampolines are created as executable and read only. But
the stack tracer tries to modify them with text_poke() which
expects all kernel text to still be writable at boot. Keep the
trampolines writable at boot, and convert them to read-only with
the rest of the kernel.
- A selftest was triggering in the ring buffer iterator code, that is
no longer valid with the update of keeping the ring buffer writable
while a iterator is reading.
Just bail after three failed attempts to get an event and remove
the warning and disabling of the ring buffer.
- While modifying the ring buffer code, decided to remove all the
unnecessary BUG() calls"
* tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Remove all BUG() calls
ring-buffer: Don't deactivate the ring buffer on failed iterator reads
x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up
There's a lot of checks to make sure the ring buffer is working, and if an
anomaly is detected, it safely shuts itself down. But there's a few cases
that it will call BUG(), which defeats the point of being safe (it crashes
the kernel when an anomaly is found!). There's no reason for them. Switch
them all to either WARN_ON_ONCE() (when no ring buffer descriptor is present),
or to RB_WARN_ON() (when a ring buffer descriptor is present).
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If the function tracer is running and the trace file is read (which uses the
ring buffer iterator), the iterator can get in sync with the writes, and
caues it to fail to find a page with content it can read three times. This
causes a warning and deactivation of the ring buffer code.
Looking at the other cases of failure to get an event, it appears that
there's a chance that the writer could cause them too. Since the iterator is
a "best effort" to read the ring buffer if there's an active writer (the
consumer reader is made for this case "see trace_pipe"), if it fails to get
an event after three tries, simply give up and return NULL. Don't warn, nor
disable the ring buffer on this failure.
Link: https://lore.kernel.org/r/20200429090508.GG5770@shao2-debian
Reported-by: kernel test robot <lkp@intel.com>
Fixes: ff84c50cfb ("ring-buffer: Do not die if rb_iter_peek() fails more than thrice")
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Booting one of my machines, it triggered the following crash:
Kernel/User page tables isolation: enabled
ftrace: allocating 36577 entries in 143 pages
Starting tracer 'function'
BUG: unable to handle page fault for address: ffffffffa000005c
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061
Oops: 0003 [#1] PREEMPT SMP PTI
CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
RIP: 0010:text_poke_early+0x4a/0x58
Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff
0 41 57 49
RSP: 0000:ffffffff82003d38 EFLAGS: 00010046
RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005
RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c
RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0
R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0
R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0
FS: 0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0
Call Trace:
text_poke_bp+0x27/0x64
? mutex_lock+0x36/0x5d
arch_ftrace_update_trampoline+0x287/0x2d5
? ftrace_replace_code+0x14b/0x160
? ftrace_update_ftrace_func+0x65/0x6c
__register_ftrace_function+0x6d/0x81
ftrace_startup+0x23/0xc1
register_ftrace_function+0x20/0x37
func_set_flag+0x59/0x77
__set_tracer_option.isra.19+0x20/0x3e
trace_set_options+0xd6/0x13e
apply_trace_boot_options+0x44/0x6d
register_tracer+0x19e/0x1ac
early_trace_init+0x21b/0x2c9
start_kernel+0x241/0x518
? load_ucode_intel_bsp+0x21/0x52
secondary_startup_64+0xa4/0xb0
I was able to trigger it on other machines, when I added to the kernel
command line of both "ftrace=function" and "trace_options=func_stack_trace".
The cause is the "ftrace=function" would register the function tracer
and create a trampoline, and it will set it as executable and
read-only. Then the "trace_options=func_stack_trace" would then update
the same trampoline to include the stack tracer version of the function
tracer. But since the trampoline already exists, it updates it with
text_poke_bp(). The problem is that text_poke_bp() called while
system_state == SYSTEM_BOOTING, it will simply do a memcpy() and not
the page mapping, as it would think that the text is still read-write.
But in this case it is not, and we take a fault and crash.
Instead, lets keep the ftrace trampolines read-write during boot up,
and then when the kernel executable text is set to read-only, the
ftrace trampolines get set to read-only as well.
Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: stable@vger.kernel.org
Fixes: 768ae4406a ("x86/ftrace: Use text_poke()")
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Unfortunately, the last set of fixes introduced some minor bugs:
- The bootconfig apply_xbc() leak fix caused the application to return
a positive number on success, when it should have returned zero.
- The preempt_irq_delay_thread fix to make the creation code
wait for the kthread to finish to prevent it from executing after
module unload, can now cause the kthread to exit before it even
executes (preventing it to run its tests).
- The fix to the bootconfig that fixed the initrd to remove the
bootconfig from causing the kernel to panic, now prints a warning
that the bootconfig is not found, even when bootconfig is not
on the command line.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXrq2ehQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrdjAQDGNaJa7Ft13KTDTNTioKmOorOi38vF
ava4E3uBHl3StQD/anJmVq7Kk4WJFKGYemV6usbjDqy510PCFu/VQ1AbGQc=
=hJvk
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Fixes to previous fixes.
Unfortunately, the last set of fixes introduced some minor bugs:
- The bootconfig apply_xbc() leak fix caused the application to
return a positive number on success, when it should have returned
zero.
- The preempt_irq_delay_thread fix to make the creation code wait for
the kthread to finish to prevent it from executing after module
unload, can now cause the kthread to exit before it even executes
(preventing it to run its tests).
- The fix to the bootconfig that fixed the initrd to remove the
bootconfig from causing the kernel to panic, now prints a warning
that the bootconfig is not found, even when bootconfig is not on
the command line"
* tag 'trace-v5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
bootconfig: Fix to prevent warning message if no bootconfig option
tracing: Wait for preempt irq delay thread to execute
tools/bootconfig: Fix apply_xbc() to return zero on success
A bug report was posted that running the preempt irq delay module on a slow
machine, and removing it quickly could lead to the thread created by the
modlue to execute after the module is removed, and this could cause the
kernel to crash. The fix for this was to call kthread_stop() after creating
the thread to make sure it finishes before allowing the module to be
removed.
Now this caused the opposite problem on fast machines. What now happens is
the kthread_stop() can cause the kthread never to execute and the test never
to run. To fix this, add a completion and wait for the kthread to execute,
then wait for it to end.
This issue caused the ftracetest selftests to fail on the preemptirq tests.
Link: https://lore.kernel.org/r/20200510114210.15d9e4af@oasis.local.home
Cc: stable@vger.kernel.org
Fixes: d16a8c3107 ("tracing: Wait for preempt irq delay thread to finish")
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Two helpers bpf_seq_printf and bpf_seq_write, are added for
writing data to the seq_file buffer.
bpf_seq_printf supports common format string flag/width/type
fields so at least I can get identical results for
netlink and ipv6_route targets.
For bpf_seq_printf and bpf_seq_write, return value -EOVERFLOW
specifically indicates a write failure due to overflow, which
means the object will be repeated in the next bpf invocation
if object collection stays the same. Note that if the object
collection is changed, depending how collection traversal is
done, even if the object still in the collection, it may not
be visited.
For bpf_seq_printf, format %s, %p{i,I}{4,6} needs to
read kernel memory. Reading kernel memory may fail in
the following two cases:
- invalid kernel address, or
- valid kernel address but requiring a major fault
If reading kernel memory failed, the %s string will be
an empty string and %p{i,I}{4,6} will be all 0.
Not returning error to bpf program is consistent with
what bpf_trace_printk() does for now.
bpf_seq_printf may return -EBUSY meaning that internal percpu
buffer for memory copy of strings or other pointees is
not available. Bpf program can return 1 to indicate it
wants the same object to be repeated. Right now, this should not
happen on no-RT kernels since migrate_disable(), which guards
bpf prog call, calls preempt_disable().
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175914.2476661-1-yhs@fb.com
We have some rather random rules about when we accept the
"maybe-initialized" warnings, and when we don't.
For example, we consider it unreliable for gcc versions < 4.9, but also
if -O3 is enabled, or if optimizing for size. And then various kernel
config options disabled it, because they know that they trigger that
warning by confusing gcc sufficiently (ie PROFILE_ALL_BRANCHES).
And now gcc-10 seems to be introducing a lot of those warnings too, so
it falls under the same heading as 4.9 did.
At the same time, we have a very straightforward way to _enable_ that
warning when wanted: use "W=2" to enable more warnings.
So stop playing these ad-hoc games, and just disable that warning by
default, with the known and straight-forward "if you want to work on the
extra compiler warnings, use W=123".
Would it be great to have code that is always so obvious that it never
confuses the compiler whether a variable is used initialized or not?
Yes, it would. In a perfect world, the compilers would be smarter, and
our source code would be simpler.
That's currently not the world we live in, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following sparse warning:
kernel/trace/trace.c:950:6: warning: symbol 'tracing_snapshot_instance_cond'
was not declared. Should it be static?
Link: http://lkml.kernel.org/r/1587614905-48692-1-git-send-email-zou_wei@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu
areas can be complex, to say the least. Mappings may happen at boot up, and
if nothing synchronizes the page tables, those page mappings may not be
synced till they are used. This causes issues for anything that might touch
one of those mappings in the path of the page fault handler. When one of
those unmapped mappings is touched in the page fault handler, it will cause
another page fault, which in turn will cause a page fault, and leave us in
a loop of page faults.
Commit 763802b53a ("x86/mm: split vmalloc_sync_all()") split
vmalloc_sync_all() into vmalloc_sync_unmappings() and
vmalloc_sync_mappings(), as on system exit, it did not need to do a full
sync on x86_64 (although it still needed to be done on x86_32). By chance,
the vmalloc_sync_all() would synchronize the page mappings done at boot up
and prevent the per cpu area from being a problem for tracing in the page
fault handler. But when that synchronization in the exit of a task became a
nop, it caused the problem to appear.
Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home
Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Suggested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Running on a slower machine, it is possible that the preempt delay kernel
thread may still be executing if the module was immediately removed after
added, and this can cause the kernel to crash as the kernel thread might be
executing after its code has been removed.
There's no reason that the caller of the code shouldn't just wait for the
delay thread to finish, as the thread can also be created by a trigger in
the sysfs code, which also has the same issues.
Link: http://lore.kernel.org/r/5EA2B0C8.2080706@cn.fujitsu.com
Cc: stable@vger.kernel.org
Fixes: 793937236d ("lib: Add module for testing preemptoff/irqsoff latency tracers")
Reported-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reject the new event which has NULL location for kprobes.
For kprobes, user must specify at least the location.
Link: http://lkml.kernel.org/r/158779376597.6082.1411212055469099461.stgit@devnote2
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix boottime kprobe events to use API correctly for
multiple events.
For example, when we set a multiprobe kprobe events in
bootconfig like below,
ftrace.event.kprobes.myevent {
probes = "vfs_read $arg1 $arg2", "vfs_write $arg1 $arg2"
}
This cause an error;
trace_boot: Failed to add probe: p:kprobes/myevent (null) vfs_read $arg1 $arg2 vfs_write $arg1 $arg2
This shows the 1st argument becomes NULL and multiprobes
are merged to 1 probe.
Link: http://lkml.kernel.org/r/158779375766.6082.201939936008972838.stgit@devnote2
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 29a1548105 ("tracing: Change trace_boot to use kprobe_event interface")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix a typo that resulted in an unnecessary double
initialization to addr.
Link: http://lkml.kernel.org/r/158779374968.6082.2337484008464939919.stgit@devnote2
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: c7411a1a12 ("tracing/kprobe: Check whether the non-suffixed symbol is notrace")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
This commit replaces the schedule_on_each_cpu(ftrace_sync) instances
with synchronize_rcu_tasks_rude().
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
[ paulmck: Make Kconfig adjustments noted by kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.
As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On a device like a cellphone which is constantly suspending
and resuming CLOCK_MONOTONIC is not particularly useful for
keeping track of or reacting to external network events.
Instead you want to use CLOCK_BOOTTIME.
Hence add bpf_ktime_get_boot_ns() as a mirror of bpf_ktime_get_ns()
based around CLOCK_BOOTTIME instead of CLOCK_MONOTONIC.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fix the following sparse warning:
kernel/trace/tracing_map.c:286:6: warning: symbol
'tracing_map_array_clear' was not declared. Should it be static?
kernel/trace/tracing_map.c:297:6: warning: symbol
'tracing_map_array_free' was not declared. Should it be static?
kernel/trace/tracing_map.c:319:26: warning: symbol
'tracing_map_array_alloc' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20200410073312.38855-1-yanaijie@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
kernel + tools/perf:
Alexey Budankov:
- Introduce CAP_PERFMON to kernel and user space.
callchains:
Adrian Hunter:
- Allow using Intel PT to synthesize callchains for regular events.
Kan Liang:
- Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
perf script:
Andreas Gerstmayr:
- Add flamegraph.py script
BPF:
Jiri Olsa:
- Synthesize bpf_trampoline/dispatcher ksymbol events.
perf stat:
Arnaldo Carvalho de Melo:
- Honour --timeout for forked workloads.
Stephane Eranian:
- Force error in fallback on :k events, to avoid counting nothing when
the user asks for kernel events but is not allowed to.
perf bench:
Ian Rogers:
- Add event synthesis benchmark.
tools api fs:
Stephane Eranian:
- Make xxx__mountpoint() more scalable
libtraceevent:
He Zhe:
- Handle return value of asprintf.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXp2LlQAKCRCyPKLppCJ+
J95oAP0ZihVUhESv/gdeX0IDE5g6Rd2V6LNcRj+jb7gX9NlQkwD/UfS454WV1ftQ
qTwrkKPzY/5Tm2cLuVE7r7fJ6naDHgU=
=FHm4
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo-5.8-20200420' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core fixes and improvements from Arnaldo Carvalho de Melo:
kernel + tools/perf:
Alexey Budankov:
- Introduce CAP_PERFMON to kernel and user space.
callchains:
Adrian Hunter:
- Allow using Intel PT to synthesize callchains for regular events.
Kan Liang:
- Stitch LBR records from multiple samples to get deeper backtraces,
there are caveats, see the csets for details.
perf script:
Andreas Gerstmayr:
- Add flamegraph.py script
BPF:
Jiri Olsa:
- Synthesize bpf_trampoline/dispatcher ksymbol events.
perf stat:
Arnaldo Carvalho de Melo:
- Honour --timeout for forked workloads.
Stephane Eranian:
- Force error in fallback on :k events, to avoid counting nothing when
the user asks for kernel events but is not allowed to.
perf bench:
Ian Rogers:
- Add event synthesis benchmark.
tools api fs:
Stephane Eranian:
- Make xxx__mountpoint() more scalable
libtraceevent:
He Zhe:
- Handle return value of asprintf.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Open access to bpf_trace monitoring for CAP_PERFMON privileged process.
Providing the access under CAP_PERFMON capability singly, without the
rest of CAP_SYS_ADMIN credentials, excludes chances to misuse the
credentials and makes operation more secure.
CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)
For backward compatibility reasons access to bpf_trace monitoring
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure bpf_trace monitoring is discouraged with respect to
CAP_PERFMON capability.
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-man@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/c0a0ae47-8b6e-ff3e-416b-3cd1faaf71c0@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl6TbaUeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhgkH/iWpiKvosA20HJjC
rBqYeJPxQsgZTuBieWJ+MeVxbpcF7RlM4c+glyvg3QJhHwIEG58dl6LBrQbAyBAR
aFHNojr1iAYOruVCGnU3pA008YZiwUIDv/ZQ4DF8fmIU2vI2mJ6qHBv3XDl4G2uR
Nwz8Eu9AgIwZM5coomVOSmoWyFy7Vxmb7W+3t5VmKsvOWx4ib9kyQtOIkvQDEl7j
XCbWfI0xDQr6LFOm4jnCi5R/LhJ2LIqqIvHHrunbpszM8IwK797jCXz4im+dmd5Y
+km46N7a8pDqri36xXz1gdBAU3eG7Pt1NyvfjwRVTdX4GquQ2MT0GoojxbLxUP3y
3pEsQuE=
=whbL
-----END PGP SIGNATURE-----
Merge tag 'v5.7-rc1' into locking/kcsan, to resolve conflicts and refresh
Resolve these conflicts:
arch/x86/Kconfig
arch/x86/kernel/Makefile
Do a minor "evil merge" to move the KCSAN entry up a bit by a few lines
in the Kconfig to reduce the probability of future conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- The ring buffer is no longer disabled when reading the trace file.
The trace_pipe file was made to be used for live tracing and reading
as it acted like the normal producer/consumer. As the trace file
would not consume the data, the easy way of handling it was to just
disable writes to the ring buffer. This came to a surprise to the
BPF folks who complained about lost events due to reading.
This is no longer an issue. If someone wants to keep the old disabling
there's a new option "pause-on-trace" that can be set.
- New set_ftrace_notrace_pid file. PIDs in this file will not be traced
by the function tracer. Similar to set_ftrace_pid, which makes the
function tracer only trace those tasks with PIDs in the file, the
set_ftrace_notrace_pid does the reverse.
- New set_event_notrace_pid file. PIDs in this file will cause events
not to be traced if triggered by a task with a matching PID.
Similar to the set_event_pid file but will not be traced.
Note, sched_waking and sched_switch events may still be trace if
one of the tasks referenced by those events contains a PID that
is allowed to be traced.
Tracing related features:
- New bootconfig option, that is attached to the initrd file.
If bootconfig is on the command line, then the initrd file
is searched looking for a bootconfig appended at the end.
- New GPU tracepoint infrastructure to help the gfx drivers to get
off debugfs (acked by Greg Kroah-Hartman)
Other minor updates and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXokgWRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgrHAP0UkKs/52JY4oWa3OIh/OqK+vnCrIwz
zGvDFOYM0fKbwgD9FZWgzlcaYK5m2Cxlhp4VoraZveHMLJUhnEHtdX6X0wk=
=Rebj
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New tracing features:
- The ring buffer is no longer disabled when reading the trace file.
The trace_pipe file was made to be used for live tracing and
reading as it acted like the normal producer/consumer. As the trace
file would not consume the data, the easy way of handling it was to
just disable writes to the ring buffer.
This came to a surprise to the BPF folks who complained about lost
events due to reading. This is no longer an issue. If someone wants
to keep the old disabling there's a new option "pause-on-trace"
that can be set.
- New set_ftrace_notrace_pid file. PIDs in this file will not be
traced by the function tracer.
Similar to set_ftrace_pid, which makes the function tracer only
trace those tasks with PIDs in the file, the set_ftrace_notrace_pid
does the reverse.
- New set_event_notrace_pid file. PIDs in this file will cause events
not to be traced if triggered by a task with a matching PID.
Similar to the set_event_pid file but will not be traced. Note,
sched_waking and sched_switch events may still be traced if one of
the tasks referenced by those events contains a PID that is allowed
to be traced.
Tracing related features:
- New bootconfig option, that is attached to the initrd file.
If bootconfig is on the command line, then the initrd file is
searched looking for a bootconfig appended at the end.
- New GPU tracepoint infrastructure to help the gfx drivers to get
off debugfs (acked by Greg Kroah-Hartman)
And other minor updates and fixes"
* tag 'trace-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
tracing: Do not allocate buffer in trace_find_next_entry() in atomic
tracing: Add documentation on set_ftrace_notrace_pid and set_event_notrace_pid
selftests/ftrace: Add test to test new set_event_notrace_pid file
selftests/ftrace: Add test to test new set_ftrace_notrace_pid file
tracing: Create set_event_notrace_pid to not trace tasks
ftrace: Create set_ftrace_notrace_pid to not trace tasks
ftrace: Make function trace pid filtering a bit more exact
ftrace/kprobe: Show the maxactive number on kprobe_events
tracing: Have the document reflect that the trace file keeps tracing enabled
ring-buffer/tracing: Have iterator acknowledge dropped events
tracing: Do not disable tracing when reading the trace file
ring-buffer: Do not disable recording when there is an iterator
ring-buffer: Make resize disable per cpu buffer instead of total buffer
ring-buffer: Optimize rb_iter_head_event()
ring-buffer: Do not die if rb_iter_peek() fails more than thrice
ring-buffer: Have rb_iter_head_event() handle concurrent writer
ring-buffer: Add page_stamp to iterator for synchronization
ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()
ring-buffer: Have ring_buffer_empty() not depend on tracing stopped
tracing: Save off entry when peeking at next entry
...
When dumping out the trace data in latency format, a check is made to peek
at the next event to compare its timestamp to the current one, and if the
delta is of a greater size, it will add a marker showing so. But to do this,
it needs to save the current event otherwise peeking at the next event will
remove the current event. To save the event, a temp buffer is used, and if
the event is bigger than the temp buffer, the temp buffer is freed and a
bigger buffer is allocated.
This allocation is a problem when called in atomic context. The only way
this gets called via atomic context is via ftrace_dump(). Thus, use a static
buffer of 128 bytes (which covers most events), and if the event is bigger
than that, simply return NULL. The callers of trace_find_next_entry() need
to handle a NULL case, as that's what would happen if the allocation failed.
Link: https://lore.kernel.org/r/20200326091256.GR11705@shao2-debian
Fixes: ff895103a8 ("tracing: Save off entry when peeking at next entry")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Introduce types and configs for bpf programs that can be attached to
LSM hooks. The programs can be enabled by the config option
CONFIG_BPF_LSM.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Florent Revest <revest@google.com>
Reviewed-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
There's currently a way to select a task that should only have its events
traced, but there's no way to select a task not to have itsevents traced.
Add a set_event_notrace_pid file that acts the same as set_event_pid (and is
also affected by event-fork), but the task pids in this file will not be
traced even if they are listed in the set_event_pid file. This makes it easy
for tools like trace-cmd to "hide" itself from beint traced by events when
it is recording other tasks.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There's currently a way to select a task that should only be traced by
functions, but there's no way to select a task not to be traced by the
function tracer. Add a set_ftrace_notrace_pid file that acts the same as
set_ftrace_pid (and is also affected by function-fork), but the task pids in
this file will not be traced even if they are listed in the set_ftrace_pid
file. This makes it easy for tools like trace-cmd to "hide" itself from the
function tracer when it is recording other tasks.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The set_ftrace_pid file is used to filter function tracing to only trace
tasks that are listed in that file. Instead of testing the pids listed in
that file (it's a bitmask) at each function trace event, the logic is done
via a sched_switch hook. A flag is set when the next task to run is in the
list of pids in the set_ftrace_pid file. But the sched_switch hook is not at
the exact location of when the task switches, and the flag gets set before
the task to be traced actually runs. This leaves a residue of traced
functions that do not belong to the pid that should be filtered on.
By changing the logic slightly, where instead of having a boolean flag to
test, record the pid that should be traced, with special values for not to
trace and always trace. Then at each function call, a check will be made to
see if the function should be ignored, or if the current pid matches the
function that should be traced, and only trace if it matches (or if it has
the special value to always trace).
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Have the ring_buffer_iterator set a flag if events were dropped as it were
to go and peek at the next event. Have the trace file display this fact if
it happened with a "LOST EVENTS" message.
Link: http://lkml.kernel.org/r/20200317213417.045858900@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When opening the "trace" file, it is no longer necessary to disable tracing.
Note, a new option is created called "pause-on-trace", when set, will cause
the trace file to emulate its original behavior.
Link: http://lkml.kernel.org/r/20200317213416.903351225@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that the iterator can handle a concurrent writer, do not disable writing
to the ring buffer when there is an iterator present.
Link: http://lkml.kernel.org/r/20200317213416.759770696@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When the ring buffer becomes writable for even when the trace file is read,
it must still not be resized. But since tracers can be activated while the
trace file is being read, the irqsoff tracer can modify the per CPU buffers,
and this can cause the reader of the trace file to update the wrong buffer's
resize disable bit, as the irqsoff tracer swaps out cpu buffers.
By making the resize disable per cpu_buffer, it makes the update follow the
per cpu_buffer even if it's swapped out with the snapshot buffer and keeps
the release of the trace file modifying the same data as the open did.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Overlapping header include additions in macsec.c
A bug fix in 'net' overlapping with the removal of 'version'
string in ena_netdev.c
Overlapping test additions in selftests Makefile
Overlapping PCI ID table adjustments in iwlwifi driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously the system would lock up if ftrace was enabled together with
KCSAN. This is due to recursion on reporting if the tracer code is
instrumented with KCSAN.
To avoid this for all types of tracing, disable KCSAN instrumentation
for all of kernel/trace.
Furthermore, since KCSAN relies on udelay() to introduce delay, we have
to disable ftrace for udelay() (currently done for x86) in case KCSAN is
used together with lockdep and ftrace. The reason is that it may corrupt
lockdep IRQ flags tracing state due to a peculiar case of recursion
(details in Makefile comment).
Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Qian Cai <cai@lca.pw>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As it is fine to perform several "peeks" of event data in the ring buffer
via the iterator before moving it forward, do not re-read the event, just
return what was read before. Otherwise, it can cause inconsistent results,
especially when testing multiple CPU buffers to interleave them.
Link: http://lkml.kernel.org/r/20200317213416.592032170@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As the iterator will be reading a live buffer, and if the event being read
is on a page that a writer crosses, it will fail and try again, the
condition in rb_iter_peek() that only allows a retry to happen three times
is no longer valid. Allow rb_iter_peek() to retry more than three times
without killing the ring buffer, but only if rb_iter_head_event() had failed
at least once.
Link: http://lkml.kernel.org/r/20200317213416.452888193@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Have the ring_buffer_iter structure have a place to store an event, such
that it can not be overwritten by a writer, and load it in such a way via
rb_iter_head_event() that it will return NULL and reset the iter to the
start of the current page if a writer updated the page.
Link: http://lkml.kernel.org/r/20200317213416.306959216@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Have the ring_buffer_iter structure contain a page_stamp, such that it can
be used to see if the writer entered the page the iterator is on. When going
to a new page, the iterator will record the time stamp of that page. When
reading events, it can copy the event to an internal buffer on the iterator
(to be implemented later), then check the page's time stamp with its own to
see if the writer entered the page. If so, it will need to try to read the
event again.
Link: http://lkml.kernel.org/r/20200317213416.163549674@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When the ring buffer was first created, the iterator followed the normal
producer/consumer operations where it had both a peek() operation, that just
returned the event at the current location, and a read(), that would return
the event at the current location and also increment the iterator such that
the next peek() or read() will return the next event.
The only use of the ring_buffer_read() is currently to move the iterator to
the next location and nothing now actually reads the event it returns.
Rename this function to its actual use case to ring_buffer_iter_advance(),
which also adds the "iter" part to the name, which is more meaningful. As
the timestamp returned by ring_buffer_read() was never used, there's no
reason that this new version should bother having returning it. It will also
become a void function.
Link: http://lkml.kernel.org/r/20200317213416.018928618@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It was complained about that when the trace file is read, that the tracing
is disabled, as the iterator expects writing to the buffer it reads is not
updated. Several steps are needed to make the iterator handle a writer,
by testing if things have changed as it reads.
This step is to make ring_buffer_empty() expect the buffer to be changing.
Note if the current location of the iterator is overwritten, then it will
return false as new data is being added. Note, that this means that data
will be skipped.
Link: http://lkml.kernel.org/r/20200317213415.870741809@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In order to have the iterator read the buffer even when it's still updating,
it requires that the ring buffer iterator saves each event in a separate
location outside the ring buffer such that its use is immutable.
There's one use case that saves off the event returned from the ring buffer
interator and calls it again to look at the next event, before going back to
use the first event. As the ring buffer iterator will only have a single
copy, this use case will no longer be supported.
Instead, have the one use case create its own buffer to store the first
event when looking at the next event. This way, when looking at the first
event again, it wont be corrupted by the second read.
Link: http://lkml.kernel.org/r/20200317213415.722539921@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Clang warns:
../kernel/trace/trace.c:9335:33: warning: array comparison always
evaluates to true [-Wtautological-compare]
if (__stop___trace_bprintk_fmt != __start___trace_bprintk_fmt)
^
1 warning generated.
These are not true arrays, they are linker defined symbols, which are
just addresses. Using the address of operator silences the warning and
does not change the runtime result of the check (tested with some print
statements compiled in with clang + ld.lld and gcc + ld.bfd in QEMU).
Link: http://lkml.kernel.org/r/20200220051011.26113-1-natechancellor@gmail.com
Link: https://github.com/ClangBuiltLinux/linux/issues/893
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-03-13
The following pull-request contains BPF updates for your *net-next* tree.
We've added 86 non-merge commits during the last 12 day(s) which contain
a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).
The main changes are:
1) Add modify_return attach type which allows to attach to a function via
BPF trampoline and is run after the fentry and before the fexit programs
and can pass a return code to the original caller, from KP Singh.
2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
objects to be visible in /proc/kallsyms so they can be annotated in
stack traces, from Jiri Olsa.
3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
in order to enable this for BPF based socket dispatch, from Lorenz Bauer.
4) Introduce a new bpftool 'prog profile' command which attaches to existing
BPF programs via fentry and fexit hooks and reads out hardware counters
during that period, from Song Liu. Example usage:
bpftool prog profile id 337 duration 3 cycles instructions llc_misses
4228 run_cnt
3403698 cycles (84.08%)
3525294 instructions # 1.04 insn per cycle (84.05%)
13 llc_misses # 3.69 LLC misses per million isns (83.50%)
5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
of a new bpf_link abstraction to keep in particular BPF tracing programs
attached even when the applicaion owning them exits, from Andrii Nakryiko.
6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering
and which returns the PID as seen by the init namespace, from Carlos Neira.
7) Refactor of RISC-V JIT code to move out common pieces and addition of a
new RV32G BPF JIT compiler, from Luke Nelson.
8) Add gso_size context member to __sk_buff in order to be able to know whether
a given skb is GSO or not, from Willem de Bruijn.
9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf 2020-03-12
The following pull-request contains BPF updates for your *net* tree.
We've added 12 non-merge commits during the last 8 day(s) which contain
a total of 12 files changed, 161 insertions(+), 15 deletions(-).
The main changes are:
1) Andrii fixed two bugs in cgroup-bpf.
2) John fixed sockmap.
3) Luke fixed x32 jit.
4) Martin fixed two issues in struct_ops.
5) Yonghong fixed bpf_send_signal.
6) Yoshiki fixed BTF enum.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new helper that reuses existing xdp perf_event output
implementation, but can be called from raw_tracepoint programs
that receive 'struct xdp_buff *' as a tracepoint argument.
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial
New bpf helper bpf_get_ns_current_pid_tgid,
This helper will return pid and tgid from current task
which namespace matches dev_t and inode number provided,
this will allows us to instrument a process inside a container.
Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200304204157.58695-3-cneirabustos@gmail.com
can break live patching.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXmj5ExQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6quQGAQDO35RBAQDGmpxnSCQPNwrzqokM8p8d
1e1xshwOVnwqgAEA7csC4u1n5Z8ncIl5Pd8ygt4nXeqw4AenHLeZIdhfegc=
=+AeW
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Have ftrace lookup_rec() return a consistent record otherwise it can
break live patching"
* tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Return the first found result in lookup_rec()
It appears that ip ranges can overlap so. In that case lookup_rec()
returns whatever results it got last even if it found nothing in last
searched page.
This breaks an obscure livepatch late module patching usecase:
- load livepatch
- load the patched module
- unload livepatch
- try to load livepatch again
To fix this return from lookup_rec() as soon as it found the record
containing searched-for ip. This used to be this way prior lookup_rec()
introduction.
Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com
Cc: stable@vger.kernel.org
Fixes: 7e16f581a8 ("ftrace: Separate out functionality from ftrace_location_range()")
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5j8hwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnjID/4/XVrqtVNUzVoVOtkOyxyesBrJVMHEQEpJ
PZssv835IStw0ENhxQJfGjPaIFc9Ff6PMkeN5KRAlMoEc+NkrJShF3owGf+6Bps7
rxpblPxaw+CJFa31YBDZVjMCvbVkDm40G5SsJh+xzdIjlWz7MppkkMPdrErPwY8V
0vnrIc+mKBKfBMZTwVkycYtp17LVgfXguledoWzxM1y47IW5UasKh8jdzhbu8Hvt
zztdQrigUdb+9XnLGCZIY0JQOyrhJ5zQpZ40FzbvxdYrQZXOoYT8L7iFu/z0Wi7K
p3a+G+B4WowtLYW78me4Uut5RrHq2XOehSypfujanQlpgXPGjS3TdHT3an2T8XPQ
NyGsZsn/eLm3btNbhGUd8vqpQy5EmWhqmwvYk9tFAoSFLiLcvCC624b/TCYPL+gk
3ZiI7mXBMjHnUZ0J/RF6kZWTAZDvr/tE7UZt1f8r1eEr8VDzCNp5Pst+HCVIguYD
g9eWF8oH6wYoj39UKf1k+vW2GjXGFsnfivObaxhyz03sAPXK2wQlzAe/4jZ24XNr
TRtOXh97c3CbLAwdUHehlzzdR3U7h0n2KsmrTC5AGmLABmR79s7BJ0+pexuZituO
LwU8+gpf7AugHTrLg1eNXAmBHW44I1ticXYiWcT4iSPn99kNIhlW+Jb1iTGoiu7n
nXyS3b5SCw==
=xwKl
-----END PGP SIGNATURE-----
Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Here are a few fixes that should go into this release. This contains:
- Revert of a bad bcache patch from this merge window
- Removed unused function (Daniel)
- Fixup for the blktrace fix from Jan from this release (Cengiz)
- Fix of deeper level bfqq overwrite in BFQ (Carlo)"
* tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
blktrace: fix dereference after null check
Revert "bcache: ignore pending signals when creating gc and allocator thread"
block: Remove used kblockd_schedule_work_on()
test_run.o is not built when CONFIG_NET is not set and
bpf_prog_test_run_tracing being referenced in bpf_trace.o causes the
linker error:
ld: kernel/trace/bpf_trace.o:(.rodata+0x38): undefined reference to
`bpf_prog_test_run_tracing'
Add a __weak function in bpf_trace.c to handle this.
Fixes: da00d2f117 ("bpf: Add test ops for BPF_PROG_TYPE_TRACING")
Signed-off-by: KP Singh <kpsingh@google.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305220127.29109-1-kpsingh@chromium.org
When experimenting with bpf_send_signal() helper in our production
environment (5.2 based), we experienced a deadlock in NMI mode:
#5 [ffffc9002219f770] queued_spin_lock_slowpath at ffffffff8110be24
#6 [ffffc9002219f770] _raw_spin_lock_irqsave at ffffffff81a43012
#7 [ffffc9002219f780] try_to_wake_up at ffffffff810e7ecd
#8 [ffffc9002219f7e0] signal_wake_up_state at ffffffff810c7b55
#9 [ffffc9002219f7f0] __send_signal at ffffffff810c8602
#10 [ffffc9002219f830] do_send_sig_info at ffffffff810ca31a
#11 [ffffc9002219f868] bpf_send_signal at ffffffff8119d227
#12 [ffffc9002219f988] bpf_overflow_handler at ffffffff811d4140
#13 [ffffc9002219f9e0] __perf_event_overflow at ffffffff811d68cf
#14 [ffffc9002219fa10] perf_swevent_overflow at ffffffff811d6a09
#15 [ffffc9002219fa38] ___perf_sw_event at ffffffff811e0f47
#16 [ffffc9002219fc30] __schedule at ffffffff81a3e04d
#17 [ffffc9002219fc90] schedule at ffffffff81a3e219
#18 [ffffc9002219fca0] futex_wait_queue_me at ffffffff8113d1b9
#19 [ffffc9002219fcd8] futex_wait at ffffffff8113e529
#20 [ffffc9002219fdf0] do_futex at ffffffff8113ffbc
#21 [ffffc9002219fec0] __x64_sys_futex at ffffffff81140d1c
#22 [ffffc9002219ff38] do_syscall_64 at ffffffff81002602
#23 [ffffc9002219ff50] entry_SYSCALL_64_after_hwframe at ffffffff81c00068
The above call stack is actually very similar to an issue
reported by Commit eac9153f2b ("bpf/stackmap: Fix deadlock with
rq_lock in bpf_get_stack()") by Song Liu. The only difference is
bpf_send_signal() helper instead of bpf_get_stack() helper.
The above deadlock is triggered with a perf_sw_event.
Similar to Commit eac9153f2b, the below almost identical reproducer
used tracepoint point sched/sched_switch so the issue can be easily caught.
/* stress_test.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define THREAD_COUNT 1000
char *filename;
void *worker(void *p)
{
void *ptr;
int fd;
char *pptr;
fd = open(filename, O_RDONLY);
if (fd < 0)
return NULL;
while (1) {
struct timespec ts = {0, 1000 + rand() % 2000};
ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
usleep(1);
if (ptr == MAP_FAILED) {
printf("failed to mmap\n");
break;
}
munmap(ptr, 4096 * 64);
usleep(1);
pptr = malloc(1);
usleep(1);
pptr[0] = 1;
usleep(1);
free(pptr);
usleep(1);
nanosleep(&ts, NULL);
}
close(fd);
return NULL;
}
int main(int argc, char *argv[])
{
void *ptr;
int i;
pthread_t threads[THREAD_COUNT];
if (argc < 2)
return 0;
filename = argv[1];
for (i = 0; i < THREAD_COUNT; i++) {
if (pthread_create(threads + i, NULL, worker, NULL)) {
fprintf(stderr, "Error creating thread\n");
return 0;
}
}
for (i = 0; i < THREAD_COUNT; i++)
pthread_join(threads[i], NULL);
return 0;
}
and the following command:
1. run `stress_test /bin/ls` in one windown
2. hack bcc trace.py with the following change:
--- a/tools/trace.py
+++ b/tools/trace.py
@@ -513,6 +513,7 @@ BPF_PERF_OUTPUT(%s);
__data.tgid = __tgid;
__data.pid = __pid;
bpf_get_current_comm(&__data.comm, sizeof(__data.comm));
+ bpf_send_signal(10);
%s
%s
%s.perf_submit(%s, &__data, sizeof(__data));
3. in a different window run
./trace.py -p $(pidof stress_test) t:sched:sched_switch
The deadlock can be reproduced in our production system.
Similar to Song's fix, the fix is to delay sending signal if
irqs is disabled to avoid deadlocks involving with rq_lock.
With this change, my above stress-test in our production system
won't cause deadlock any more.
I also implemented a scale-down version of reproducer in the
selftest (a subsequent commit). With latest bpf-next,
it complains for the following potential deadlock.
[ 32.832450] -> #1 (&p->pi_lock){-.-.}:
[ 32.833100] _raw_spin_lock_irqsave+0x44/0x80
[ 32.833696] task_rq_lock+0x2c/0xa0
[ 32.834182] task_sched_runtime+0x59/0xd0
[ 32.834721] thread_group_cputime+0x250/0x270
[ 32.835304] thread_group_cputime_adjusted+0x2e/0x70
[ 32.835959] do_task_stat+0x8a7/0xb80
[ 32.836461] proc_single_show+0x51/0xb0
...
[ 32.839512] -> #0 (&(&sighand->siglock)->rlock){....}:
[ 32.840275] __lock_acquire+0x1358/0x1a20
[ 32.840826] lock_acquire+0xc7/0x1d0
[ 32.841309] _raw_spin_lock_irqsave+0x44/0x80
[ 32.841916] __lock_task_sighand+0x79/0x160
[ 32.842465] do_send_sig_info+0x35/0x90
[ 32.842977] bpf_send_signal+0xa/0x10
[ 32.843464] bpf_prog_bc13ed9e4d3163e3_send_signal_tp_sched+0x465/0x1000
[ 32.844301] trace_call_bpf+0x115/0x270
[ 32.844809] perf_trace_run_bpf_submit+0x4a/0xc0
[ 32.845411] perf_trace_sched_switch+0x10f/0x180
[ 32.846014] __schedule+0x45d/0x880
[ 32.846483] schedule+0x5f/0xd0
...
[ 32.853148] Chain exists of:
[ 32.853148] &(&sighand->siglock)->rlock --> &p->pi_lock --> &rq->lock
[ 32.853148]
[ 32.854451] Possible unsafe locking scenario:
[ 32.854451]
[ 32.855173] CPU0 CPU1
[ 32.855745] ---- ----
[ 32.856278] lock(&rq->lock);
[ 32.856671] lock(&p->pi_lock);
[ 32.857332] lock(&rq->lock);
[ 32.857999] lock(&(&sighand->siglock)->rlock);
Deadlock happens on CPU0 when it tries to acquire &sighand->siglock
but it has been held by CPU1 and CPU1 tries to grab &rq->lock
and cannot get it.
This is not exactly the callstack in our production environment,
but sympotom is similar and both locks are using spin_lock_irqsave()
to acquire the lock, and both involves rq_lock. The fix to delay
sending signal when irq is disabled also fixed this issue.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200304191104.2796501-1-yhs@fb.com
There was a recent change in blktrace.c that added a RCU protection to
`q->blk_trace` in order to fix a use-after-free issue during access.
However the change missed an edge case that can lead to dereferencing of
`bt` pointer even when it's NULL:
Coverity static analyzer marked this as a FORWARD_NULL issue with CID
1460458.
```
/kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
1898 ret = 0;
1899 if (bt == NULL)
1900 ret = blk_trace_setup_queue(q, bdev);
1901
1902 if (ret == 0) {
1903 if (attr == &dev_attr_act_mask)
>>> CID 1460458: Null pointer dereferences (FORWARD_NULL)
>>> Dereferencing null pointer "bt".
1904 bt->act_mask = value;
1905 else if (attr == &dev_attr_pid)
1906 bt->pid = value;
1907 else if (attr == &dev_attr_start_lba)
1908 bt->start_lba = value;
1909 else if (attr == &dev_attr_end_lba)
```
Added a reassignment with RCU annotation to fix the issue.
Fixes: c780e86dd4 ("blktrace: Protect q->blk_trace with RCU")
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current fexit and fentry tests rely on a different program to
exercise the functions they attach to. Instead of doing this, implement
the test operations for tracing which will also be used for
BPF_MODIFY_RETURN in a subsequent patch.
Also, clean up the fexit test to use the generated skeleton.
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org
Commit 567cd4da54 ("ring-buffer: User context bit recursion checking")
added the TRACE_BUFFER bits to be used in the current task's trace_recursion
field. But the final submission of the logic removed the use of those bits,
but never removed the bits themselves (they were never used in upstream
Linux). These can be safely removed.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The hwlat tracer runs a loop of width time during a given window. It then
reports the max latency over a given threshold and records a timestamp. But
this timestamp is the time after the width has finished, and not the time it
actually triggered.
Record the actual time when the latency was greater than the threshold as
well as the number of times it was greater in a given width per window.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-02-28
The following pull-request contains BPF updates for your *net-next* tree.
We've added 41 non-merge commits during the last 7 day(s) which contain
a total of 49 files changed, 1383 insertions(+), 499 deletions(-).
The main changes are:
1) BPF and Real-Time nicely co-exist.
2) bpftool feature improvements.
3) retrieve bpf_sk_storage via INET_DIAG.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The mptcp conflict was overlapping additions.
The SMC conflict was an additional and removal happening at the same
time.
Signed-off-by: David S. Miller <davem@davemloft.net>
Change in API of bootconfig (before it comes live in a release)
- Have a magic value "BOOTCONFIG" in initrd to know a bootconfig exists
- Set CONFIG_BOOT_CONFIG to 'n' by default
- Show error if "bootconfig" on cmdline but not compiled in
- Prevent redefining the same value
- Have a way to append values
- Added a SELECT BLK_DEV_INITRD to fix a build failure
Synthetic event fixes:
- Switch to raw_smp_processor_id() for recording CPU value in preempt
section. (No care for what the value actually is)
- Fix samples always recording u64 values
- Fix endianess
- Check number of values matches number of fields
- Fix a printing bug
Fix of trace_printk() breaking postponed start up tests
Make a function static that is only used in a single file.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXlW4vxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtioAP0WLEm3dWO0z3321h/a0DSshC+Bslu3
HDPTsGVGrXmvggEA/lr1ikRHd8PsO7zW8BfaZMxoXaTqXiuSrzEWxnMlFw0=
=O8PM
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing and bootconfig updates:
"Fixes and changes to bootconfig before it goes live in a release.
Change in API of bootconfig (before it comes live in a release):
- Have a magic value "BOOTCONFIG" in initrd to know a bootconfig
exists
- Set CONFIG_BOOT_CONFIG to 'n' by default
- Show error if "bootconfig" on cmdline but not compiled in
- Prevent redefining the same value
- Have a way to append values
- Added a SELECT BLK_DEV_INITRD to fix a build failure
Synthetic event fixes:
- Switch to raw_smp_processor_id() for recording CPU value in preempt
section. (No care for what the value actually is)
- Fix samples always recording u64 values
- Fix endianess
- Check number of values matches number of fields
- Fix a printing bug
Fix of trace_printk() breaking postponed start up tests
Make a function static that is only used in a single file"
* tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
bootconfig: Add append value operator support
bootconfig: Prohibit re-defining value on same key
bootconfig: Print array as multiple commands for legacy command line
bootconfig: Reject subkey and value on same parent key
tools/bootconfig: Remove unneeded error message silencer
bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
bootconfig: Set CONFIG_BOOT_CONFIG=n by default
tracing: Clear trace_state when starting trace
bootconfig: Mark boot_config_checksum() static
tracing: Disable trace_printk() on post poned tests
tracing: Have synthetic event test use raw_smp_processor_id()
tracing: Fix number printing bug in print_synth_event()
tracing: Check that number of vals matches number of synth event fields
tracing: Make synth_event trace functions endian-correct
tracing: Make sure synth_event_trace() example always uses u64
Since commit d8a953ddde ("bootconfig: Set CONFIG_BOOT_CONFIG=n by
default") also changed the CONFIG_BOOTTIME_TRACING to select
CONFIG_BOOT_CONFIG to show the boot-time tracing on the menu,
it introduced wrong dependencies with BLK_DEV_INITRD as below.
WARNING: unmet direct dependencies detected for BOOT_CONFIG
Depends on [n]: BLK_DEV_INITRD [=n]
Selected by [y]:
- BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y]
This makes the CONFIG_BOOT_CONFIG selects CONFIG_BLK_DEV_INITRD to
fix this error and make CONFIG_BOOTTIME_TRACING=n by default, so
that both boot-time tracing and boot configuration off but those
appear on the menu list.
Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2
Fixes: d8a953ddde ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Compiled-tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
KASAN is reporting that __blk_add_trace() has a use-after-free issue
when accessing q->blk_trace. Indeed the switching of block tracing (and
thus eventual freeing of q->blk_trace) is completely unsynchronized with
the currently running tracing and thus it can happen that the blk_trace
structure is being freed just while __blk_add_trace() works on it.
Protect accesses to q->blk_trace by RCU during tracing and make sure we
wait for the end of RCU grace period when shutting down tracing. Luckily
that is rare enough event that we can afford that. Note that postponing
the freeing of blk_trace to an RCU callback should better be avoided as
it could have unexpected user visible side-effects as debugfs files
would be still existing for a short while block tracing has been shut
down.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
CC: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reported-by: Tristan Madani <tristmd@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is
invoked from a trace point via __DO_TRACE() which already disables
preemption _before_ invoking any of the functions which are attached to a
trace point.
Remove it and add a cant_sleep() check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.059995527@linutronix.de
trace_call_bpf() no longer disables preemption on its own.
All callers of this function has to do it explicitly.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
All callers are built in. No point to export this.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.
This is redundant because __bpf_trace_run() is invoked from a trace point
via __DO_TRACE() which already disables preemption _before_ invoking any of
the functions which are attached to a trace point.
Remove it and add a cant_sleep() check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de
Set CONFIG_BOOT_CONFIG=n by default. This also warns
user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given
in the kernel command line.
Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Clear trace_state data structure when starting trace
in __synth_event_trace_start() internal function.
Currently trace_state is initialized only in the
synth_event_trace_start() API, but the trace_state
in synth_event_trace() and synth_event_trace_array()
are on the stack without initialization.
This means those APIs will see wrong parameters and
wil skip closing process in __synth_event_trace_end()
because trace_state->disabled may be !0.
Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The tracing seftests checks various aspects of the tracing infrastructure,
and one is filtering. If trace_printk() is active during a self test, it can
cause the filtering to fail, which will disable that part of the trace.
To keep the selftests from failing because of trace_printk() calls,
trace_printk() checks the variable tracing_selftest_running, and if set, it
does not write to the tracing buffer.
As some tracers were registered earlier in boot, the selftest they triggered
would fail because not all the infrastructure was set up for the full
selftest. Thus, some of the tests were post poned to when their
infrastructure was ready (namely file system code). The postpone code did
not set the tracing_seftest_running variable, and could fail if a
trace_printk() was added and executed during their run.
Cc: stable@vger.kernel.org
Fixes: 9afecfbb95 ("tracing: Postpone tracer start-up tests till the system is more robust")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The test code that tests synthetic event creation pushes in as one of its
test fields the current CPU using "smp_processor_id()". As this is just
something to see if the value is correctly passed in, and the actual CPU
used does not matter, use raw_smp_processor_id(), otherwise with debug
preemption enabled, a warning happens as the smp_processor_id() is called
without preemption enabled.
Link: http://lkml.kernel.org/r/20200220162950.35162579@gandalf.local.home
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix a varargs-related bug in print_synth_event() which resulted in
strange output and oopses on 32-bit x86 systems. The problem is that
trace_seq_printf() expects the varargs to match the format string, but
print_synth_event() was always passing u64 values regardless. This
results in unspecified behavior when unpacking with va_arg() in
trace_seq_printf().
Add a function that takes the size into account when calling
trace_seq_printf().
Before:
modprobe-1731 [003] .... 919.039758: gen_synth_test: next_pid_field=777(null)next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=3(null)my_string_field=thneed my_int_field=598(null)
After:
insmod-1136 [001] .... 36.634590: gen_synth_test: next_pid_field=777 next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=1 my_string_field=thneed my_int_field=598
Link: http://lkml.kernel.org/r/a9b59eb515dbbd7d4abe53b347dccf7a8e285657.1581720155.git.zanussi@kernel.org
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 7276531d4036('tracing: Consolidate trace() functions')
inadvertently dropped the synth_event_trace() and
synth_event_trace_array() checks that verify the number of values
passed in matches the number of fields in the synthetic event being
traced, so add them back.
Link: http://lkml.kernel.org/r/32819cac708714693669e0dfe10fe9d935e94a16.1581720155.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
synth_event_trace(), synth_event_trace_array() and
__synth_event_add_val() write directly into the trace buffer and need
to take endianness into account, like trace_event_raw_event_synth()
does.
Link: http://lkml.kernel.org/r/2011354355e405af9c9d28abba430d1f5ff7771a.1581720155.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
synth_event_trace() is the varargs version of synth_event_trace_array(),
which takes an array of u64, as do synth_event_add_val() et al.
To not only be consistent with those, but also to address the fact
that synth_event_trace() expects every arg to be of the same type
since it doesn't also pass in e.g. a format string, the caller needs
to make sure all args are of the same type, u64. u64 is used because
it needs to accomodate the largest type available in synthetic events,
which is u64.
This fixes the bug reported by the kernel test robot/Rong Chen.
Link: https://lore.kernel.org/lkml/20200212113444.GS12867@shao2-debian/
Link: http://lkml.kernel.org/r/894c4e955558b521210ee0642ba194a9e603354c.1581720155.git.zanussi@kernel.org
Fixes: 9fe41efaca ("tracing: Add synth event generation test module")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.
We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.
Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz
bpf_perf_event_read_value() is NMI safe. Enable it for all BPF programs.
This can be used in fentry/fexit to profile BPF program and individual
kernel function with hardware counters.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200214234146.2910011-1-songliubraving@fb.com
- Fix an uninitialized variable
- Fix compile bug to bootconfig userspace tool (in tools directory)
- Suppress some error messages of bootconfig userspace tool
- Remove unneded CONFIG_LIBXBC from bootconfig
- Allocate bootconfig xbc_nodes dynamically.
To ease complaints about taking up static memory at boot up
- Use of parse_args() to parse bootconfig instead of strstr() usage
Prevents issues of double quotes containing the interested string
- Fix missing ring_buffer_nest_end() on synthetic event error path
- Return zero not -EINVAL on soft disabled synthetic event
(soft disabling must be the same as hard disabling, which returns zero)
- Consolidate synthetic event code (remove duplicate code)
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXkMTwRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnJQAQD5aKiD6jx+zGtLsFTuZMcEGvhhUuJ6
oUaSvhKO3UqezwD/V5avKuuC0wWt//gOWDY+0+4QNjmvn1GLnaNYohs6fg0=
=NLT9
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fixes:
- Fix an uninitialized variable
- Fix compile bug to bootconfig userspace tool (in tools directory)
- Suppress some error messages of bootconfig userspace tool
- Remove unneded CONFIG_LIBXBC from bootconfig
- Allocate bootconfig xbc_nodes dynamically. To ease complaints about
taking up static memory at boot up
- Use of parse_args() to parse bootconfig instead of strstr() usage
Prevents issues of double quotes containing the interested string
- Fix missing ring_buffer_nest_end() on synthetic event error path
- Return zero not -EINVAL on soft disabled synthetic event (soft
disabling must be the same as hard disabling, which returns zero)
- Consolidate synthetic event code (remove duplicate code)"
* tag 'trace-v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Consolidate trace() functions
tracing: Don't return -EINVAL when tracing soft disabled synth events
tracing: Add missing nest end to synth_event_trace_start() error case
tools/bootconfig: Suppress non-error messages
bootconfig: Allocate xbc_nodes array dynamically
bootconfig: Use parse_args() to find bootconfig and '--'
tracing/kprobe: Fix uninitialized variable bug
bootconfig: Remove unneeded CONFIG_LIBXBC
tools/bootconfig: Fix wrong __VA_ARGS__ usage
Move the checking, buffer reserve and buffer commit code in
synth_event_trace_start/end() into inline functions
__synth_event_trace_start/end() so they can also be used by
synth_event_trace() and synth_event_trace_array(), and then have all
those functions use them.
Also, change synth_event_trace_state.enabled to disabled so it only
needs to be set if the event is disabled, which is not normally the
case.
Link: http://lkml.kernel.org/r/b1f3108d0f450e58192955a300e31d0405ab4149.1581374549.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There's no reason to return -EINVAL when tracing a synthetic event if
it's soft disabled - treat it the same as if it were hard disabled and
return normally.
Have synth_event_trace() and synth_event_trace_array() just return
normally, and have synth_event_trace_start set the trace state to
disabled and return.
Link: http://lkml.kernel.org/r/df5d02a1625aff97c9866506c5bada6a069982ba.1581374549.git.zanussi@kernel.org
Fixes: 8dcc53ad95 ("tracing: Add synth_event_trace() and related functions")
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If the ring_buffer reserve in synth_event_trace_start() fails, the
matching ring_buffer_nest_end() should be called in the error code,
since nothing else will ever call it in this case.
Link: http://lkml.kernel.org/r/20abc444b3eeff76425f895815380abe7aa53ff8.1581374549.git.zanussi@kernel.org
Fixes: 8dcc53ad95 ("tracing: Add synth_event_trace() and related functions")
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There is a potential execution path in which variable *ret* is returned
without being properly initialized, previously.
Fix this by initializing variable *ret* to 0.
Link: http://lkml.kernel.org/r/20200205223404.GA3379@embeddedor
Addresses-Coverity-ID: 1491142 ("Uninitialized scalar variable")
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Added new "bootconfig".
Looks for a file appended to initrd to add boot config options.
This has been discussed thoroughly at Linux Plumbers.
Very useful for adding kprobes at bootup.
Only enabled if "bootconfig" is on the real kernel command line.
- Created dynamic event creation.
Merges common code between creating synthetic events and
kprobe events.
- Rename perf "ring_buffer" structure to "perf_buffer"
- Rename ftrace "ring_buffer" structure to "trace_buffer"
Had to rename existing "trace_buffer" to "array_buffer"
- Allow trace_printk() to work withing (some) tracing code.
- Sort of tracing configs to be a little better organized
- Fixed bug where ftrace_graph hash was not being protected properly
- Various other small fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXjtAURQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qshOAQDzopQmvAVrrI6oogghr8JQA30Z2yqT
i+Ld7vPWL2MV9wEA1S+zLGDSYrj8f/vsCq6BxRYT1ApO+YtmY6LTXiUejwg=
=WNds
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Added new "bootconfig".
This looks for a file appended to initrd to add boot config options,
and has been discussed thoroughly at Linux Plumbers.
Very useful for adding kprobes at bootup.
Only enabled if "bootconfig" is on the real kernel command line.
- Created dynamic event creation.
Merges common code between creating synthetic events and kprobe
events.
- Rename perf "ring_buffer" structure to "perf_buffer"
- Rename ftrace "ring_buffer" structure to "trace_buffer"
Had to rename existing "trace_buffer" to "array_buffer"
- Allow trace_printk() to work withing (some) tracing code.
- Sort of tracing configs to be a little better organized
- Fixed bug where ftrace_graph hash was not being protected properly
- Various other small fixes and clean ups
* tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits)
bootconfig: Show the number of nodes on boot message
tools/bootconfig: Show the number of bootconfig nodes
bootconfig: Add more parse error messages
bootconfig: Use bootconfig instead of boot config
ftrace: Protect ftrace_graph_hash with ftrace_sync
ftrace: Add comment to why rcu_dereference_sched() is open coded
tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
tracing: Annotate ftrace_graph_hash pointer with __rcu
bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline
tracing: Use seq_buf for building dynevent_cmd string
tracing: Remove useless code in dynevent_arg_pair_add()
tracing: Remove check_arg() callbacks from dynevent args
tracing: Consolidate some synth_event_trace code
tracing: Fix now invalid var_ref_vals assumption in trace action
tracing: Change trace_boot to use synth_event interface
tracing: Move tracing selftests to bottom of menu
tracing: Move mmio tracer config up with the other tracers
tracing: Move tracing test module configs together
tracing: Move all function tracing configs together
tracing: Documentation for in-kernel synthetic event API
...
As function_graph tracer can run when RCU is not "watching", it can not be
protected by synchronize_rcu() it requires running a task on each CPU before
it can be freed. Calling schedule_on_each_cpu(ftrace_sync) needs to be used.
Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72
Cc: stable@vger.kernel.org
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Because the function graph tracer can execute in sections where RCU is not
"watching", the rcu_dereference_sched() for the has needs to be open coded.
This is fine because the RCU "flavor" of the ftrace hash is protected by
its own RCU handling (it does its own little synchronization on every CPU
and does not rely on RCU sched).
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix following instances of sparse error
kernel/trace/ftrace.c:5667:29: error: incompatible types in comparison
kernel/trace/ftrace.c:5813:21: error: incompatible types in comparison
kernel/trace/ftrace.c:5868:36: error: incompatible types in comparison
kernel/trace/ftrace.c:5870:25: error: incompatible types in comparison
Use rcu_dereference_protected to dereference the newly annotated pointer.
Link: http://lkml.kernel.org/r/20200205055701.30195-1-frextrite@gmail.com
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix following instances of sparse error
kernel/trace/ftrace.c:5664:29: error: incompatible types in comparison
kernel/trace/ftrace.c:5785:21: error: incompatible types in comparison
kernel/trace/ftrace.c:5864:36: error: incompatible types in comparison
kernel/trace/ftrace.c:5866:25: error: incompatible types in comparison
Use rcu_dereference_protected to access the __rcu annotated pointer.
Link: http://lkml.kernel.org/r/20200201072703.17330-1-frextrite@gmail.com
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull vfs recursive removal updates from Al Viro:
"We have quite a few places where synthetic filesystems do an
equivalent of 'rm -rf', with varying amounts of code duplication,
wrong locking, etc. That really ought to be a library helper.
Only debugfs (and very similar tracefs) are converted here - I have
more conversions, but they'd never been in -next, so they'll have to
wait"
* 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems
The dynevent_cmd commands that build up the command string don't need
to do that themselves - there's a seq_buf facility that does pretty
much the same thing those command are doing manually, so use it
instead.
Link: http://lkml.kernel.org/r/eb8a6e835c964d0ab8a38cbf5ffa60746b54a465.1580506712.git.zanussi@kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It's kind of strange to have check_arg() callbacks as part of the arg
objects themselves; it makes more sense to just pass these in when the
args are added instead.
Remove the check_arg() callbacks from those objects which also means
removing the check_arg() args from the init functions, adding them to
the add functions and fixing up existing callers.
Link: http://lkml.kernel.org/r/c7708d6f177fcbe1a36b6e4e8e150907df0fa5d2.1580506712.git.zanussi@kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The synth_event trace code contains some almost identical functions
and some small functions that are called only once - consolidate the
common code into single functions and fold in the small functions to
simplify the code overall.
Link: http://lkml.kernel.org/r/d1c8d8ad124a653b7543afe801d38c199ca5c20e.1580506712.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The patch 'tracing: Fix histogram code when expression has same var as
value' added code to return an existing variable reference when
creating a new variable reference, which resulted in var_ref_vals
slots being reused instead of being duplicated.
The implementation of the trace action assumes that the end of the
var_ref_vals array starting at action_data.var_ref_idx corresponds to
the values that will be assigned to the trace params. The patch
mentioned above invalidates that assumption, which means that each
param needs to explicitly specify its index into var_ref_vals.
This fix changes action_data.var_ref_idx to an array of var ref
indexes to account for that.
Link: https://lore.kernel.org/r/1580335695.6220.8.camel@kernel.org
Fixes: 8bcebc77e8 ("tracing: Fix histogram code when expression has same var as value")
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Have trace_boot_add_synth_event() use the synth_event interface.
Also, rename synth_event_run_cmd() to synth_event_run_command() now
that trace_boot's version is gone.
Link: http://lkml.kernel.org/r/94f1fa0e31846d0bddca916b8663404b20559e34.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Move all the tracing selftest configs to the bottom of the tracing menu.
There's no reason for them to be interspersed throughout.
Also, move the bootconfig menu to the top.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Move the config that enables the mmiotracer with the other tracers such that
all the tracers are together.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The MMIO test module was by itself, move it to the other test modules. Also,
add the text "Test module" to PREEMPTIRQ_DELAY_TEST as that create a test
module as well.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The features that depend on the function tracer were spread out through the
tracing menu, pull them together as it is easier to manage.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a test module that checks the basic functionality of the in-kernel
kprobe event command generation API by creating kprobe events from a
module.
Link: http://lkml.kernel.org/r/97e502b204f9dba948e3fa3a4315448298218787.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Have trace_boot_add_kprobe_event() use the kprobe_event interface.
Also, rename kprobe_event_run_cmd() to kprobe_event_run_command() now
that trace_boot's version is gone.
Link: http://lkml.kernel.org/r/af5429d11291ab1e9a85a0ff944af3b2bcf193c7.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add functions used to generate kprobe event commands, built on top of
the dynevent_cmd interface.
kprobe_event_gen_cmd_start() is used to create a kprobe event command
using a variable arg list, and kretprobe_event_gen_cmd_start() does
the same for kretprobe event commands. kprobe_event_add_fields() can
be used to add single fields one by one or as a group. Once all
desired fields are added, kprobe_event_gen_cmd_end() or
kretprobe_event_gen_cmd_end() respectively are used to actually
execute the command and create the event.
Link: http://lkml.kernel.org/r/95cc4696502bb6017f9126f306a45ad19b4cc14f.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a test module that checks the basic functionality of the in-kernel
synthetic event generation API by generating and tracing synthetic
events from a module.
Link: http://lkml.kernel.org/r/fcb4dd9eb9eefb70ab20538d3529d51642389664.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add an exported function named synth_event_trace(), allowing modules
or other kernel code to trace synthetic events.
Also added are several functions that allow the same functionality to
be broken out in a piecewise fashion, which are useful in situations
where tracing an event from a full array of values would be
cumbersome. Those functions are synth_event_trace_start/end() and
synth_event_add_(next)_val().
Link: http://lkml.kernel.org/r/7a84de5f1854acf4144b57efe835ca645afa764f.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add functions used to generate synthetic event commands, built on top
of the dynevent_cmd interface.
synth_event_gen_cmd_start() is used to create a synthetic event
command using a variable arg list and
synth_event_gen_cmd_array_start() does the same thing but using an
array of field descriptors. synth_event_add_field(),
synth_event_add_field_str() and synth_event_add_fields() can be used
to add single fields one by one or as a group. Once all desired
fields are added, synth_event_gen_cmd_end() is used to actually
execute the command and create the event.
synth_event_create() does everything, including creating the event, in
a single call.
Link: http://lkml.kernel.org/r/38fef702fad5ef208009f459552f34a94befd860.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add an interface used to build up dynamic event creation commands,
such as synthetic and kprobe events. Interfaces specific to those
particular types of events and others can be built on top of this
interface.
Command creation is started by first using the dynevent_cmd_init()
function to initialize the dynevent_cmd object. Following that, args
are appended and optionally checked by the dynevent_arg_add() and
dynevent_arg_pair_add() functions, which use objects representing
arguments and pairs of arguments, initialized respectively by
dynevent_arg_init() and dynevent_arg_pair_init(). Finally, once all
args have been successfully added, the command is finalized and
actually created using dynevent_create().
The code here for actually printing into the dyn_event->cmd buffer
using snprintf() etc was adapted from v4 of Masami's 'tracing/boot:
Add synthetic event support' patch.
Link: http://lkml.kernel.org/r/1f65fa44390b6f238f6036777c3784ced1dcc6a0.1580323897.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
create_or_delete_synth_event() contains code to delete a synthetic
event, which would be useful on its own - specifically, it would be
useful to allow event-creating modules to call it separately.
Separate out the delete code from that function and create an exported
function named synth_event_delete().
Link: http://lkml.kernel.org/r/050db3b06df7f0a4b8a2922da602d1d879c7c1c2.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a function to get an event file and prevent it from going away on
module or instance removal.
trace_get_event_file() will find an event file in a given instance (if
instance is NULL, it assumes the top trace array) and return it,
pinning the instance's trace array as well as the event's module, if
applicable, so they won't go away while in use.
trace_put_event_file() does the matching release.
Link: http://lkml.kernel.org/r/bb31ac4bdda168d5ed3c4b5f5a4c8f633e8d9118.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
[ Moved trace_array_put() to end of trace_put_event_file() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a new trace_array_find() function that can be used to find a trace
array given the instance name, and replace existing code that does the
same thing with it. Also add trace_array_find_get() which does the
same but returns the trace array after upping its refcount.
Also make both available for use outside of trace.c.
Link: http://lkml.kernel.org/r/cb68528c975eba95bee4561ac67dd1499423b2e5.1580323897.git.zanussi@kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
# Available triggers:
# traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
6+1 records in
6+1 records out
206 bytes copied, 0.00027916 s, 738 kB/s
Notice the printing of "# Available triggers:..." after the line.
With the patch:
# dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
2+1 records in
2+1 records out
88 bytes copied, 0.000526867 s, 167 kB/s
It only prints the end of the file, and does not restart.
Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.comhttps://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Link: http://lkml.kernel.org/r/7ad85b22-1866-977c-db17-88ac438bc764@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
[ This is not a bug fix, it just makes it "technically correct"
which is why I applied it. NULL is only returned on an anomaly
which triggers a WARN_ON ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.
Without patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
no pid
2+1 records in
2+1 records out
10 bytes copied, 0.000213285 s, 46.9 kB/s
Notice the "id" followed by "no pid".
With the patch:
# dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
id
0+1 records in
0+1 records out
3 bytes copied, 0.000202112 s, 14.8 kB/s
Notice that it only prints "id" and not the "no pid" afterward.
Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.comhttps://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reading the sched_cmdline_ref and sched_tgid_ref initial state within
tracing_start_sched_switch without holding the sched_register_mutex is
racy against concurrent updates, which can lead to tracepoint probes
being registered more than once (and thus trigger warnings within
tracepoint.c).
[ May be the fix for this bug ]
Link: https://lore.kernel.org/r/000000000000ab6f84056c786b93@google.com
Link: http://lkml.kernel.org/r/20190817141208.15226-1-mathieu.desnoyers@efficios.com
Cc: stable@vger.kernel.org
CC: Steven Rostedt (VMware) <rostedt@goodmis.org>
CC: Joel Fernandes (Google) <joel@joelfernandes.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul E. McKenney <paulmck@linux.ibm.com>
Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Move external function declarations into kernel/trace/trace.h
from trace_boot.c for tracing subsystem internal use.
Link: http://lkml.kernel.org/r/158029060405.12381.11944554430359702545.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The 'hist:' prefix gets stripped from the command text during command
processing, but should be added back when displaying the command
during error processing.
Not only because it's what should be displayed but also because not
having it means the test cases fail because the caret is miscalculated
by the length of the prefix string.
Link: http://lkml.kernel.org/r/449df721f560042e22382f67574bcc5b4d830d3d.1561743018.git.zanussi@kernel.org
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In the process of adding better error messages for sorting, I realized
that strsep was being used incorrectly and some of the error paths I
was expecting to be hit weren't and just fell through to the common
invalid key error case.
It also became obvious that for keyword assignments, it wasn't
necessary to save the full assignment and reparse it later, and having
a common empty-assignment check would also make more sense in terms of
error processing.
Change the code to fix these problems and simplify it for new error
message changes in a subsequent patch.
Link: http://lkml.kernel.org/r/1c3ef0b6655deaf345f6faee2584a0298ac2d743.1561743018.git.zanussi@kernel.org
Fixes: e62347d245 ("tracing: Add hist trigger support for user-defined sorting ('sort=' param)")
Fixes: 7ef224d1d0 ("tracing: Add 'hist' event trigger command")
Fixes: a4072fe85b ("tracing: Add a clock attribute for hist triggers")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
or user space. But the creating of the event format file only checks for
"string" to display string formats. "ustring" must also be handled.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXi8JRxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qvaiAP943Srl0C1NHuKtJGHpYkgHJRt4mPFO
569Wx82a2ODH4AEA/D8uda0+p0wJB/uDnd/VyhTeb1nAjqzhx4pfGPNjaw8=
=lu3h
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Kprobe events added 'ustring' to distinguish reading strings from
kernel space or user space.
But the creating of the event format file only checks for 'string' to
display string formats. 'ustring' must also be handled"
* tag 'trace-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Have uname use __get_str() in print_fmt
Pull networking updates from David Miller:
1) Add WireGuard
2) Add HE and TWT support to ath11k driver, from John Crispin.
3) Add ESP in TCP encapsulation support, from Sabrina Dubroca.
4) Add variable window congestion control to TIPC, from Jon Maloy.
5) Add BCM84881 PHY driver, from Russell King.
6) Start adding netlink support for ethtool operations, from Michal
Kubecek.
7) Add XDP drop and TX action support to ena driver, from Sameeh
Jubran.
8) Add new ipv4 route notifications so that mlxsw driver does not have
to handle identical routes itself. From Ido Schimmel.
9) Add BPF dynamic program extensions, from Alexei Starovoitov.
10) Support RX and TX timestamping in igc, from Vinicius Costa Gomes.
11) Add support for macsec HW offloading, from Antoine Tenart.
12) Add initial support for MPTCP protocol, from Christoph Paasch,
Matthieu Baerts, Florian Westphal, Peter Krystad, and many others.
13) Add Octeontx2 PF support, from Sunil Goutham, Geetha sowjanya, Linu
Cherian, and others.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1469 commits)
net: phy: add default ARCH_BCM_IPROC for MDIO_BCM_IPROC
udp: segment looped gso packets correctly
netem: change mailing list
qed: FW 8.42.2.0 debug features
qed: rt init valid initialization changed
qed: Debug feature: ilt and mdump
qed: FW 8.42.2.0 Add fw overlay feature
qed: FW 8.42.2.0 HSI changes
qed: FW 8.42.2.0 iscsi/fcoe changes
qed: Add abstraction for different hsi values per chip
qed: FW 8.42.2.0 Additional ll2 type
qed: Use dmae to write to widebus registers in fw_funcs
qed: FW 8.42.2.0 Parser offsets modified
qed: FW 8.42.2.0 Queue Manager changes
qed: FW 8.42.2.0 Expose new registers and change windows
qed: FW 8.42.2.0 Internal ram offsets modifications
MAINTAINERS: Add entry for Marvell OcteonTX2 Physical Function driver
Documentation: net: octeontx2: Add RVU HW and drivers overview
octeontx2-pf: ethtool RSS config support
octeontx2-pf: Add basic ethtool support
...
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Ftrace is one of the last W^X violators (after this only KLP is
left). These patches move it over to the generic text_poke()
interface and thereby get rid of this oddity. This requires a
surprising amount of surgery, by Peter Zijlstra.
- x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
count certain types of events that have a special, quirky hw ABI
(by Kim Phillips)
- kprobes fixes by Masami Hiramatsu
Lots of tooling updates as well, the following subcommands were
updated: annotate/report/top, c2c, clang, record, report/top TUI,
sched timehist, tests; plus updates were done to the gtk ui, libperf,
headers and the parser"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
perf/x86/amd: Add support for Large Increment per Cycle Events
perf/x86/amd: Constrain Large Increment per Cycle events
perf/x86/intel/rapl: Add Comet Lake support
tracing: Initialize ret in syscall_enter_define_fields()
perf header: Use last modification time for timestamp
perf c2c: Fix return type for histogram sorting comparision functions
perf beauty sockaddr: Fix augmented syscall format warning
perf/ui/gtk: Fix gtk2 build
perf ui gtk: Add missing zalloc object
perf tools: Use %define api.pure full instead of %pure-parser
libperf: Setup initial evlist::all_cpus value
perf report: Fix no libunwind compiled warning break s390 issue
perf tools: Support --prefix/--prefix-strip
perf report: Clarify in help that --children is default
tools build: Fix test-clang.cpp with Clang 8+
perf clang: Fix build with Clang 9
kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
tools lib: Fix builds when glibc contains strlcpy()
perf report/top: Make 'e' visible in the help and make it toggle showing callchains
perf report/top: Do not offer annotation for symbols without samples
...
Thomas Richter reported:
> Test case 66 'Use vfs_getname probe to get syscall args filenames'
> is broken on s390, but works on x86. The test case fails with:
>
> [root@m35lp76 perf]# perf test -F 66
> 66: Use vfs_getname probe to get syscall args filenames
> :Recording open file:
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.004 MB /tmp/__perf_test.perf.data.TCdYj\
> (20 samples) ]
> Looking at perf.data file for vfs_getname records for the file we touched:
> FAILED!
> [root@m35lp76 perf]#
The root cause was the print_fmt of the kprobe event that referenced the
"ustring"
> Setting up the kprobe event using perf command:
>
> # ./perf probe "vfs_getname=getname_flags:72 pathname=filename:ustring"
>
> generates this format file:
> [root@m35lp76 perf]# cat /sys/kernel/debug/tracing/events/probe/\
> vfs_getname/format
> name: vfs_getname
> ID: 1172
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
>
> field:unsigned long __probe_ip; offset:8; size:8; signed:0;
> field:__data_loc char[] pathname; offset:16; size:4; signed:1;
>
> print fmt: "(%lx) pathname=\"%s\"", REC->__probe_ip, REC->pathname
Instead of using "__get_str(pathname)" it referenced it directly.
Link: http://lkml.kernel.org/r/20200124100742.4050c15e@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 88903c4643 ("tracing/probe: Add ustring type for user-space string")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As warnings can trigger panics, especially when "panic_on_warn" is set,
memory failure warnings can cause panics and fail fuzz testers that are
stressing memory.
Create a MEM_FAIL() macro to use instead of WARN() in the tracing code
(perhaps this should be a kernel wide macro?), and use that for memory
failure issues. This should stop failing fuzz tests due to warnings.
Link: https://lore.kernel.org/r/CACT4Y+ZP-7np20GVRu3p+eZys9GPtbu+JpfV+HtsufAzvTgJrg@mail.gmail.com
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The trace_array_get_by_name() creates a ftrace instance and
trace_array_put() is used to remove the reference. Even though the
trace_array_get_by_name() creates the instance, it also adds a reference
count to it, that prevents user space from removing it.
As the bootconfig just creates the instance on boot up, it should still be
used where it can be deleted by user space after boot. A trace_array_put()
is required to let that happen.
Also, change the documentation on trace_array_get_by_name() to make this not
be so confusing.
Link: https://lore.kernel.org/r/20200124205927.76128804@rorschach.local.home
Fixes: 4f712a4d04 ("tracing/boot: Add instance node support")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We checked "iter->trace" earlier so there is no need to check here.
Link: http://lkml.kernel.org/r/20141122183012.GB6994@mwanda
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
I noticed when trying to use the trace-cmd python interface that reading the raw
buffer wasn't working for kernel_stack events. This is because it uses a
stubbed version of __dynamic_array that doesn't do the __data_loc trick and
encode the length of the array into the field. Instead it just shows up as a
size of 0. So change this to __array and set the len to FTRACE_STACK_ENTRIES
since this is what we actually do in practice and matches how user_stack_trace
works.
Link: http://lkml.kernel.org/r/1411589652-1318-1-git-send-email-jbacik@fb.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
tracing_stat_init() was always returning '0', even on the error paths. It
now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
to created the 'trace_stat' debugfs directory.
Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com
Fixes: ed6f1c996b ("tracing: Check return value of tracing_init_dentry()")
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Looking through old emails in my INBOX, I came across a patch from Luis
Henriques that attempted to fix a race of two stat tracers registering the
same stat trace (extremely unlikely, as this is done in the kernel, and
probably doesn't even exist). The submitted patch wasn't quite right as it
needed to deal with clean up a bit better (if two stat tracers were the
same, it would have the same files).
But to make the code cleaner, all we needed to do is to keep the
all_stat_sessions_mutex held for most of the registering function.
Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com
Fixes: 002bb86d8d ("tracing/ftrace: separate events tracing and stats tracing engine")
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-01-22
The following pull-request contains BPF updates for your *net-next* tree.
We've added 92 non-merge commits during the last 16 day(s) which contain
a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).
The main changes are:
1) function by function verification and program extensions from Alexei.
2) massive cleanup of selftests/bpf from Toke and Andrii.
3) batched bpf map operations from Brian and Yonghong.
4) tcp congestion control in bpf from Martin.
5) bulking for non-map xdp_redirect form Toke.
6) bpf_send_signal_thread helper from Yonghong.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 99c9a923e9 ("tracing/uprobe: Fix double perf_event
linking on multiprobe uprobe") moved trace_uprobe_filter on
trace_probe_event. However, since it introduced a flexible
data structure with char array and type casting, the
alignment of trace_uprobe_filter can be broken.
This changes the type of the array to trace_uprobe_filter
data strucure to fix it.
Link: http://lore.kernel.org/r/20200120124022.GA14897@hirez.programming.kicks-ass.net
Link: http://lkml.kernel.org/r/157966340499.5107.10978352478952144902.stgit@devnote2
Fixes: 99c9a923e9 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This limitation are never lunched from introduce commit 970988e19e
("tracing/kprobe: Add kprobe_event= boot parameter")
Could we remove it if no intention to implement it?
Link: http://lkml.kernel.org/r/1579586075-45132-1-git-send-email-alex.shi@linux.alibaba.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If we exit due to a bad input to trace_printk() (highly unlikely), then the
buffer variable will not be initialized when we unnest the ring buffer.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When trace_clock option is not set and unstable clcok detected,
tracing_set_default_clock() sets trace_clock(ThinkPad A285 is one of
case). In that case, if lockdown is in effect, null pointer
dereference error happens in ring_buffer_set_clock().
Link: http://lkml.kernel.org/r/20200116131236.3866925-1-masami256@gmail.com
Cc: stable@vger.kernel.org
Fixes: 17911ff38a ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1788488
Signed-off-by: Masami Ichikawa <masami256@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
While working on a tool to convert SQL syntex into the histogram language of
the kernel, I discovered the following bug:
# echo 'first u64 start_time u64 end_time pid_t pid u64 delta' >> synthetic_events
# echo 'hist:keys=pid:start=common_timestamp' > events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:delta=common_timestamp-$start,start2=$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger
Would not display any histograms in the sched_switch histogram side.
But if I were to swap the location of
"delta=common_timestamp-$start" with "start2=$start"
Such that the last line had:
# echo 'hist:keys=next_pid:start2=$start,delta=common_timestamp-$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger
The histogram works as expected.
What I found out is that the expressions clear out the value once it is
resolved. As the variables are resolved in the order listed, when
processing:
delta=common_timestamp-$start
The $start is cleared. When it gets to "start2=$start", it errors out with
"unresolved symbol" (which is silent as this happens at the location of the
trace), and the histogram is dropped.
When processing the histogram for variable references, instead of adding a
new reference for a variable used twice, use the same reference. That way,
not only is it more efficient, but the order will no longer matter in
processing of the variables.
From Tom Zanussi:
"Just to clarify some more about what the problem was is that without
your patch, we would have two separate references to the same variable,
and during resolve_var_refs(), they'd both want to be resolved
separately, so in this case, since the first reference to start wasn't
part of an expression, it wouldn't get the read-once flag set, so would
be read normally, and then the second reference would do the read-once
read and also be read but using read-once. So everything worked and
you didn't see a problem:
from: start2=$start,delta=common_timestamp-$start
In the second case, when you switched them around, the first reference
would be resolved by doing the read-once, and following that the second
reference would try to resolve and see that the variable had already
been read, so failed as unset, which caused it to short-circuit out and
not do the trigger action to generate the synthetic event:
to: delta=common_timestamp-$start,start2=$start
With your patch, we only have the single resolution which happens
correctly the one time it's resolved, so this can't happen."
Link: https://lore.kernel.org/r/20200116154216.58ca08eb@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Reviewed-by: Tom Zanuss <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If syscall_enter_define_fields() is called on a system call with no
arguments, the return code variable "ret" will never get initialized.
Initialize it to zero.
Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/0FA8C6E3-D9F5-416D-A1B0-5E4CD583A101@lca.pw
trace_printk() is used to debug the kernel which includes the tracing
infrastructure. But because it writes to the ring buffer, and so does much
of the tracing infrastructure, the ring buffer's recursive detection will
drop writes to the ring buffer that is in the same context as the current
write is happening (it allows interrupts to write when normal context is
writing, but wont let normal context write while normal context is writing).
This can cause confusion and think that the code is where the trace_printk()
exists is not hit. To solve this, up the recursive nesting of the ring
buffer when trace_printk() is called before it writes to the buffer itself.
Note, this does make it dangerous to use trace_printk() in the ring buffer
code itself, because this basically disables the recursion protection of
trace_printk() buffer writes. But as trace_printk() is only used for
debugging, and if this does occur, the developer will see the cause real
quick (recursive blowing up of the stack). Thus the developer can deal with
that. But having trace_printk() silently ignored is a much bigger problem,
and disabling recursive protection is a small price to pay to fix it.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
added helper bpf_send_signal() which permits bpf program to
send a signal to the current process. The signal may be
delivered to any threads in the process.
We found a use case where sending the signal to the current
thread is more preferable.
- A bpf program will collect the stack trace and then
send signal to the user application.
- The user application will add some thread specific
information to the just collected stack trace for
later analysis.
If bpf_send_signal() is used, user application will need
to check whether the thread receiving the signal matches
the thread collecting the stack by checking thread id.
If not, it will need to send signal to another thread
through pthread_kill().
This patch proposed a new helper bpf_send_signal_thread(),
which sends the signal to the thread corresponding to
the current kernel task. This way, user space is guaranteed that
bpf_program execution context and user space signal handling
context are the same thread.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115035002.602336-1-yhs@fb.com
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.
-----
# dmesg -c > /dev/null
# ./ftracetest test.d/trigger
...
# dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
kernel/trace/trace_events_hist.c:6070
kernel/trace/trace_events_hist.c:1760
kernel/trace/trace_events_hist.c:5911
kernel/trace/trace_events_trigger.c:504
kernel/trace/trace_events_hist.c:1810
kernel/trace/trace_events_hist.c:3158
kernel/trace/trace_events_hist.c:3105
kernel/trace/trace_events_hist.c:5518
kernel/trace/trace_events_hist.c:5998
kernel/trace/trace_events_hist.c:6019
kernel/trace/trace_events_hist.c:6044
kernel/trace/trace_events_trigger.c:1500
kernel/trace/trace_events_trigger.c:1540
kernel/trace/trace_events_trigger.c:539
kernel/trace/trace_events_trigger.c:584
-----
I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.
I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.
Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.
Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 30350d65ac ("tracing: Add variable support to hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Also fixes a couple of typos
Link: http://lkml.kernel.org/r/1401992525-10417-1-git-send-email-fabf@skynet.be
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
[ Found this deep in the abyss of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix double perf_event linking to trace_uprobe_filter on
multiple uprobe event by moving trace_uprobe_filter under
trace_probe_event.
In uprobe perf event, trace_uprobe_filter data structure is
managing target mm filters (in perf_event) related to each
uprobe event.
Since commit 60d53e2c3b ("tracing/probe: Split trace_event
related data from trace_probe") left the trace_uprobe_filter
data structure in trace_uprobe, if a trace_probe_event has
multiple trace_uprobe (multi-probe event), a perf_event is
added to different trace_uprobe_filter on each trace_uprobe.
This leads a linked list corruption.
To fix this issue, move trace_uprobe_filter to trace_probe_event
and link it once on each event instead of each probe.
Link: http://lkml.kernel.org/r/157862073931.1800.3800576241181489174.stgit@devnote2
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: =?utf-8?q?Toke_H=C3=B8iland-J?= =?utf-8?b?w7hyZ2Vuc2Vu?= <thoiland@redhat.com>
Cc: Jean-Tsung Hsiao <jhsiao@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Link: https://lkml.kernel.org/r/20200108171611.GA8472@kernel.org
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.
-----
# dmesg -c > /dev/null
# ./ftracetest test.d/trigger
...
# dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
kernel/trace/trace_events_hist.c:6070
kernel/trace/trace_events_hist.c:1760
kernel/trace/trace_events_hist.c:5911
kernel/trace/trace_events_trigger.c:504
kernel/trace/trace_events_hist.c:1810
kernel/trace/trace_events_hist.c:3158
kernel/trace/trace_events_hist.c:3105
kernel/trace/trace_events_hist.c:5518
kernel/trace/trace_events_hist.c:5998
kernel/trace/trace_events_hist.c:6019
kernel/trace/trace_events_hist.c:6044
kernel/trace/trace_events_trigger.c:1500
kernel/trace/trace_events_trigger.c:1540
kernel/trace/trace_events_trigger.c:539
kernel/trace/trace_events_trigger.c:584
-----
I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.
I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.
Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.
Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add below function-tracer filter options to boot-time tracing.
- ftrace.[instance.INSTANCE.]ftrace.filters
This will take an array of tracing function filter rules
- ftrace.[instance.INSTANCE.]ftrace.notraces
This will take an array of NON-tracing function filter rules
Link: http://lkml.kernel.org/r/157867244841.17873.10933616628243103561.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add ftrace.cpumask option support to boot-time tracing.
This sets cpumask for each instance.
- ftrace.[instance.INSTANCE.]cpumask = CPUMASK;
Set the trace cpumask. Note that the CPUMASK should be a string
which <tracefs>/tracing_cpumask can accepts.
Link: http://lkml.kernel.org/r/157867243625.17873.13613922641273149372.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add instance node support to boot-time tracing. User can set
some options and event nodes under instance node.
- ftrace.instance.INSTANCE[...]
Add new INSTANCE instance. Some options and event nodes
are acceptable for instance node.
Link: http://lkml.kernel.org/r/157867242413.17873.9814204526141500278.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add synthetic event node support to boot time tracing.
The synthetic event is a kind of event node, but the group
name is "synthetic".
- ftrace.event.synthetic.EVENT.fields = FIELD[, FIELD2...]
Defines new synthetic event with FIELDs. Each field should be
"type varname".
The synthetic node requires "fields" string arraies, which defines
the fields as same as tracing/synth_events interface.
Link: http://lkml.kernel.org/r/157867241236.17873.12411615143321557709.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add kprobe event support on event node to boot-time tracing.
If the group name of event is "kprobes", the boot-time tracing
defines new probe event according to "probes" values.
- ftrace.event.kprobes.EVENT.probes = PROBE[, PROBE2...]
Defines new kprobe event based on PROBEs. It is able to define
multiple probes on one event, but those must have same type of
arguments.
For example,
ftrace.events.kprobes.myevent {
probes = "vfs_read $arg1 $arg2";
enable;
}
This will add kprobes:myevent on vfs_read with the 1st and the 2nd
arguments.
Link: http://lkml.kernel.org/r/157867240104.17873.9712052065426433111.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add per-event settings for boottime tracing. User can set filter,
actions and enable on each event on boot. The event entries are
under ftrace.event.GROUP.EVENT node (note that the option key
includes event's group name and event name.) This supports below
configs.
- ftrace.event.GROUP.EVENT.enable
Enables GROUP:EVENT tracing.
- ftrace.event.GROUP.EVENT.filter = FILTER
Set FILTER rule to the GROUP:EVENT.
- ftrace.event.GROUP.EVENT.actions = ACTION[, ACTION2...]
Set ACTIONs to the GROUP:EVENT.
For example,
ftrace.event.sched.sched_process_exec {
filter = "pid < 128"
enable
}
this will enable tracing "sched:sched_process_exec" event
with "pid < 128" filter.
Link: http://lkml.kernel.org/r/157867238942.17873.11177628789184546198.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Setup tracing options via extra boot config in addition to kernel
command line.
This adds following commands support. These are applied to
the global trace instance.
- ftrace.options = OPT1[,OPT2...]
Enable given ftrace options.
- ftrace.trace_clock = CLOCK
Set given CLOCK to ftrace's trace_clock.
- ftrace.buffer_size = SIZE
Configure ftrace buffer size to SIZE. You can use "KB" or "MB"
for that SIZE.
- ftrace.events = EVENT[, EVENT2...]
Enable given events on boot. You can use a wild card in EVENT.
- ftrace.tracer = TRACER
Set TRACER to current tracer on boot. (e.g. function)
Note that this is NOT replacing the kernel parameters, because
this boot config based setting is later than that. If you want to
trace earlier boot events, you still need kernel parameters.
Link: http://lkml.kernel.org/r/157867237723.17873.17494943526320587488.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Make the synthetic event accepts a different type field to record.
However, the size and signed flag must be same.
Link: http://lkml.kernel.org/r/157867235358.17873.61732996461602171.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Register kprobe event to dynevent in subsys_initcall level.
This will allow kernel to register new kprobe events in
fs_initcall level via trace_run_command.
Link: http://lkml.kernel.org/r/157867234213.17873.18039000024374948737.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since kprobe-events use event_trigger_unlock_commit_regs() directly,
that events doesn't show up in printk buffer if "tp_printk" is set.
Use trace_event_buffer_commit() in kprobe events so that it can
invoke output_printk() as same as other trace events.
Link: http://lkml.kernel.org/r/157867233085.17873.5210928676787339604.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ Adjusted data var declaration placement in __kretprobe_trace_func() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Apply soft-disabled and the filter rule of the trace events to
the printk output of tracepoints (a.k.a. tp_printk kernel parameter)
as same as trace buffer output.
Link: http://lkml.kernel.org/r/157867231876.17873.15825819592284704068.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As there's two struct ring_buffers in the kernel, it causes some confusion.
The other one being the perf ring buffer. It was agreed upon that as neither
of the ring buffers are generic enough to be used globally, they should be
renamed as:
perf's ring_buffer -> perf_buffer
ftrace's ring_buffer -> trace_buffer
This implements the changes to the ring buffer that ftrace uses.
Link: https://lore.kernel.org/r/20191213140531.116b3200@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As we are working to remove the generic "ring_buffer" name that is used by
both tracing and perf, the ring_buffer name for tracing will be renamed to
trace_buffer, and perf's ring buffer will be renamed to perf_buffer.
As there already exists a trace_buffer that is used by the trace_arrays, it
needs to be first renamed to array_buffer.
Link: https://lore.kernel.org/r/20191213153553.GE20583@krava
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In the function, if register_trace_sched_migrate_task() returns error,
sched_switch/sched_wakeup_new/sched_wakeup won't unregister. That is
why fail_deprobe_sched_switch was added.
Link: http://lkml.kernel.org/r/20191231133530.2794-1-pilgrimtao@gmail.com
Cc: stable@vger.kernel.org
Fixes: 478142c39c ("tracing: do not grab lock in wakeup latency function tracing")
Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ftrace_profile->counter is unsigned long and
do_div truncates it to 32 bits, which means it can test
non-zero and be truncated to zero for division.
Fix this issue by using div64_ul() instead.
Link: http://lkml.kernel.org/r/20200103030248.14516-1-wenyang@linux.alibaba.com
Cc: stable@vger.kernel.org
Fixes: e330b3bcd8 ("tracing: Show sample std dev in function profiling")
Fixes: 34886c8bc5 ("tracing: add average time in function to function profiler")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
On some archs with some configurations, MCOUNT_INSN_SIZE is not defined, and
this makes the stack tracer fail to compile. Just define it to zero in this
case.
Link: https://lore.kernel.org/r/202001020219.zvE3vsty%lkp@intel.com
Cc: stable@vger.kernel.org
Fixes: 4df297129f ("tracing: Remove most or all of stack tracer stack size from stack_max_size")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In order to handle direct calls along side of function graph tracer, a check
is made to see if the address being traced by the function graph tracer is a
direct call or not. To get the address used by direct callers, the return
address is subtracted by MCOUNT_INSN_SIZE.
For some archs with certain configurations, MCOUNT_INSN_SIZE is undefined
here. But these should not be using direct calls anyway. Just define
MCOUNT_INSN_SIZE to zero in this case.
Link: https://lore.kernel.org/r/202001020219.zvE3vsty%lkp@intel.com
Reported-by: kbuild test robot <lkp@intel.com>
Fixes: ff205766db ("ftrace: Fix function_graph tracer interaction with BPF trampoline")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
gcc produces a variable may be uninitialized warning for "val" in
parse_entry(). This is really a false positive, but the code is subtle
enough to just initialize val to zero and it's not a fast path to worry
about it.
Marked for stable to remove the warning in the stable trees as well.
Cc: stable@vger.kernel.org
Fixes: 6c3edaf9fd ("tracing: Introduce trace event injection")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
At least on PA-RISC and s390 synthetic histogram triggers are failing
selftests because trace_event_raw_event_synth() always writes a 64 bit
values, but the reader expects a field->size sized value. On little endian
machines this doesn't hurt, but on big endian this makes the reader always
read zero values.
Link: http://lore.kernel.org/linux-trace-devel/20191218074427.96184-4-svens@linux.ibm.com
Cc: stable@vger.kernel.org
Fixes: 4b147936fa ("tracing: Add support for 'synthetic' events")
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Task T2 Task T3
trace_options_core_write() subsystem_open()
mutex_lock(trace_types_lock) mutex_lock(event_mutex)
set_tracer_flag()
trace_event_enable_tgid_record() mutex_lock(trace_types_lock)
mutex_lock(event_mutex)
This gives a circular dependency deadlock between trace_types_lock and
event_mutex. To fix this invert the usage of trace_types_lock and
event_mutex in trace_options_core_write(). This keeps the sequence of
lock usage consistent.
Link: http://lkml.kernel.org/r/0101016eef175e38-8ca71caf-a4eb-480d-a1e6-6f0bbc015495-000000@us-west-2.amazonses.com
Cc: stable@vger.kernel.org
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The compare functions of the histogram code would be specific for the size
of the value being compared (byte, short, int, long long). It would
reference the value from the array via the type of the compare, but the
value was stored in a 64 bit number. This is fine for little endian
machines, but for big endian machines, it would end up comparing zeros or
all ones (depending on the sign) for anything but 64 bit numbers.
To fix this, first derference the value as a u64 then convert it to the type
being compared.
Link: http://lkml.kernel.org/r/20191211103557.7bed6928@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Acked-by: Tom Zanussi <zanussi@kernel.org>
Reported-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Removal of code I accidentally applied when doing a minor fix up
to a patch, and then using "git commit -a --amend", which pulled
in some other changes I was playing with.
- Remove an used variable in trace_events_inject code
- Fix to function graph tracer when it traces a ftrace direct function.
It will now ignore tracing a function that has a ftrace direct
tramploine attached. This is needed for eBPF to use the ftrace direct
code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXfD/thQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoo2AP4j7ONw7BTmMyo+GdYqPPntBeDnClHK
vfMKrgK1j5BxYgEA7LgkwuUT9bcyLjfJVcyfeW67rB2PtmovKTWnKihFOwI=
=DZ6N
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Remove code I accidentally applied when doing a minor fix up to a
patch, and then using "git commit -a --amend", which pulled in some
other changes I was playing with.
- Remove an used variable in trace_events_inject code
- Fix function graph tracer when it traces a ftrace direct function.
It will now ignore tracing a function that has a ftrace direct
tramploine attached. This is needed for eBPF to use the ftrace direct
code.
* tag 'trace-v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix function_graph tracer interaction with BPF trampoline
tracing: remove set but not used variable 'buffer'
module: Remove accidental change of module_enable_x()
Depending on type of BPF programs served by BPF trampoline it can call original
function. In such case the trampoline will skip one stack frame while
returning. That will confuse function_graph tracer and will cause crashes with
bad RIP. Teach graph tracer to skip functions that have BPF trampoline attached.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
kernel/trace/trace_events_inject.c: In function trace_inject_entry:
kernel/trace/trace_events_inject.c:20:22: warning: variable buffer set but not used [-Wunused-but-set-variable]
It is never used, so remove it.
Link: http://lkml.kernel.org/r/20191207034409.25668-1-yuehaibing@huawei.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20191206092503.303d6a57@canb.auug.org.au
Cc: Linux Next Mailing List <linux-next@vger.kernel.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Fixes:
- Missing __print_hex_dump undef for processing new function in trace events
- Stop WARN_ON messages when lockdown disables tracing on boot up
Enhancement:
- Debug option to inject trace events from userspace (for rasdaemon)
The enhancement has its own config option and is non invasive. It's been
discussed for sever months and should have been added to my original
push, but I never pulled it into my queue.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXehlhRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlGgAQCszcDuNyVllj0VwWi4i+0FAQcI12Ad
W0NGZm0wObGExwD8CDR/CdHq9ulizFQjJfopG6b5Uc3Z4NNJ+QGnMxzBuwo=
=k31z
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
"Two fixes and one patch that was missed:
Fixes:
- Missing __print_hex_dump undef for processing new function in trace
events
- Stop WARN_ON messages when lockdown disables tracing on boot up
Enhancement:
- Debug option to inject trace events from userspace (for rasdaemon)"
The enhancement has its own config option and is non invasive. It's been
discussed for sever months and should have been added to my original
push, but I never pulled it into my queue.
* tag 'trace-v5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Do not create directories if lockdown is in affect
tracing: Introduce trace event injection
tracing: Fix __print_hex_dump scope
If lockdown is disabling tracing on boot up, it prevents the tracing files
from even bering created. But when that happens, there's several places that
will give a warning that the files were not created as that is usually a
sign of a bug.
Add in strategic locations where a check is made to see if tracing is
disabled by lockdown, and if it is, do not go further, and fail silently
(but print that tracing is disabled by lockdown, without doing a WARN_ON()).
Cc: Matthew Garrett <mjg59@google.com>
Fixes: 17911ff38a ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull irq updates from Ingo Molnar:
"Most of the IRQ subsystem changes in this cycle were irq-chip driver
updates:
- Qualcomm PDC wakeup interrupt support
- Layerscape external IRQ support
- Broadcom bcm7038 PM and wakeup support
- Ingenic driver cleanup and modernization
- GICv3 ITS preparation for GICv4.1 updates
- GICv4 fixes
There's also the series from Frederic Weisbecker that fixes memory
ordering bugs for the irq-work logic, whose primary fix is to turn
work->irq_work.flags into an atomic variable and then convert the
complex (and buggy) atomic_cmpxchg() loop in irq_work_claim() into a
much simpler atomic_fetch_or() call.
There are also various smaller cleanups"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
pinctrl/sdm845: Add PDC wakeup interrupt map for GPIOs
pinctrl/msm: Setup GPIO chip in hierarchy
irqchip/qcom-pdc: Add irqchip set/get state calls
irqchip/qcom-pdc: Add irqdomain for wakeup capable GPIOs
irqchip/qcom-pdc: Do not toggle IRQ_ENABLE during mask/unmask
irqchip/qcom-pdc: Update max PDC interrupts
of/irq: Document properties for wakeup interrupt parent
genirq: Introduce irq_chip_get/set_parent_state calls
irqdomain: Add bus token DOMAIN_BUS_WAKEUP
genirq: Fix function documentation of __irq_alloc_descs()
irq_work: Fix IRQ_WORK_BUSY bit clearing
irqchip/ti-sci-inta: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))
irq_work: Slightly simplify IRQ_WORK_PENDING clearing
irq_work: Fix irq_work_claim() memory ordering
irq_work: Convert flags to atomic_t
irqchip: Ingenic: Add process for more than one irq at the same time.
irqchip: ingenic: Alloc generic chips from IRQ domain
irqchip: ingenic: Get virq number from IRQ domain
irqchip: ingenic: Error out if IRQ domain creation failed
irqchip: ingenic: Drop redundant irq_suspend / irq_resume functions
...
We have been trying to use rasdaemon to monitor hardware errors like
correctable memory errors. rasdaemon uses trace events to monitor
various hardware errors. In order to test it, we have to inject some
hardware errors, unfortunately not all of them provide error
injections. MCE does provide a way to inject MCE errors, but errors
like PCI error and devlink error don't, it is not easy to add error
injection to each of them. Instead, it is relatively easier to just
allow users to inject trace events in a generic way so that all trace
events can be injected.
This patch introduces trace event injection, where a new 'inject' is
added to each tracepoint directory. Users could write into this file
with key=value pairs to specify the value of each fields of the trace
event, all unspecified fields are set to zero values by default.
For example, for the net/net_dev_queue tracepoint, we can inject:
INJECT=/sys/kernel/debug/tracing/events/net/net_dev_queue/inject
echo "" > $INJECT
echo "name='test'" > $INJECT
echo "name='test' len=1024" > $INJECT
cat /sys/kernel/debug/tracing/trace
...
<...>-614 [000] .... 36.571483: net_dev_queue: dev= skbaddr=00000000fbf338c2 len=0
<...>-614 [001] .... 136.588252: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=0
<...>-614 [001] .N.. 208.431878: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=1024
Triggers could be triggered as usual too:
echo "stacktrace if len == 1025" > /sys/kernel/debug/tracing/events/net/net_dev_queue/trigger
echo "len=1025" > $INJECT
cat /sys/kernel/debug/tracing/trace
...
bash-614 [000] .... 36.571483: net_dev_queue: dev= skbaddr=00000000fbf338c2 len=0
bash-614 [001] .... 136.588252: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=0
bash-614 [001] .N.. 208.431878: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=1024
bash-614 [001] .N.1 284.236349: <stack trace>
=> event_inject_write
=> vfs_write
=> ksys_write
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
The only thing that can't be injected is string pointers as they
require constant string pointers, this can't be done at run time.
Link: http://lkml.kernel.org/r/20191130045218.18979-1-xiyou.wangcong@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- PERAMAENT flag to ftrace_ops when attaching a callback to a function
As /proc/sys/kernel/ftrace_enabled when set to zero will disable all
attached callbacks in ftrace, this has a detrimental impact on live
kernel tracing, as it disables all that it patched. If a ftrace_ops
is registered to ftrace with the PERMANENT flag set, it will prevent
ftrace_enabled from being disabled, and if ftrace_enabled is already
disabled, it will prevent a ftrace_ops with PREMANENT flag set from
being registered.
- New register_ftrace_direct(). As eBPF would like to register its own
trampolines to be called by the ftrace nop locations directly,
without going through the ftrace trampoline, this function has been
added. This allows for eBPF trampolines to live along side of
ftrace, perf, kprobe and live patching. It also utilizes the ftrace
enabled_functions file that keeps track of functions that have been
modified in the kernel, to allow for security auditing.
- Allow for kernel internal use of ftrace instances. Subsystems in
the kernel can now create and destroy their own tracing instances
which allows them to have their own tracing buffer, and be able
to record events without worrying about other users from writing over
their data.
- New seq_buf_hex_dump() that lets users use the hex_dump() in their
seq_buf usage.
- Notifications now added to tracing_max_latency to allow user space
to know when a new max latency is hit by one of the latency tracers.
- Wider spread use of generic compare operations for use of bsearch and
friends.
- More synthetic event fields may be defined (32 up from 16)
- Use of xarray for architectures with sparse system calls, for the
system call trace events.
This along with small clean ups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXdwv4BQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnB5AP91vsdHQjwE1+/UWG/cO+qFtKvn2QJK
QmBRIJNH/s+1TAD/fAOhgw+ojSK3o/qc+NpvPTEW9AEwcJL1wacJUn+XbQc=
=ztql
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New tracing features:
- New PERMANENT flag to ftrace_ops when attaching a callback to a
function.
As /proc/sys/kernel/ftrace_enabled when set to zero will disable
all attached callbacks in ftrace, this has a detrimental impact on
live kernel tracing, as it disables all that it patched. If a
ftrace_ops is registered to ftrace with the PERMANENT flag set, it
will prevent ftrace_enabled from being disabled, and if
ftrace_enabled is already disabled, it will prevent a ftrace_ops
with PREMANENT flag set from being registered.
- New register_ftrace_direct().
As eBPF would like to register its own trampolines to be called by
the ftrace nop locations directly, without going through the ftrace
trampoline, this function has been added. This allows for eBPF
trampolines to live along side of ftrace, perf, kprobe and live
patching. It also utilizes the ftrace enabled_functions file that
keeps track of functions that have been modified in the kernel, to
allow for security auditing.
- Allow for kernel internal use of ftrace instances.
Subsystems in the kernel can now create and destroy their own
tracing instances which allows them to have their own tracing
buffer, and be able to record events without worrying about other
users from writing over their data.
- New seq_buf_hex_dump() that lets users use the hex_dump() in their
seq_buf usage.
- Notifications now added to tracing_max_latency to allow user space
to know when a new max latency is hit by one of the latency
tracers.
- Wider spread use of generic compare operations for use of bsearch
and friends.
- More synthetic event fields may be defined (32 up from 16)
- Use of xarray for architectures with sparse system calls, for the
system call trace events.
This along with small clean ups and fixes"
* tag 'trace-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (51 commits)
tracing: Enable syscall optimization for MIPS
tracing: Use xarray for syscall trace events
tracing: Sample module to demonstrate kernel access to Ftrace instances.
tracing: Adding new functions for kernel access to Ftrace instances
tracing: Fix Kconfig indentation
ring-buffer: Fix typos in function ring_buffer_producer
ftrace: Use BIT() macro
ftrace: Return ENOTSUPP when DYNAMIC_FTRACE_WITH_DIRECT_CALLS is not configured
ftrace: Rename ftrace_graph_stub to ftrace_stub_graph
ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization
ftrace: Add helper find_direct_entry() to consolidate code
ftrace: Add another check for match in register_ftrace_direct()
ftrace: Fix accounting bug with direct->count in register_ftrace_direct()
ftrace/selftests: Fix spelling mistake "wakeing" -> "waking"
tracing: Increase SYNTH_FIELDS_MAX for synthetic_events
ftrace/samples: Add a sample module that implements modify_ftrace_direct()
ftrace: Add modify_ftrace_direct()
tracing: Add missing "inline" in stub function of latency_fsnotify()
tracing: Remove stray tab in TRACE_EVAL_MAP_FILE's help text
tracing: Use seq_buf_hex_dump() to dump buffers
...
Here is the "big" set of driver core patches for 5.5-rc1
There's a few minor cleanups and fixes in here, but the majority of the
patches in here fall into two buckets:
- debugfs api cleanups and fixes
- driver core device link support for boot dependancy issues
The debugfs api cleanups are working to slowly refactor the debugfs apis
so that it is even harder to use incorrectly. That work has been
happening for the past few kernel releases and will continue over time,
it's a long-term project/goal
The driver core device link support missed 5.4 by just a bit, so it's
been sitting and baking for many months now. It's from Saravana Kannan
to help resolve the problems that DT-based systems have at boot time
with dependancy graphs and kernel modules. Turns out that no one has
actually tried to build a generic arm64 kernel with loads of modules and
have it "just work" for a variety of platforms (like a distro kernel)
The big problem turned out to be a lack of depandancy information
between different areas of DT entries, and the work here resolves that
problem and now allows devices to boot properly, and quicker than a
monolith kernel.
All of these patches have been in linux-next for a long time with no
reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXd6m6Q8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yntJQCcCqg6RQ7LTdHuZv1ETeefXlsfk00An1Jtean6
42bWGx52bGFvAcpjWy8R
=P7hq
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of driver core patches for 5.5-rc1
There's a few minor cleanups and fixes in here, but the majority of
the patches in here fall into two buckets:
- debugfs api cleanups and fixes
- driver core device link support for boot dependancy issues
The debugfs api cleanups are working to slowly refactor the debugfs
apis so that it is even harder to use incorrectly. That work has been
happening for the past few kernel releases and will continue over
time, it's a long-term project/goal
The driver core device link support missed 5.4 by just a bit, so it's
been sitting and baking for many months now. It's from Saravana Kannan
to help resolve the problems that DT-based systems have at boot time
with dependancy graphs and kernel modules. Turns out that no one has
actually tried to build a generic arm64 kernel with loads of modules
and have it "just work" for a variety of platforms (like a distro
kernel). The big problem turned out to be a lack of dependency
information between different areas of DT entries, and the work here
resolves that problem and now allows devices to boot properly, and
quicker than a monolith kernel.
All of these patches have been in linux-next for a long time with no
reported issues"
* tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (68 commits)
tracing: Remove unnecessary DEBUG_FS dependency
of: property: Add device link support for interrupt-parent, dmas and -gpio(s)
debugfs: Fix !DEBUG_FS debugfs_create_automount
of: property: Add device link support for "iommu-map"
of: property: Fix the semantics of of_is_ancestor_of()
i2c: of: Populate fwnode in of_i2c_get_board_info()
drivers: base: Fix Kconfig indentation
firmware_loader: Fix labels with comma for builtin firmware
driver core: Allow device link operations inside sync_state()
driver core: platform: Declare ret variable only once
cpu-topology: declare parse_acpi_topology in <linux/arch_topology.h>
crypto: hisilicon: no need to check return value of debugfs_create functions
driver core: platform: use the correct callback type for bus_find_device
firmware_class: make firmware caching configurable
driver core: Clarify documentation for fwnode_operations.add_links()
mailbox: tegra: Fix superfluous IRQ error message
net: caif: Fix debugfs on 64-bit platforms
mac80211: Use debugfs_create_xul() helper
media: c8sectpfe: no need to check return value of debugfs_create functions
of: property: Add device link support for iommus, mboxes and io-channels
...
Rework event_create_dir() to use an array of static data instead of
function pointers where possible.
The problem is that it would call the function pointer on module load
before parse_args(), possibly even before jump_labels were initialized.
Luckily the generated functions don't use jump_labels but it still seems
fragile. It also gets in the way of changing when we make the module map
executable.
The generated function are basically calling trace_define_field() with a
bunch of static arguments. So instead of a function, capture these
arguments in a static array, avoiding the function call.
Now there are a number of cases where the fields are dynamic (syscall
arguments, kprobes and uprobes), in which case a static array does not
work, for these we preserve the function call. Luckily all these cases
are not related to modules and so we can retain the function call for
them.
Also fix up all broken tracepoint definitions that now generate a
compile error.
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf updates from Ingo Molnar:
"The main kernel side changes in this cycle were:
- Various Intel-PT updates and optimizations (Alexander Shishkin)
- Prohibit kprobes on Xen/KVM emulate prefixes (Masami Hiramatsu)
- Add support for LSM and SELinux checks to control access to the
perf syscall (Joel Fernandes)
- Misc other changes, optimizations, fixes and cleanups - see the
shortlog for details.
There were numerous tooling changes as well - 254 non-merge commits.
Here are the main changes - too many to list in detail:
- Enhancements to core tooling infrastructure, perf.data, libperf,
libtraceevent, event parsing, vendor events, Intel PT, callchains,
BPF support and instruction decoding.
- There were updates to the following tools:
perf annotate
perf diff
perf inject
perf kvm
perf list
perf maps
perf parse
perf probe
perf record
perf report
perf script
perf stat
perf test
perf trace
- And a lot of other changes: please see the shortlog and Git log for
more details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (279 commits)
perf parse: Fix potential memory leak when handling tracepoint errors
perf probe: Fix spelling mistake "addrees" -> "address"
libtraceevent: Fix memory leakage in copy_filter_type
libtraceevent: Fix header installation
perf intel-bts: Does not support AUX area sampling
perf intel-pt: Add support for decoding AUX area samples
perf intel-pt: Add support for recording AUX area samples
perf pmu: When using default config, record which bits of config were changed by the user
perf auxtrace: Add support for queuing AUX area samples
perf session: Add facility to peek at all events
perf auxtrace: Add support for dumping AUX area samples
perf inject: Cut AUX area samples
perf record: Add aux-sample-size config term
perf record: Add support for AUX area sampling
perf auxtrace: Add support for AUX area sample recording
perf auxtrace: Move perf_evsel__find_pmu()
perf record: Add a function to test for kernel support for AUX area sampling
perf tools: Add kernel AUX area sampling definitions
perf/core: Make the mlock accounting simple again
perf report: Jump to symbol source view from total cycles view
...
Pull networking updates from David Miller:
"Another merge window, another pull full of stuff:
1) Support alternative names for network devices, from Jiri Pirko.
2) Introduce per-netns netdev notifiers, also from Jiri Pirko.
3) Support MSG_PEEK in vsock/virtio, from Matias Ezequiel Vara
Larsen.
4) Allow compiling out the TLS TOE code, from Jakub Kicinski.
5) Add several new tracepoints to the kTLS code, also from Jakub.
6) Support set channels ethtool callback in ena driver, from Sameeh
Jubran.
7) New SCTP events SCTP_ADDR_ADDED, SCTP_ADDR_REMOVED,
SCTP_ADDR_MADE_PRIM, and SCTP_SEND_FAILED_EVENT. From Xin Long.
8) Add XDP support to mvneta driver, from Lorenzo Bianconi.
9) Lots of netfilter hw offload fixes, cleanups and enhancements,
from Pablo Neira Ayuso.
10) PTP support for aquantia chips, from Egor Pomozov.
11) Add UDP segmentation offload support to igb, ixgbe, and i40e. From
Josh Hunt.
12) Add smart nagle to tipc, from Jon Maloy.
13) Support L2 field rewrite by TC offloads in bnxt_en, from Venkat
Duvvuru.
14) Add a flow mask cache to OVS, from Tonghao Zhang.
15) Add XDP support to ice driver, from Maciej Fijalkowski.
16) Add AF_XDP support to ice driver, from Krzysztof Kazimierczak.
17) Support UDP GSO offload in atlantic driver, from Igor Russkikh.
18) Support it in stmmac driver too, from Jose Abreu.
19) Support TIPC encryption and auth, from Tuong Lien.
20) Introduce BPF trampolines, from Alexei Starovoitov.
21) Make page_pool API more numa friendly, from Saeed Mahameed.
22) Introduce route hints to ipv4 and ipv6, from Paolo Abeni.
23) Add UDP segmentation offload to cxgb4, Rahul Lakkireddy"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1857 commits)
libbpf: Fix usage of u32 in userspace code
mm: Implement no-MMU variant of vmalloc_user_node_flags
slip: Fix use-after-free Read in slip_open
net: dsa: sja1105: fix sja1105_parse_rgmii_delays()
macvlan: schedule bc_work even if error
enetc: add support Credit Based Shaper(CBS) for hardware offload
net: phy: add helpers phy_(un)lock_mdio_bus
mdio_bus: don't use managed reset-controller
ax88179_178a: add ethtool_op_get_ts_info()
mlxsw: spectrum_router: Fix use of uninitialized adjacency index
mlxsw: spectrum_router: After underlay moves, demote conflicting tunnels
bpf: Simplify __bpf_arch_text_poke poke type handling
bpf: Introduce BPF_TRACE_x helper for the tracing tests
bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JIT
bpf, testing: Add various tail call test cases
bpf, x86: Emit patchable direct jump as tail call
bpf: Constant map key tracking for prog array pokes
bpf: Add poke dependency tracking for prog array maps
bpf: Add initial poke descriptor table for jit images
bpf: Move owner type, jited info into array auxiliary data
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl3bpjoACgkQUqAMR0iA
lPJJDA/+IJT4YCRp2TwV2jvIs0QzvXZrzEsxgCLibLE85mYTJgoQBD3W1bH2eyjp
T/9U0Zh5PGr/84cHd4qiMxzo+5Olz930weG59NcO4RJBSr671aRYs5tJqwaQAZDR
wlwaob5S28vUmjPxKulvxv6V3FdI79ZE9xrCOCSTQvz4iCLsGOu+Dn/qtF64pImX
M/EXzPMBrByiQ8RTM4Ege8JoBqiCZPDG9GR3KPXIXQwEeQgIoeYxwRYakxSmSzz8
W8NduFCbWavg/yHhghHikMiyOZeQzAt+V9k9WjOBTle3TGJegRhvjgI7508q3tXe
jQTMGATBOPkIgFaZz7eEn/iBa3jZUIIOzDY93RYBmd26aBvwKLOma/Vkg5oGYl0u
ZK+CMe+/xXl7brQxQ6JNsQhbSTjT+746LvLJlCvPbbPK9R0HeKNhsdKpGY3ugnmz
VAnOFIAvWUHO7qx+J+EnOo5iiPpcwXZj4AjrwVrs/x5zVhzwQ+4DSU6rbNn0O1Ak
ELrBqCQkQzh5kqK93jgMHeWQ9EOUp1Lj6PJhTeVnOx2x8tCOi6iTQFFrfdUPlZ6K
2DajgrFhti4LvwVsohZlzZuKRm5EuwReLRSOn7PU5qoSm5rcouqMkdlYG/viwyhf
mTVzEfrfemrIQOqWmzPrWEXlMj2mq8oJm4JkC+jJ/+HsfK4UU8I=
=QCEy
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:
- Allow to print symbolic error names via new %pe modifier.
- Use pr_warn() instead of the remaining pr_warning() calls. Fix
formatting of the related lines.
- Add VSPRINTF entry to MAINTAINERS.
* tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (32 commits)
checkpatch: don't warn about new vsprintf pointer extension '%pe'
MAINTAINERS: Add VSPRINTF
tools lib api: Renaming pr_warning to pr_warn
ASoC: samsung: Use pr_warn instead of pr_warning
lib: cpu_rmap: Use pr_warn instead of pr_warning
trace: Use pr_warn instead of pr_warning
dma-debug: Use pr_warn instead of pr_warning
vgacon: Use pr_warn instead of pr_warning
fs: afs: Use pr_warn instead of pr_warning
sh/intc: Use pr_warn instead of pr_warning
scsi: Use pr_warn instead of pr_warning
platform/x86: intel_oaktrail: Use pr_warn instead of pr_warning
platform/x86: asus-laptop: Use pr_warn instead of pr_warning
platform/x86: eeepc-laptop: Use pr_warn instead of pr_warning
oprofile: Use pr_warn instead of pr_warning
of: Use pr_warn instead of pr_warning
macintosh: Use pr_warn instead of pr_warning
idsn: Use pr_warn instead of pr_warning
ide: Use pr_warn instead of pr_warning
crypto: n2: Use pr_warn instead of pr_warning
...
Pull cgroup updates from Tejun Heo:
"There are several notable changes here:
- Single thread migrating itself has been optimized so that it
doesn't need threadgroup rwsem anymore.
- Freezer optimization to avoid unnecessary frozen state changes.
- cgroup ID unification so that cgroup fs ino is the only unique ID
used for the cgroup and can be used to directly look up live
cgroups through filehandle interface on 64bit ino archs. On 32bit
archs, cgroup fs ino is still the only ID in use but it is only
unique when combined with gen.
- selftest and other changes"
* 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
writeback: fix -Wformat compilation warnings
docs: cgroup: mm: Fix spelling of "list"
cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
cgroup: use cgrp->kn->id as the cgroup ID
kernfs: use 64bit inos if ino_t is 64bit
kernfs: implement custom exportfs ops and fid type
kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
kernfs: convert kernfs_node->id from union kernfs_node_id to u64
kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
netprio: use css ID instead of cgroup ID
writeback: use ino_t for inodes in tracepoints
kernfs: fix ino wrap-around detection
kselftests: cgroup: Avoid the reuse of fd after it is deallocated
cgroup: freezer: don't change task and cgroups status unnecessarily
cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
cgroup: remove cgroup_enable_task_cg_lists() optimization
cgroup: pids: use atomic64_t for pids->limit
selftests: cgroup: Run test_core under interfering stress
selftests: cgroup: Add task migration tests
...
- On ARMv8 CPUs without hardware updates of the access flag, avoid
failing cow_user_page() on PFN mappings if the pte is old. The patches
introduce an arch_faults_on_old_pte() macro, defined as false on x86.
When true, cow_user_page() makes the pte young before attempting
__copy_from_user_inatomic().
- Covert the synchronous exception handling paths in
arch/arm64/kernel/entry.S to C.
- FTRACE_WITH_REGS support for arm64.
- ZONE_DMA re-introduced on arm64 to support Raspberry Pi 4
- Several kselftest cases specific to arm64, together with a MAINTAINERS
update for these files (moved to the ARM64 PORT entry).
- Workaround for a Neoverse-N1 erratum where the CPU may fetch stale
instructions under certain conditions.
- Workaround for Cortex-A57 and A72 errata where the CPU may
speculatively execute an AT instruction and associate a VMID with the
wrong guest page tables (corrupting the TLB).
- Perf updates for arm64: additional PMU topologies on HiSilicon
platforms, support for CCN-512 interconnect, AXI ID filtering in the
IMX8 DDR PMU, support for the CCPI2 uncore PMU in ThunderX2.
- GICv3 optimisation to avoid a heavy barrier when accessing the
ICC_PMR_EL1 register.
- ELF HWCAP documentation updates and clean-up.
- SMC calling convention conduit code clean-up.
- KASLR diagnostics printed during boot
- NVIDIA Carmel CPU added to the KPTI whitelist
- Some arm64 mm clean-ups: use generic free_initrd_mem(), remove stale
macro, simplify calculation in __create_pgd_mapping(), typos.
- Kconfig clean-ups: CMDLINE_FORCE to depend on CMDLINE, choice for
endinanness to help with allmodconfig.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl3YJswACgkQa9axLQDI
XvFwYg//aTGhNLew3ADgW2TYal7LyqetRROixPBrzqHLu2A8No1+QxHMaKxpZVyf
pt25tABuLtPHql3qBzE0ltmfbLVsPj/3hULo404EJb9HLRfUnVGn7gcPkc+p4YAr
IYkYPXJbk6OlJ84vI+4vXmDEF12bWCqamC9qZ+h99qTpMjFXFO17DSJ7xQ8Xic3A
HHgCh4uA7gpTVOhLxaS6KIw+AZNYwvQxLXch2+wj6agbGX79uw9BeMhqVXdkPq8B
RTDJpOdS970WOT4cHWOkmXwsqqGRqgsgyu+bRUJ0U72+0y6MX0qSHIUnVYGmNc5q
Dtox4rryYLvkv/hbpkvjgVhv98q3J1mXt/CalChWB5dG4YwhJKN2jMiYuoAvB3WS
6dR7Dfupgai9gq1uoKgBayS2O6iFLSa4g58vt3EqUBqmM7W7viGFPdLbuVio4ycn
CNF2xZ8MZR6Wrh1JfggO7Hc11EJdSqESYfHO6V/pYB4pdpnqJLDoriYHXU7RsZrc
HvnrIvQWKMwNbqBvpNbWvK5mpBMMX2pEienA3wOqKNH7MbepVsG+npOZTVTtl9tN
FL0ePb/mKJu/2+gW8ntiqYn7EzjKprRmknOiT2FjWWo0PxgJ8lumefuhGZZbaOWt
/aTAeD7qKd/UXLKGHF/9v3q4GEYUdCFOXP94szWVPyLv+D9h8L8=
=TPL9
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Apart from the arm64-specific bits (core arch and perf, new arm64
selftests), it touches the generic cow_user_page() (reviewed by
Kirill) together with a macro for x86 to preserve the existing
behaviour on this architecture.
Summary:
- On ARMv8 CPUs without hardware updates of the access flag, avoid
failing cow_user_page() on PFN mappings if the pte is old. The
patches introduce an arch_faults_on_old_pte() macro, defined as
false on x86. When true, cow_user_page() makes the pte young before
attempting __copy_from_user_inatomic().
- Covert the synchronous exception handling paths in
arch/arm64/kernel/entry.S to C.
- FTRACE_WITH_REGS support for arm64.
- ZONE_DMA re-introduced on arm64 to support Raspberry Pi 4
- Several kselftest cases specific to arm64, together with a
MAINTAINERS update for these files (moved to the ARM64 PORT entry).
- Workaround for a Neoverse-N1 erratum where the CPU may fetch stale
instructions under certain conditions.
- Workaround for Cortex-A57 and A72 errata where the CPU may
speculatively execute an AT instruction and associate a VMID with
the wrong guest page tables (corrupting the TLB).
- Perf updates for arm64: additional PMU topologies on HiSilicon
platforms, support for CCN-512 interconnect, AXI ID filtering in
the IMX8 DDR PMU, support for the CCPI2 uncore PMU in ThunderX2.
- GICv3 optimisation to avoid a heavy barrier when accessing the
ICC_PMR_EL1 register.
- ELF HWCAP documentation updates and clean-up.
- SMC calling convention conduit code clean-up.
- KASLR diagnostics printed during boot
- NVIDIA Carmel CPU added to the KPTI whitelist
- Some arm64 mm clean-ups: use generic free_initrd_mem(), remove
stale macro, simplify calculation in __create_pgd_mapping(), typos.
- Kconfig clean-ups: CMDLINE_FORCE to depend on CMDLINE, choice for
endinanness to help with allmodconfig"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64: Kconfig: add a choice for endianness
kselftest: arm64: fix spelling mistake "contiguos" -> "contiguous"
arm64: Kconfig: make CMDLINE_FORCE depend on CMDLINE
MAINTAINERS: Add arm64 selftests to the ARM64 PORT entry
arm64: kaslr: Check command line before looking for a seed
arm64: kaslr: Announce KASLR status on boot
kselftest: arm64: fake_sigreturn_misaligned_sp
kselftest: arm64: fake_sigreturn_bad_size
kselftest: arm64: fake_sigreturn_duplicated_fpsimd
kselftest: arm64: fake_sigreturn_missing_fpsimd
kselftest: arm64: fake_sigreturn_bad_size_for_magic0
kselftest: arm64: fake_sigreturn_bad_magic
kselftest: arm64: add helper get_current_context
kselftest: arm64: extend test_init functionalities
kselftest: arm64: mangle_pstate_invalid_mode_el[123][ht]
kselftest: arm64: mangle_pstate_invalid_daif_bits
kselftest: arm64: mangle_pstate_invalid_compat_toggle and common utils
kselftest: arm64: extend toplevel skeleton Makefile
drivers/perf: hisi: update the sccl_id/ccl_id for certain HiSilicon platform
arm64: mm: reserve CMA and crashkernel in ZONE_DMA32
...
Currently, a lot of memory is wasted for architectures like MIPS when
init_ftrace_syscalls() allocates the array for syscalls using kcalloc.
This is because syscalls numbers start from 4000, 5000 or 6000 and
array elements up to that point are unused.
Fix this by using a data structure more suited to storing sparsely
populated arrays. The XARRAY data structure, implemented using radix
trees, is much more memory efficient for storing the syscalls in
question.
Link: http://lkml.kernel.org/r/20191115234314.21599-1-hnaveed@wavecomp.com
Signed-off-by: Hassan Naveed <hnaveed@wavecomp.com>
Reviewed-by: Paul Burton <paulburton@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Adding 2 new functions -
1) struct trace_array *trace_array_get_by_name(const char *name);
Return pointer to a trace array with given name. If it does not exist,
create and return pointer to the new trace array.
2) int trace_array_set_clr_event(struct trace_array *tr,
const char *system ,const char *event, bool enable);
Enable/Disable events to this trace array.
Additionally,
- To handle reference counters, export trace_array_put()
- Due to introduction of the above 2 new functions, we no longer need to
export - ftrace_set_clr_event & trace_array_create APIs.
Link: http://lkml.kernel.org/r/1574276919-11119-2-git-send-email-divya.indi@oracle.com
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Link: http://lkml.kernel.org/r/20191120133807.12741-1-krzk@kernel.org
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ftrace_graph_stub was created and points to ftrace_stub as a way to
assign the functon graph tracer function pointer to a stub function with a
different prototype than what ftrace_stub has and not trigger the C
verifier. The ftrace_graph_stub was created via the linker script
vmlinux.lds.h. Unfortunately, powerpc already uses the name
ftrace_graph_stub for its internal implementation of the function graph
tracer, and even though powerpc would still build, the change via the linker
script broke function tracer on powerpc from working.
By using the name ftrace_stub_graph, which does not exist anywhere else in
the kernel, this should not be a problem.
Link: https://lore.kernel.org/r/1573849732.5937.136.camel@lca.pw
Fixes: b83b43ffc6 ("fgraph: Fix function type mismatches of ftrace_graph_return using ftrace_stub")
Reorted-by: Qian Cai <cai@lca.pw>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If a direct ftrace callback is at a location that does not have any other
ftrace helpers attached to it, it is possible to simply just change the
text to call the new caller (if the architecture supports it). But this
requires special architecture code. Currently, modify_ftrace_direct() uses a
trick to add a stub ftrace callback to the location forcing it to call the
ftrace iterator. Then it can change the direct helper to call the new
function in C, and then remove the stub. Removing the stub will have the
location now call the new location that the direct helper is using.
The new helper function does the registering the stub trick, but is a weak
function, allowing an architecture to override it to do something a bit more
direct.
Link: https://lore.kernel.org/r/20191115215125.mbqv7taqnx376yed@ast-mbp.dhcp.thefacebook.com
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Both unregister_ftrace_direct() and modify_ftrace_direct() needs to
normalize the ip passed in to match the rec->ip, as it is acceptable to have
the ip on the ftrace call site but not the start. There are also common
validity checks with the record found by the ip, these should be done for
both unregister_ftrace_direct() and modify_ftrace_direct().
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As an instruction pointer passed into register_ftrace_direct() may just
exist on the ftrace call site, but may not be the start of the call site
itself, register_ftrace_direct() still needs to update test if a direct call
exists on the normalized site, as only one direct call is allowed at any one
time.
Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The direct->count wasn't being updated properly, where it only was updated
when the first entry was added, but should be updated every time.
Fixes: 013bf0da04 ("ftrace: Add ftrace_find_direct_func()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Increase the maximum allowed count of synthetic event fields from 16 to 32
in order to allow for larger-than-usual events.
Link: http://lkml.kernel.org/r/20191115091730.9192-1-dedekind1@gmail.com
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a new function modify_ftrace_direct() that will allow a user to update
an existing direct caller to a new trampoline, without missing hits due to
unregistering one and then adding another.
Link: https://lore.kernel.org/r/20191109022907.6zzo6orhxpt5n2sv@ast-mbp.dhcp.thefacebook.com
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The latency_fsnotify() stub when the function is not defined, was missing
the "inline".
Link: https://lore.kernel.org/r/20191115140213.74c5efe7@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There was a stray tab in the help text of the aforementioned config
option which showed like this:
The "print fmt" of the trace events will show the enum/sizeof names
instead of their values. This can cause problems for user space tools
...
in menuconfig. Remove it and end a sentence with a fullstop.
No functional changes.
Link: http://lkml.kernel.org/r/20191112174219.10933-1-bp@alien8.de
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Without this, buffers can be printed with __print_array macro that has
no formatting options and can be hard to read. The other way is to
mimic formatting capability with multiple calls of trace event with one
call per row which gives performance impact and different timestamp in
each row.
Link: http://lkml.kernel.org/r/1573130738-29390-2-git-send-email-piotrx.maziarz@linux.intel.com
Signed-off-by: Piotr Maziarz <piotrx.maziarz@linux.intel.com>
Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Check whether the non-suffixed symbol is notrace, since suffixed
symbols are generated by the compilers for optimization. Based on
these suffixed symbols, notrace check might not work because
some of them are just a partial code of the original function.
(e.g. cold-cache (unlikely) code is separated from original
function as FUNCTION.cold.XX)
For example, without this fix,
# echo p device_add.cold.67 > /sys/kernel/debug/tracing/kprobe_events
sh: write error: Invalid argument
# cat /sys/kernel/debug/tracing/error_log
[ 135.491035] trace_kprobe: error: Failed to register probe event
Command: p device_add.cold.67
^
# dmesg | tail -n 1
[ 135.488599] trace_kprobe: Could not probe notrace function device_add.cold.67
With this,
# echo p device_add.cold.66 > /sys/kernel/debug/tracing/kprobe_events
# cat /sys/kernel/debug/kprobes/list
ffffffff81599de9 k device_add.cold.66+0x0 [DISABLED]
Actually, kprobe blacklist already did similar thing,
see within_kprobe_blacklist().
Link: http://lkml.kernel.org/r/157233790394.6706.18243942030937189679.stgit@devnote2
Fixes: 45408c4f92 ("tracing: kprobes: Prohibit probing on notrace function")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fail to allocate memory for tgid_map, because it requires order-6 page.
detail as:
c3 sh: page allocation failure: order:6,
mode:0x140c0c0(GFP_KERNEL), nodemask=(null)
c3 sh cpuset=/ mems_allowed=0
c3 CPU: 3 PID: 5632 Comm: sh Tainted: G W O 4.14.133+ #10
c3 Hardware name: Generic DT based system
c3 Backtrace:
c3 [<c010bdbc>] (dump_backtrace) from [<c010c08c>](show_stack+0x18/0x1c)
c3 [<c010c074>] (show_stack) from [<c0993c54>](dump_stack+0x84/0xa4)
c3 [<c0993bd0>] (dump_stack) from [<c0229858>](warn_alloc+0xc4/0x19c)
c3 [<c0229798>] (warn_alloc) from [<c022a6e4>](__alloc_pages_nodemask+0xd18/0xf28)
c3 [<c02299cc>] (__alloc_pages_nodemask) from [<c0248344>](kmalloc_order+0x20/0x38)
c3 [<c0248324>] (kmalloc_order) from [<c0248380>](kmalloc_order_trace+0x24/0x108)
c3 [<c024835c>] (kmalloc_order_trace) from [<c01e6078>](set_tracer_flag+0xb0/0x158)
c3 [<c01e5fc8>] (set_tracer_flag) from [<c01e6404>](trace_options_core_write+0x7c/0xcc)
c3 [<c01e6388>] (trace_options_core_write) from [<c0278b1c>](__vfs_write+0x40/0x14c)
c3 [<c0278adc>] (__vfs_write) from [<c0278e10>](vfs_write+0xc4/0x198)
c3 [<c0278d4c>] (vfs_write) from [<c027906c>](SyS_write+0x6c/0xd0)
c3 [<c0279000>] (SyS_write) from [<c01079a0>](ret_fast_syscall+0x0/0x54)
Switch to use kvcalloc to avoid unexpected allocation failures.
Link: http://lkml.kernel.org/r/1571888070-24425-1-git-send-email-chunyan.zhang@unisoc.com
Signed-off-by: Yuming Han <yuming.han@unisoc.com>
Signed-off-by: Chunyan Zhang <chunyan.zhang@unisoc.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The C compiler is allowing more checks to make sure that function pointers
are assigned to the correct prototype function. Unfortunately, the function
graph tracer uses a special name with its assigned ftrace_graph_return
function pointer that maps to a stub function used by the function tracer
(ftrace_stub). The ftrace_graph_return variable is compared to the
ftrace_stub in some archs to know if the function graph tracer is enabled or
not. This means we can not just simply create a new function stub that
compares it without modifying all the archs.
Instead, have the linker script create a function_graph_stub that maps to
ftrace_stub, and this way we can define the prototype for it to match the
prototype of ftrace_graph_return, and make the compiler checks all happy!
Link: http://lkml.kernel.org/r/20191015090055.789a0aed@gandalf.local.home
Cc: linux-sh@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As part of commit f45d1225ad ("tracing: Kernel access to Ftrace
instances") we exported certain functions. Here, we are adding some additional
NULL checks to ensure safe usage by users of these APIs.
Link: http://lkml.kernel.org/r/1565805327-579-4-git-send-email-divya.indi@oracle.com
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
A trace array can be destroyed from userspace or kernel. Verify if the
trace array exists before proceeding to destroy/remove it.
Link: http://lkml.kernel.org/r/1565805327-579-3-git-send-email-divya.indi@oracle.com
Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Divya Indi <divya.indi@oracle.com>
[ Removed unneeded braces ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Declare the newly introduced and exported APIs in the header file -
include/linux/trace.h. Moving previous declarations from
kernel/trace/trace.h to include/linux/trace.h.
Link: http://lkml.kernel.org/r/1565805327-579-2-git-send-email-divya.indi@oracle.com
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The event_class_ftrace_##call and event_##call do not seem
to be used outside of trace_export.c so make them both static
to avoid a number of sparse warnings:
kernel/trace/trace_entries.h:59:1: warning: symbol 'event_class_ftrace_function' was not declared. Should it be static?
kernel/trace/trace_entries.h:59:1: warning: symbol '__event_function' was not declared. Should it be static?
kernel/trace/trace_entries.h:77:1: warning: symbol 'event_class_ftrace_funcgraph_entry' was not declared. Should it be static?
kernel/trace/trace_entries.h:77:1: warning: symbol '__event_funcgraph_entry' was not declared. Should it be static?
kernel/trace/trace_entries.h:93:1: warning: symbol 'event_class_ftrace_funcgraph_exit' was not declared. Should it be static?
kernel/trace/trace_entries.h:93:1: warning: symbol '__event_funcgraph_exit' was not declared. Should it be static?
kernel/trace/trace_entries.h:129:1: warning: symbol 'event_class_ftrace_context_switch' was not declared. Should it be static?
kernel/trace/trace_entries.h:129:1: warning: symbol '__event_context_switch' was not declared. Should it be static?
kernel/trace/trace_entries.h:149:1: warning: symbol 'event_class_ftrace_wakeup' was not declared. Should it be static?
kernel/trace/trace_entries.h:149:1: warning: symbol '__event_wakeup' was not declared. Should it be static?
kernel/trace/trace_entries.h:171:1: warning: symbol 'event_class_ftrace_kernel_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:171:1: warning: symbol '__event_kernel_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:191:1: warning: symbol 'event_class_ftrace_user_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:191:1: warning: symbol '__event_user_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:214:1: warning: symbol 'event_class_ftrace_bprint' was not declared. Should it be static?
kernel/trace/trace_entries.h:214:1: warning: symbol '__event_bprint' was not declared. Should it be static?
kernel/trace/trace_entries.h:230:1: warning: symbol 'event_class_ftrace_print' was not declared. Should it be static?
kernel/trace/trace_entries.h:230:1: warning: symbol '__event_print' was not declared. Should it be static?
kernel/trace/trace_entries.h:247:1: warning: symbol 'event_class_ftrace_raw_data' was not declared. Should it be static?
kernel/trace/trace_entries.h:247:1: warning: symbol '__event_raw_data' was not declared. Should it be static?
kernel/trace/trace_entries.h:262:1: warning: symbol 'event_class_ftrace_bputs' was not declared. Should it be static?
kernel/trace/trace_entries.h:262:1: warning: symbol '__event_bputs' was not declared. Should it be static?
kernel/trace/trace_entries.h:277:1: warning: symbol 'event_class_ftrace_mmiotrace_rw' was not declared. Should it be static?
kernel/trace/trace_entries.h:277:1: warning: symbol '__event_mmiotrace_rw' was not declared. Should it be static?
kernel/trace/trace_entries.h:298:1: warning: symbol 'event_class_ftrace_mmiotrace_map' was not declared. Should it be static?
kernel/trace/trace_entries.h:298:1: warning: symbol '__event_mmiotrace_map' was not declared. Should it be static?
kernel/trace/trace_entries.h:322:1: warning: symbol 'event_class_ftrace_branch' was not declared. Should it be static?
kernel/trace/trace_entries.h:322:1: warning: symbol '__event_branch' was not declared. Should it be static?
kernel/trace/trace_entries.h:343:1: warning: symbol 'event_class_ftrace_hwlat' was not declared. Should it be static?
kernel/trace/trace_entries.h:343:1: warning: symbol '__event_hwlat' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20191015121012.18824-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.
Add additional header output for PREEMPT_RT.
Link: http://lkml.kernel.org/r/20191015191821.11479-34-bigeasy@linutronix.de
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This burst feature enables the user to generate a burst of
preempt/irqsoff latencies. This makes it possible to test whether we
are able to detect latencies that systematically occur very close to
each other.
The maximum burst size is 10. We also create 10 identical test
functions, so that we get 10 different backtraces; this is useful
when we want to test whether we can detect all the latencies in a
burst. Otherwise, there would be no easy way of differentiating
between which latency in a burst was captured by the tracer.
In addition, there is a sysfs trigger, so that it's not necessary to
reload the module to repeat the test. The trigger will appear as
/sys/kernel/preemptirq_delay_test/trigger in sysfs.
Link: http://lkml.kernel.org/r/20191008220824.7911-3-viktor.rosendahl@gmail.com
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Viktor Rosendahl (BMW) <viktor.rosendahl@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This patch implements the feature that the tracing_max_latency file,
e.g. /sys/kernel/debug/tracing/tracing_max_latency will receive
notifications through the fsnotify framework when a new latency is
available.
One particularly interesting use of this facility is when enabling
threshold tracing, through /sys/kernel/debug/tracing/tracing_thresh,
together with the preempt/irqsoff tracers. This makes it possible to
implement a user space program that can, with equal probability,
obtain traces of latencies that occur immediately after each other in
spite of the fact that the preempt/irqsoff tracers operate in overwrite
mode.
This facility works with the hwlat, preempt/irqsoff, and wakeup
tracers.
The tracers may call the latency_fsnotify() from places such as
__schedule() or do_idle(); this makes it impossible to call
queue_work() directly without risking a deadlock. The same would
happen with a softirq, kernel thread or tasklet. For this reason we
use the irq_work mechanism to call queue_work().
This patch creates a new workqueue. The reason for doing this is that
I wanted to use the WQ_UNBOUND and WQ_HIGHPRI flags. My thinking was
that WQ_UNBOUND might help with the latency in some important cases.
If we use:
queue_work(system_highpri_wq, &tr->fsnotify_work);
then the work will (almost) always execute on the same CPU but if we are
unlucky that CPU could be too busy while there could be another CPU in
the system that would be able to process the work soon enough.
queue_work_on() could be used to queue the work on another CPU but it
seems difficult to select the right CPU.
Link: http://lkml.kernel.org/r/20191008220824.7911-2-viktor.rosendahl@gmail.com
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Viktor Rosendahl (BMW) <viktor.rosendahl@gmail.com>
[ Added max() to have one compare for max latency ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Looking for ways to shrink the size of the dyn_ftrace structure, knowing the
information about how many pages and the number of groups of those pages, is
useful in working out the best ways to save on memory.
This adds one info print on how many groups of pages were used to allocate
the ftrace dyn_ftrace structures, and also shows the number of pages and
groups in the dyn_ftrace_total_info (which is used for debugging).
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As testing for direct calls from the function graph tracer adds a little
overhead (which is a lot when tracing every function), add a counter that
can be used to test if function_graph tracer needs to test for a direct
caller or not.
It would have been nicer if we could use a static branch, but the static
branch logic fails when used within the function graph tracer trampoline.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As function_graph tracer modifies the return address to insert a trampoline
to trace the return of a function, it must be aware of a direct caller, as
when it gets called, the function's return address may not be at on the
stack where it expects. It may have to see if that return address points to
the a direct caller and adjust if it is.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add the start of the functionality to allow other trampolines to use the
ftrace mcount/fentry/nop location. This adds two new functions:
register_ftrace_direct() and unregister_ftrace_direct()
Both take two parameters: the first is the instruction address of where the
mcount/fentry/nop exists, and the second is the trampoline to have that
location called.
This will handle cases where ftrace is already used on that same location,
and will make it still work, where the registered direct called trampoline
will get called after all the registered ftrace callers are handled.
Currently, it will not allow for IP_MODIFY functions to be called at the
same locations, which include some kprobes and live kernel patching.
At this point, no architecture supports this. This is only the start of
implementing the framework.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
cgroup ID is currently allocated using a dedicated per-hierarchy idr
and used internally and exposed through tracepoints and bpf. This is
confusing because there are tracepoints and other interfaces which use
the cgroupfs ino as IDs.
The preceding changes made kn->id exposed as ino as 64bit ino on
supported archs or ino+gen (low 32bits as ino, high gen). There's no
reason for cgroup to use different IDs. The kernfs IDs are unique and
userland can easily discover them and map them back to paths using
standard file operations.
This patch replaces cgroup IDs with kernfs IDs.
* cgroup_id() is added and all cgroup ID users are converted to use it.
* kernfs_node creation is moved to earlier during cgroup init so that
cgroup_id() is available during init.
* While at it, s/cgroup/cgrp/ in psi helpers for consistency.
* Fallback ID value is changed to 1 to be consistent with root cgroup
ID.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Each kernfs_node is identified with a 64bit ID. The low 32bit is
exposed as ino and the high gen. While this already allows using inos
as keys by looking up with wildcard generation number of 0, it's
adding unnecessary complications for 64bit ino archs which can
directly use kernfs_node IDs as inos to uniquely identify each cgroup
instance.
This patch exposes IDs directly as inos on 64bit ino archs. The
conversion is mostly straight-forward.
* 32bit ino archs behave the same as before. 64bit ino archs now use
the whole 64bit ID as ino and the generation number is fixed at 1.
* 64bit inos still use the same idr allocator which gurantees that the
lower 32bits identify the current live instance uniquely and the
high 32bits are incremented whenever the low bits wrap. As the
upper 32bits are no longer used as gen and we don't wanna start ino
allocation with 33rd bit set, the initial value for highbits
allocation is changed to 0 on 64bit ino archs.
* blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify
the issuing cgroup. Userland builds FILEID_INO32_GEN fids from
these numbers to look up the cgroups. To remain compatible with the
behavior, always output (LOW32,HIGH32) which will be constructed
back to the original 64bit ID by __kernfs_fh_to_dentry().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value. I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to. Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.
This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper. Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen. This simplifies ID handling less cumbersome and will
allow using 64bit inos on supported archs.
This patch doesn't make any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>
We need to convert flags to atomic_t in order to later fix an ordering
issue on atomic_cmpxchg() failure. This will allow us to use atomic_fetch_or().
Also clarify the nature of those flags.
[ mingo: Converted two more usage site the original patch missed. ]
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191108160858.31665-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Create a new function called lookup_rec() from the functionality of
ftrace_location_range(). The difference between lookup_rec() is that it
returns the record that it finds, where as ftrace_location_range() returns
only if it found a match or not.
The lookup_rec() is static, and can be used for new functionality where
ftrace needs to find a record of a specific address.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Most of the functionality of __ftrace_hash_move() can be reused, but not all
of it. That is, __ftrace_hash_move() is used to simply make a new hash from
an existing one, using the same size as the original. Creating a dup_hash(),
where we can specify a new size will be useful when we want to create a hash
with a default size, or simply copy the old one.
Signed-off-by: Steven Rostedt (VMWare) <rostedt@goodmis.org>
Architectures may need to perform special initialization of ftrace
callsites, and today they do so by special-casing ftrace_make_nop() when
the expected branch address is MCOUNT_ADDR. In some cases (e.g. for
patchable-function-entry), we don't have an mcount-like symbol and don't
want a synthetic MCOUNT_ADDR, but we may need to perform some
initialization of callsites.
To make it possible to separate initialization from runtime
modification, and to handle cases without an mcount-like symbol, this
patch adds an optional ftrace_init_nop() function that architectures can
implement, which does not pass a branch address.
Where an architecture does not provide ftrace_init_nop(), we will fall
back to the existing behaviour of calling ftrace_make_nop() with
MCOUNT_ADDR.
At the same time, ftrace_code_disable() is renamed to
ftrace_nop_initialize() to make it clearer that it is intended to
intialize a callsite into a disabled state, and is not for disabling a
callsite that has been runtime enabled. The kerneldoc description of rec
arguments is updated to cover non-mcount callsites.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Torsten Duwe <duwe@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Livepatch uses ftrace for redirection to new patched functions. It means
that if ftrace is disabled, all live patched functions are disabled as
well. Toggling global 'ftrace_enabled' sysctl thus affect it directly.
It is not a problem per se, because only administrator can set sysctl
values, but it still may be surprising.
Introduce PERMANENT ftrace_ops flag to amend this. If the
FTRACE_OPS_FL_PERMANENT is set on any ftrace ops, the tracing cannot be
disabled by disabling ftrace_enabled. Equally, a callback with the flag
set cannot be registered if ftrace_enabled is disabled.
Link: http://lkml.kernel.org/r/20191016113316.13415-2-mbenes@suse.cz
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2019-11-02
The following pull-request contains BPF updates for your *net-next* tree.
We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 41 files changed, 1864 insertions(+), 474 deletions(-).
The main changes are:
1) Fix long standing user vs kernel access issue by introducing
bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel.
2) Accelerated xskmap lookup, from Björn and Maciej.
3) Support for automatic map pinning in libbpf, from Toke.
4) Cleanup of BTF-enabled raw tracepoints, from Alexei.
5) Various fixes to libbpf and selftests.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.
The rest were (relatively) trivial in nature.
Signed-off-by: David S. Miller <davem@davemloft.net>
The current bpf_probe_read() and bpf_probe_read_str() helpers are broken
in that they assume they can be used for probing memory access for kernel
space addresses /as well as/ user space addresses.
However, plain use of probe_kernel_read() for both cases will attempt to
always access kernel space address space given access is performed under
KERNEL_DS and some archs in-fact have overlapping address spaces where a
kernel pointer and user pointer would have the /same/ address value and
therefore accessing application memory via bpf_probe_read{,_str}() would
read garbage values.
Lets fix BPF side by making use of recently added 3d7081822f ("uaccess:
Add non-pagefault user-space read functions"). Unfortunately, the only way
to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}()
and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}()
helpers are kept as-is to retain their current behavior.
The two *_user() variants attempt the access always under USER_DS set, the
two *_kernel() variants will -EFAULT when accessing user memory if the
underlying architecture has non-overlapping address ranges, also avoiding
throwing the kernel warning via 00c42373d3 ("x86-64: add warning for
non-canonical user access address dereferences").
Fixes: a5e8c07059 ("bpf: add bpf_probe_read_str helper")
Fixes: 2541517c32 ("tracing, perf: Implement BPF programs attached to kprobes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
Convert the bpf_probe_write_user() helper to probe_user_write() such that
writes are not attempted under KERNEL_DS anymore which is buggy as kernel
and user space pointers can have overlapping addresses. Also, given we have
the access_ok() check inside probe_user_write(), the helper doesn't need
to do it twice.
Fixes: 96ae522795 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/841c461781874c07a0ee404a454c3bc0459eed30.1572649915.git.daniel@iogearbox.net
The bpf program type raw_tp together with 'expected_attach_type'
was the most appropriate api to indicate BTF-enabled raw_tp programs.
But during development it became apparent that 'expected_attach_type'
cannot be used and new 'attach_btf_id' field had to be introduced.
Which means that the information is duplicated in two fields where
one of them is ignored.
Clean it up by introducing new program type where both
'expected_attach_type' and 'attach_btf_id' fields have
specific meaning.
In the future 'expected_attach_type' will be extended
with other attach points that have similar semantics to raw_tp.
This patch is replacing BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with
prog_type = BPF_RPOG_TYPE_TRACING
expected_attach_type = BPF_TRACE_RAW_TP
attach_btf_id = btf_id of raw tracepoint inside the kernel
Future patches will add
expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT
where programs have the same input context and the same helpers,
but different attach points.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191030223212.953010-2-ast@kernel.org
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-10-27
The following pull-request contains BPF updates for your *net-next* tree.
We've added 52 non-merge commits during the last 11 day(s) which contain
a total of 65 files changed, 2604 insertions(+), 1100 deletions(-).
The main changes are:
1) Revolutionize BPF tracing by using in-kernel BTF to type check BPF
assembly code. The work here teaches BPF verifier to recognize
kfree_skb()'s first argument as 'struct sk_buff *' in tracepoints
such that verifier allows direct use of bpf_skb_event_output() helper
used in tc BPF et al (w/o probing memory access) that dumps skb data
into perf ring buffer. Also add direct loads to probe memory in order
to speed up/replace bpf_probe_read() calls, from Alexei Starovoitov.
2) Big batch of changes to improve libbpf and BPF kselftests. Besides
others: generalization of libbpf's CO-RE relocation support to now
also include field existence relocations, revamp the BPF kselftest
Makefile to add test runner concept allowing to exercise various
ways to build BPF programs, and teach bpf_object__open() and friends
to automatically derive BPF program type/expected attach type from
section names to ease their use, from Andrii Nakryiko.
3) Fix deadlock in stackmap's build-id lookup on rq_lock(), from Song Liu.
4) Allow to read BTF as raw data from bpftool. Most notable use case
is to dump /sys/kernel/btf/vmlinux through this, from Jiri Olsa.
5) Use bpf_redirect_map() helper in libbpf's AF_XDP helper prog which
manages to improve "rx_drop" performance by ~4%., from Björn Töpel.
6) Fix to restore the flow dissector after reattach BPF test and also
fix error handling in bpf_helper_defs.h generation, from Jakub Sitnicki.
7) Improve verifier's BTF ctx access for use outside of raw_tp, from
Martin KaFai Lau.
8) Improve documentation for AF_XDP with new sections and to reflect
latest features, from Magnus Karlsson.
9) Add back 'version' section parsing to libbpf for old kernels, from
John Fastabend.
10) Fix strncat bounds error in libbpf's libbpf_prog_type_by_name(),
from KP Singh.
11) Turn on -mattr=+alu32 in LLVM by default for BPF kselftests in order
to improve insn coverage for built BPF progs, from Yonghong Song.
12) Misc minor cleanups and fixes, from various others.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes a few changes to btf_ctx_access() to prepare
it for non raw_tp use case where the attach_btf_id is not
necessary a BTF_KIND_TYPEDEF.
It moves the "btf_trace_" prefix check and typedef-follow logic to a new
function "check_attach_btf_id()" which is called only once during
bpf_check(). btf_ctx_access() only operates on a BTF_KIND_FUNC_PROTO
type now. That should also be more efficient since it is done only
one instead of every-time check_ctx_access() is called.
"check_attach_btf_id()" needs to find the func_proto type from
the attach_btf_id. It needs to store the result into the
newly added prog->aux->attach_func_proto. func_proto
btf type has no name, so a proper name should be stored into
"attach_func_name" also.
v2:
- Move the "btf_trace_" check to an earlier verifier phase (Alexei)
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191025001811.1718491-1-kafai@fb.com
If CONFIG_NET is n, building fails:
kernel/trace/bpf_trace.o: In function `raw_tp_prog_func_proto':
bpf_trace.c:(.text+0x1a34): undefined reference to `bpf_skb_output_proto'
Wrap it into a #ifdef to fix this.
Fixes: a7658e1a41 ("bpf: Check types of arguments passed into helpers")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191018090344.26936-1-yuehaibing@huawei.com
In the format of synthetic events, the "gfp_t" is shown as "signed:1",
but in fact the "gfp_t" is "unsigned", should be shown as "signed:0".
The issue can be reproduced by the following commands:
echo 'memlatency u64 lat; unsigned int order; gfp_t gfp_flags; int migratetype' > /sys/kernel/debug/tracing/synthetic_events
cat /sys/kernel/debug/tracing/events/synthetic/memlatency/format
name: memlatency
ID: 2233
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:u64 lat; offset:8; size:8; signed:0;
field:unsigned int order; offset:16; size:4; signed:0;
field:gfp_t gfp_flags; offset:24; size:4; signed:1;
field:int migratetype; offset:32; size:4; signed:1;
print fmt: "lat=%llu, order=%u, gfp_flags=%x, migratetype=%d", REC->lat, REC->order, REC->gfp_flags, REC->migratetype
Link: http://lkml.kernel.org/r/20191018012034.6404-1-zhengjun.xing@linux.intel.com
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As said in commit f2c2cbcc35 ("powerpc: Use pr_warn instead of
pr_warning"), removing pr_warning so all logging messages use a
consistent <prefix>_warn style. Let's do it.
Link: http://lkml.kernel.org/r/20191018031850.48498-26-wangkefeng.wang@huawei.com
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
In current mainline, the degree of access to perf_event_open(2) system
call depends on the perf_event_paranoid sysctl. This has a number of
limitations:
1. The sysctl is only a single value. Many types of accesses are controlled
based on the single value thus making the control very limited and
coarse grained.
2. The sysctl is global, so if the sysctl is changed, then that means
all processes get access to perf_event_open(2) opening the door to
security issues.
This patch adds LSM and SELinux access checking which will be used in
Android to access perf_event_open(2) for the purposes of attaching BPF
programs to tracepoints, perf profiling and other operations from
userspace. These operations are intended for production systems.
5 new LSM hooks are added:
1. perf_event_open: This controls access during the perf_event_open(2)
syscall itself. The hook is called from all the places that the
perf_event_paranoid sysctl is checked to keep it consistent with the
systctl. The hook gets passed a 'type' argument which controls CPU,
kernel and tracepoint accesses (in this context, CPU, kernel and
tracepoint have the same semantics as the perf_event_paranoid sysctl).
Additionally, I added an 'open' type which is similar to
perf_event_paranoid sysctl == 3 patch carried in Android and several other
distros but was rejected in mainline [1] in 2016.
2. perf_event_alloc: This allocates a new security object for the event
which stores the current SID within the event. It will be useful when
the perf event's FD is passed through IPC to another process which may
try to read the FD. Appropriate security checks will limit access.
3. perf_event_free: Called when the event is closed.
4. perf_event_read: Called from the read(2) and mmap(2) syscalls for the event.
5. perf_event_write: Called from the ioctl(2) syscalls for the event.
[1] https://lwn.net/Articles/696240/
Since Peter had suggest LSM hooks in 2016 [1], I am adding his
Suggested-by tag below.
To use this patch, we set the perf_event_paranoid sysctl to -1 and then
apply selinux checking as appropriate (default deny everything, and then
add policy rules to give access to domains that need it). In the future
we can remove the perf_event_paranoid sysctl altogether.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: rostedt@goodmis.org
Cc: Yonghong Song <yhs@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: jeffv@google.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: primiano@google.com
Cc: Song Liu <songliubraving@fb.com>
Cc: rsavitski@google.com
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Matthew Garrett <matthewgarrett@google.com>
Link: https://lkml.kernel.org/r/20191014170308.70668-1-joel@joelfernandes.org
Introduce new helper that reuses existing skb perf_event output
implementation, but can be called from raw_tracepoint programs
that receive 'struct sk_buff *' as tracepoint argument or
can walk other kernel data structures to skb pointer.
In order to do that teach verifier to resolve true C types
of bpf helpers into in-kernel BTF ids.
The type of kernel pointer passed by raw tracepoint into bpf
program will be tracked by the verifier all the way until
it's passed into helper function.
For example:
kfree_skb() kernel function calls trace_kfree_skb(skb, loc);
bpf programs receives that skb pointer and may eventually
pass it into bpf_skb_output() bpf helper which in-kernel is
implemented via bpf_skb_event_output() kernel function.
Its first argument in the kernel is 'struct sk_buff *'.
The verifier makes sure that types match all the way.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-11-ast@kernel.org
libbpf analyzes bpf C program, searches in-kernel BTF for given type name
and stores it into expected_attach_type.
The kernel verifier expects this btf_id to point to something like:
typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc);
which represents signature of raw_tracepoint "kfree_skb".
Then btf_ctx_access() matches ctx+0 access in bpf program with 'skb'
and 'ctx+8' access with 'loc' arguments of "kfree_skb" tracepoint.
In first case it passes btf_id of 'struct sk_buff *' back to the verifier core
and 'void *' in second case.
Then the verifier tracks PTR_TO_BTF_ID as any other pointer type.
Like PTR_TO_SOCKET points to 'struct bpf_sock',
PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on.
PTR_TO_BTF_ID points to in-kernel structs.
If 1234 is btf_id of 'struct sk_buff' in vmlinux's BTF
then PTR_TO_BTF_ID#1234 points to one of in kernel skbs.
When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)r1 + 32)
the btf_struct_access() checks which field of 'struct sk_buff' is
at offset 32. Checks that size of access matches type definition
of the field and continues to track the dereferenced type.
If that field was a pointer to 'struct net_device' the r2's type
will be PTR_TO_BTF_ID#456. Where 456 is btf_id of 'struct net_device'
in vmlinux's BTF.
Such verifier analysis prevents "cheating" in BPF C program.
The program cannot cast arbitrary pointer to 'struct sk_buff *'
and access it. C compiler would allow type cast, of course,
but the verifier will notice type mismatch based on BPF assembly
and in-kernel BTF.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org
A customer reported the following softlockup:
[899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464]
[899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] Kernel panic - not syncing: softlockup: hung tasks
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12
[899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000
[899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00
[899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000
[899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8
[899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000
[899688.160002] tracing_read_pipe+0x336/0x3c0
[899688.160002] __vfs_read+0x26/0x140
[899688.160002] vfs_read+0x87/0x130
[899688.160002] SyS_read+0x42/0x90
[899688.160002] do_syscall_64+0x74/0x160
It caught the process in the middle of trace_access_unlock(). There is
no loop. So, it must be looping in the caller tracing_read_pipe()
via the "waitagain" label.
Crashdump analyze uncovered that iter->seq was completely zeroed
at this point, including iter->seq.seq.size. It means that
print_trace_line() was never able to print anything and
there was no forward progress.
The culprit seems to be in the code:
/* reset all but tr, trace, and overruns */
memset(&iter->seq, 0,
sizeof(struct trace_iterator) -
offsetof(struct trace_iterator, seq));
It was added by the commit 53d0aa7730 ("ftrace:
add logic to record overruns"). It was v2.6.27-rc1.
It was the time when iter->seq looked like:
struct trace_seq {
unsigned char buffer[PAGE_SIZE];
unsigned int len;
};
There was no "size" variable and zeroing was perfectly fine.
The solution is to reinitialize the structure after or without
zeroing.
Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
max_latency is intended to record the maximum ever observed hardware
latency, which may occur in either part of the loop (inner/outer). So
we need to also consider the outer-loop sample when updating
max_latency.
Link: http://lkml.kernel.org/r/157073345463.17189.18124025522664682811.stgit@srivatsa-ubuntu
Fixes: e7c15cd8a1 ("tracing: Added hardware latency tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
nmi_total_ts is supposed to record the total time spent in *all* NMIs
that occur on the given CPU during the (active portion of the)
sampling window. However, the code seems to be overwriting this
variable for each NMI, thereby only recording the time spent in the
most recent NMI. Fix it by accumulating the duration instead.
Link: http://lkml.kernel.org/r/157073343544.17189.13911783866738671133.stgit@srivatsa-ubuntu
Fixes: 7b2c862501 ("tracing: Add NMI tracing in hwlat detector")
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Added various checks on open tracefs calls to see if tracefs is in lockdown
mode, and if so, to return -EPERM.
Note, the event format files (which are basically standard on all machines)
as well as the enabled_functions file (which shows what is currently being
traced) are not lockde down. Perhaps they should be, but it seems counter
intuitive to lockdown information to help you know if the system has been
modified.
Link: http://lkml.kernel.org/r/CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, most files in the tracefs directory test if tracing_disabled is
set. If so, it should return -ENODEV. The tracing_disabled is called when
tracing is found to be broken. Originally it was done in case the ring
buffer was found to be corrupted, and we wanted to prevent reading it from
crashing the kernel. But it's also called if a tracing selftest fails on
boot. It's a one way switch. That is, once it is triggered, tracing is
disabled until reboot.
As most tracefs files can also be used by instances in the tracefs
directory, they need to be carefully done. Each instance has a trace_array
associated to it, and when the instance is removed, the trace_array is
freed. But if an instance is opened with a reference to the trace_array,
then it requires looking up the trace_array to get its ref counter (as there
could be a race with it being deleted and the open itself). Once it is
found, a reference is added to prevent the instance from being removed (and
the trace_array associated with it freed).
Combine the two checks (tracing_disabled and trace_array_get()) into a
single helper function. This will also make it easier to add lockdown to
tracefs later.
Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of having the trace events system open call open code the taking of
the trace_array descriptor (with trace_array_get()) and then calling
trace_open_generic(), have it use the tracing_open_generic_tr() that does
the combination of the two. This requires making tracing_open_generic_tr()
global.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As instances may have different tracers available, we need to look at the
trace_array descriptor that shows the list of the available tracers for the
instance. But there's a race between opening the file and an admin
deleting the instance. The trace_array_get() needs to be called before
accessing the trace_array.
Cc: stable@vger.kernel.org
Fixes: 607e2ea167 ("tracing: Set up infrastructure to allow tracers for instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ftrace set_ftrace_filter and set_ftrace_notrace files are specific for
an instance now. They need to take a reference to the instance otherwise
there could be a race between accessing the files and deleting the instance.
It wasn't until the :mod: caching where these file operations started
referencing the trace_array directly.
Cc: stable@vger.kernel.org
Fixes: 673feb9d76 ("ftrace: Add :mod: caching infrastructure to trace_array")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Fixed a buffer overflow by checking nr_args correctly in probes
- Fixed a warning that is reported by clang
- Fixed a possible memory leak in error path of filter processing
- Fixed the selftest that checks for failures, but wasn't failing
- Minor clean up on call site output of a memory trace event
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXZEP5hQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhrSAQDlws8rY/vJN4tKL1YaBTRyS5OW+1B+
LPLOxm9PBuzt0wEArVunv7iMgvRzp5spbmCqmD8Is2vSf+45KSrb10WU2wo=
=L37R
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few more tracing fixes:
- Fix a buffer overflow by checking nr_args correctly in probes
- Fix a warning that is reported by clang
- Fix a possible memory leak in error path of filter processing
- Fix the selftest that checks for failures, but wasn't failing
- Minor clean up on call site output of a memory trace event"
* tag 'trace-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
selftests/ftrace: Fix same probe error test
mm, tracing: Print symbol name for call_site in trace events
tracing: Have error path in predicate_parse() free its allocated memory
tracing: Fix clang -Wint-in-bool-context warnings in IF_ASSIGN macro
tracing/probe: Fix to check the difference of nr_args before adding probe
Pull networking fixes from David Miller:
1) Sanity check URB networking device parameters to avoid divide by
zero, from Oliver Neukum.
2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
don't work properly. Longer term this needs a better fix tho. From
Vijay Khemka.
3) Small fixes to selftests (use ping when ping6 is not present, etc.)
from David Ahern.
4) Bring back rt_uses_gateway member of struct rtable, it's semantics
were not well understood and trying to remove it broke things. From
David Ahern.
5) Move usbnet snaity checking, ignore endpoints with invalid
wMaxPacketSize. From Bjørn Mork.
6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.
7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
Alex Vesker, and Yevgeny Kliteynik
8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.
9) Fix crash when removing sch_cbs entry while offloading is enabled,
from Vinicius Costa Gomes.
10) Signedness bug fixes, generally in looking at the result given by
of_get_phy_mode() and friends. From Dan Crapenter.
11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.
12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.
13) Fix quantization code in tcp_bbr, from Kevin Yang.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
net: tap: clean up an indentation issue
nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
sk_buff: drop all skb extensions on free and skb scrubbing
tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
Documentation: Clarify trap's description
mlxsw: spectrum: Clear VLAN filters during port initialization
net: ena: clean up indentation issue
NFC: st95hf: clean up indentation issue
net: phy: micrel: add Asym Pause workaround for KSZ9021
net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
ptp: correctly disable flags on old ioctls
lib: dimlib: fix help text typos
net: dsa: microchip: Always set regmap stride to 1
nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
...
In predicate_parse, there is an error path that is not going to
out_free instead it returns directly which leads to a memory leak.
Link: http://lkml.kernel.org/r/20190920225800.3870-1-navid.emamdoost@gmail.com
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
After r372664 in clang, the IF_ASSIGN macro causes a couple hundred
warnings along the lines of:
kernel/trace/trace_output.c:1331:2: warning: converting the enum
constant to a boolean [-Wint-in-bool-context]
kernel/trace/trace.h:409:3: note: expanded from macro
'trace_assign_type'
IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry,
^
kernel/trace/trace.h:371:14: note: expanded from macro 'IF_ASSIGN'
WARN_ON(id && (entry)->type != id); \
^
264 warnings generated.
This warning can catch issues with constructs like:
if (state == A || B)
where the developer really meant:
if (state == A || state == B)
This is currently the only occurrence of the warning in the kernel
tree across defconfig, allyesconfig, allmodconfig for arm32, arm64,
and x86_64. Add the implicit '!= 0' to the WARN_ON statement to fix
the warnings and find potential issues in the future.
Link: 28b38c277a
Link: https://github.com/ClangBuiltLinux/linux/issues/686
Link: http://lkml.kernel.org/r/20190926162258.466321-1-natechancellor@gmail.com
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Steven reported that a test triggered:
==================================================================
BUG: KASAN: slab-out-of-bounds in trace_kprobe_create+0xa9e/0xe40
Read of size 8 at addr ffff8880c4f25a48 by task ftracetest/4798
CPU: 2 PID: 4798 Comm: ftracetest Not tainted 5.3.0-rc6-test+ #30
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
dump_stack+0x7c/0xc0
? trace_kprobe_create+0xa9e/0xe40
print_address_description+0x6c/0x332
? trace_kprobe_create+0xa9e/0xe40
? trace_kprobe_create+0xa9e/0xe40
__kasan_report.cold.6+0x1a/0x3b
? trace_kprobe_create+0xa9e/0xe40
kasan_report+0xe/0x12
trace_kprobe_create+0xa9e/0xe40
? print_kprobe_event+0x280/0x280
? match_held_lock+0x1b/0x240
? find_held_lock+0xac/0xd0
? fs_reclaim_release.part.112+0x5/0x20
? lock_downgrade+0x350/0x350
? kasan_unpoison_shadow+0x30/0x40
? __kasan_kmalloc.constprop.6+0xc1/0xd0
? trace_kprobe_create+0xe40/0xe40
? trace_kprobe_create+0xe40/0xe40
create_or_delete_trace_kprobe+0x2e/0x60
trace_run_command+0xc3/0xe0
? trace_panic_handler+0x20/0x20
? kasan_unpoison_shadow+0x30/0x40
trace_parse_run_command+0xdc/0x163
vfs_write+0xe1/0x240
ksys_write+0xba/0x150
? __ia32_sys_read+0x50/0x50
? tracer_hardirqs_on+0x61/0x180
? trace_hardirqs_off_caller+0x43/0x110
? mark_held_locks+0x29/0xa0
? do_syscall_64+0x14/0x260
do_syscall_64+0x68/0x260
Fix to check the difference of nr_args before adding probe
on existing probes. This also may set the error log index
bigger than the number of command parameters. In that case
it sets the error position is next to the last parameter.
Link: http://lkml.kernel.org/r/156966474783.3478.13217501608215769150.stgit@devnote2
Fixes: ca89bc071d ("tracing/kprobe: Add multi-probe per event support")
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull kernel lockdown mode from James Morris:
"This is the latest iteration of the kernel lockdown patchset, from
Matthew Garrett, David Howells and others.
From the original description:
This patchset introduces an optional kernel lockdown feature,
intended to strengthen the boundary between UID 0 and the kernel.
When enabled, various pieces of kernel functionality are restricted.
Applications that rely on low-level access to either hardware or the
kernel may cease working as a result - therefore this should not be
enabled without appropriate evaluation beforehand.
The majority of mainstream distributions have been carrying variants
of this patchset for many years now, so there's value in providing a
doesn't meet every distribution requirement, but gets us much closer
to not requiring external patches.
There are two major changes since this was last proposed for mainline:
- Separating lockdown from EFI secure boot. Background discussion is
covered here: https://lwn.net/Articles/751061/
- Implementation as an LSM, with a default stackable lockdown LSM
module. This allows the lockdown feature to be policy-driven,
rather than encoding an implicit policy within the mechanism.
The new locked_down LSM hook is provided to allow LSMs to make a
policy decision around whether kernel functionality that would allow
tampering with or examining the runtime state of the kernel should be
permitted.
The included lockdown LSM provides an implementation with a simple
policy intended for general purpose use. This policy provides a coarse
level of granularity, controllable via the kernel command line:
lockdown={integrity|confidentiality}
Enable the kernel lockdown feature. If set to integrity, kernel features
that allow userland to modify the running kernel are disabled. If set to
confidentiality, kernel features that allow userland to extract
confidential information from the kernel are also disabled.
This may also be controlled via /sys/kernel/security/lockdown and
overriden by kernel configuration.
New or existing LSMs may implement finer-grained controls of the
lockdown features. Refer to the lockdown_reason documentation in
include/linux/security.h for details.
The lockdown feature has had signficant design feedback and review
across many subsystems. This code has been in linux-next for some
weeks, with a few fixes applied along the way.
Stephen Rothwell noted that commit 9d1f8be5cf ("bpf: Restrict bpf
when kernel lockdown is in confidentiality mode") is missing a
Signed-off-by from its author. Matthew responded that he is providing
this under category (c) of the DCO"
* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
kexec: Fix file verification on S390
security: constify some arrays in lockdown LSM
lockdown: Print current->comm in restriction messages
efi: Restrict efivar_ssdt_load when the kernel is locked down
tracefs: Restrict tracefs when the kernel is locked down
debugfs: Restrict debugfs when the kernel is locked down
kexec: Allow kexec_file() with appropriate IMA policy when locked down
lockdown: Lock down perf when in confidentiality mode
bpf: Restrict bpf when kernel lockdown is in confidentiality mode
lockdown: Lock down tracing and perf kprobes when in confidentiality mode
lockdown: Lock down /proc/kcore
x86/mmiotrace: Lock down the testmmiotrace module
lockdown: Lock down module params that specify hardware parameters (eg. ioport)
lockdown: Lock down TIOCSSERIAL
lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
acpi: Disable ACPI table override if the kernel is locked down
acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
ACPI: Limit access to custom_method when the kernel is locked down
x86/msr: Restrict MSR access when the kernel is locked down
x86: Lock down IO port access when the kernel is locked down
...
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXYvAlBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlK6APsECr49j3ew/tRCnzkq0Y09w0TLYeHL
ax6aAVO1fHX0TgEAhCBwkWh8ZcoxGbu1CDOkQjJAqfTFFSu38Klv1P+3PQg=
=oSuX
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Srikar Dronamraju fixed a bug in the newmulti probe code"
* tag 'trace-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/probe: Fix same probe event argument matching
Commit fe60b0ce8e ("tracing/probe: Reject exactly same probe event")
tries to reject a event which matches an already existing probe.
However it currently continues to match arguments and rejects adding a
probe even when the arguments don't match. Fix this by only rejecting a
probe if and only if all the arguments match.
Link: http://lkml.kernel.org/r/20190924114906.14038-1-srikar@linux.vnet.ibm.com
Fixes: fe60b0ce8e ("tracing/probe: Reject exactly same probe event")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Initial support for running on a system with an Ultravisor, which is software
that runs below the hypervisor and protects guests against some attacks by
the hypervisor.
- Support for building the kernel to run as a "Secure Virtual Machine", ie. as
a guest capable of running on a system with an Ultravisor.
- Some changes to our DMA code on bare metal, to allow devices with medium
sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space.
- Support for firmware assisted crash dumps on bare metal (powernv).
- Two series fixing bugs in and refactoring our PCI EEH code.
- A large series refactoring our exception entry code to use gas macros, both
to make it more readable and also enable some future optimisations.
As well as many cleanups and other minor features & fixups.
Thanks to:
Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh
Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin
Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy,
Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens,
David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar,
Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari
Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras,
Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor,
Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram
Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj,
Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung
Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2EtEcTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPfsD/9uXyBXn3anI/H08+mk74k5gCsmMQpn
D442CD/ByogZcccp23yBTlhawtCE03hcHnCLygn0Xgd8a4YvHts/RGHUe3fPHqlG
bEyZ7jsLVz5ebNZQP7r4eGs2pSzCajwJy2N9HJ/C1ojf15rrfRxoVJtnyhE2wXpm
DL+6o2K+nUCB3gTQ1Inr3DnWzoGOOUfNTOea2u+J+yfHwGRqOBYpevwqiwy5eelK
aRjUJCqMTvrzra49MeFwjo0Nt3/Y8UNcwA+JlGdeR8bRuWhFrYmyBRiZEKPaujNO
5EAfghBBlB0KQCqvF/tRM/c0OftHqK59AMobP9T7u9oOaBXeF/FpZX/iXjzNDPsN
j9Oo2tKLTu/YVEXqBFuREGP+znANr1Wo4CFyOG8SbvYz0HFjR6XbtRJsS+0e8GWl
kqX5/ZhYz3lBnKSNe9jgWOrh/J0KCSFigBTEWJT3xsn4YE8x8kK2l9KPqAIldWEP
sKb2UjGS7v0NKq+NvShH88Q9AeQUEIjTcg/9aDDQDe6FaRQ7KiF8bUxSdwSPi+Fn
j0lnF6i+1ATWZKuCr85veVi7C5qoe/+MqalnmP7MxULyzgXLLxUgN0SzEYO6QofK
LQK/VaH2XVr5+M5YAb7K4/NX5gbM3s1bKrCiUy4EyHNvgG7gricYdbz6HgAjKpR7
oP0rHfgmVYvF1g==
=WlW+
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"This is a bit late, partly due to me travelling, and partly due to a
power outage knocking out some of my test systems *while* I was
travelling.
- Initial support for running on a system with an Ultravisor, which
is software that runs below the hypervisor and protects guests
against some attacks by the hypervisor.
- Support for building the kernel to run as a "Secure Virtual
Machine", ie. as a guest capable of running on a system with an
Ultravisor.
- Some changes to our DMA code on bare metal, to allow devices with
medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
DMA space.
- Support for firmware assisted crash dumps on bare metal (powernv).
- Two series fixing bugs in and refactoring our PCI EEH code.
- A large series refactoring our exception entry code to use gas
macros, both to make it more readable and also enable some future
optimisations.
As well as many cleanups and other minor features & fixups.
Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
Lendacky, Vasant Hegde"
* tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
powerpc/mm/mce: Keep irqs disabled during lockless page table walk
powerpc: Use ftrace_graph_ret_addr() when unwinding
powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
ftrace: Look up the address of return_to_handler() using helpers
powerpc: dump kernel log before carrying out fadump or kdump
docs: powerpc: Add missing documentation reference
powerpc/xmon: Fix output of XIVE IPI
powerpc/xmon: Improve output of XIVE interrupts
powerpc/mm/radix: remove useless kernel messages
powerpc/fadump: support holes in kernel boot memory area
powerpc/fadump: remove RMA_START and RMA_END macros
powerpc/fadump: update documentation about option to release opalcore
powerpc/fadump: consider f/w load area
powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
powerpc/fadump: improve how crashed kernel's memory is reserved
powerpc/fadump: consider reserved ranges while releasing memory
powerpc/fadump: make crash memory ranges array allocation generic
...
- Addition of multiprobes to kprobe and uprobe events
Allows for more than one probe attached to the same location
- Addition of adding immediates to probe parameters
- Clean up of the recordmcount.c code. This brings us closer
to merging recordmcount into objtool, and reuse code.
- Other small clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXYQoqhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlIxAP9VVABbpuvOYqxKuFgyP62ituSXPLkL
gZv4I5Zse4b6/gD/eksFXY/OHo7jp6aQiHvxotUkAiFFE9iHzi0JscdMJgo=
=WqrT
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Addition of multiprobes to kprobe and uprobe events (allows for more
than one probe attached to the same location)
- Addition of adding immediates to probe parameters
- Clean up of the recordmcount.c code. This brings us closer to merging
recordmcount into objtool, and reuse code.
- Other small clean ups
* tag 'trace-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
selftests/ftrace: Update kprobe event error testcase
tracing/probe: Reject exactly same probe event
tracing/probe: Fix to allow user to enable events on unloaded modules
selftests/ftrace: Select an existing function in kprobe_eventname test
tracing/kprobe: Fix NULL pointer access in trace_porbe_unlink()
tracing: Make sure variable reference alias has correct var_ref_idx
tracing: Be more clever when dumping hex in __print_hex()
ftrace: Simplify ftrace hash lookup code in clear_func_from_hash()
tracing: Add "gfp_t" support in synthetic_events
tracing: Rename tracing_reset() to tracing_reset_cpu()
tracing: Document the stack trace algorithm in the comments
tracing/arm64: Have max stack tracer handle the case of return address after data
recordmcount: Clarify what cleanup() does
recordmcount: Remove redundant cleanup() calls
recordmcount: Kernel style formatting
recordmcount: Kernel style function signature formatting
recordmcount: Rewrite error/success handling
selftests/ftrace: Add syntax error test for multiprobe
selftests/ftrace: Add syntax error test for immediates
selftests/ftrace: Add a testcase for kprobe multiprobe event
...
Reject exactly same probe events as existing probes.
Multiprobe allows user to define multiple probes on same
event. If user appends a probe which exactly same definition
(same probe address and same arguments) on existing event,
the event will record same probe information twice.
That can be confusing users, so reject it.
Link: http://lkml.kernel.org/r/156879694602.31056.5533024778165036763.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix to allow user to enable probe events on unloaded modules.
This operations was allowed before commit 60d53e2c3b ("tracing/probe:
Split trace_event related data from trace_probe"), because if users
need to probe module init functions, they have to enable those probe
events before loading module.
Link: http://lkml.kernel.org/r/156879693733.31056.9331322616994665167.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull networking updates from David Miller:
1) Support IPV6 RA Captive Portal Identifier, from Maciej Żenczykowski.
2) Use bio_vec in the networking instead of custom skb_frag_t, from
Matthew Wilcox.
3) Make use of xmit_more in r8169 driver, from Heiner Kallweit.
4) Add devmap_hash to xdp, from Toke Høiland-Jørgensen.
5) Support all variants of 5750X bnxt_en chips, from Michael Chan.
6) More RTNL avoidance work in the core and mlx5 driver, from Vlad
Buslov.
7) Add TCP syn cookies bpf helper, from Petar Penkov.
8) Add 'nettest' to selftests and use it, from David Ahern.
9) Add extack support to drop_monitor, add packet alert mode and
support for HW drops, from Ido Schimmel.
10) Add VLAN offload to stmmac, from Jose Abreu.
11) Lots of devm_platform_ioremap_resource() conversions, from
YueHaibing.
12) Add IONIC driver, from Shannon Nelson.
13) Several kTLS cleanups, from Jakub Kicinski.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1930 commits)
mlxsw: spectrum_buffers: Add the ability to query the CPU port's shared buffer
mlxsw: spectrum: Register CPU port with devlink
mlxsw: spectrum_buffers: Prevent changing CPU port's configuration
net: ena: fix incorrect update of intr_delay_resolution
net: ena: fix retrieval of nonadaptive interrupt moderation intervals
net: ena: fix update of interrupt moderation register
net: ena: remove all old adaptive rx interrupt moderation code from ena_com
net: ena: remove ena_restore_ethtool_params() and relevant fields
net: ena: remove old adaptive interrupt moderation code from ena_netdev
net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval _*()
net: ena: enable the interrupt_moderation in driver_supported_features
net: ena: reimplement set/get_coalesce()
net: ena: switch to dim algorithm for rx adaptive interrupt moderation
net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
ethtool: implement Energy Detect Powerdown support via phy-tunable
xen-netfront: do not assume sk_buff_head list is empty in error handling
s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”
net: ena: don't wake up tx queue when down
drop_monitor: Better sanitize notified packets
...
Original changelog from Steve Rostedt (except last sentence which
explains the problem, and the Fixes: tag):
I performed a three way histogram with the following commands:
echo 'irq_lat u64 lat pid_t pid' > synthetic_events
echo 'wake_lat u64 lat u64 irqlat pid_t pid' >> synthetic_events
echo 'hist:keys=common_pid:irqts=common_timestamp.usecs if function == 0xffffffff81200580' > events/timer/hrtimer_start/trigger
echo 'hist:keys=common_pid:lat=common_timestamp.usecs-$irqts:onmatch(timer.hrtimer_start).irq_lat($lat,pid) if common_flags & 1' > events/sched/sched_waking/trigger
echo 'hist:keys=pid:wakets=common_timestamp.usecs,irqlat=lat' > events/synthetic/irq_lat/trigger
echo 'hist:keys=next_pid:lat=common_timestamp.usecs-$wakets,irqlat=$irqlat:onmatch(synthetic.irq_lat).wake_lat($lat,$irqlat,next_pid)' > events/sched/sched_switch/trigger
echo 1 > events/synthetic/wake_lat/enable
Basically I wanted to see:
hrtimer_start (calling function tick_sched_timer)
Note:
# grep tick_sched_timer /proc/kallsyms
ffffffff81200580 t tick_sched_timer
And save the time of that, and then record sched_waking if it is called
in interrupt context and with the same pid as the hrtimer_start, it
will record the latency between that and the waking event.
I then look at when the task that is woken is scheduled in, and record
the latency between the wakeup and the task running.
At the end, the wake_lat synthetic event will show the wakeup to
scheduled latency, as well as the irq latency in from hritmer_start to
the wakeup. The problem is that I found this:
<idle>-0 [007] d... 190.485261: wake_lat: lat=27 irqlat=190485230 pid=698
<idle>-0 [005] d... 190.485283: wake_lat: lat=40 irqlat=190485239 pid=10
<idle>-0 [002] d... 190.488327: wake_lat: lat=56 irqlat=190488266 pid=335
<idle>-0 [005] d... 190.489330: wake_lat: lat=64 irqlat=190489262 pid=10
<idle>-0 [003] d... 190.490312: wake_lat: lat=43 irqlat=190490265 pid=77
<idle>-0 [005] d... 190.493322: wake_lat: lat=54 irqlat=190493262 pid=10
<idle>-0 [005] d... 190.497305: wake_lat: lat=35 irqlat=190497267 pid=10
<idle>-0 [005] d... 190.501319: wake_lat: lat=50 irqlat=190501264 pid=10
The irqlat seemed quite large! Investigating this further, if I had
enabled the irq_lat synthetic event, I noticed this:
<idle>-0 [002] d.s. 249.429308: irq_lat: lat=164968 pid=335
<idle>-0 [002] d... 249.429369: wake_lat: lat=55 irqlat=249429308 pid=335
Notice that the timestamp of the irq_lat "249.429308" is awfully
similar to the reported irqlat variable. In fact, all instances were
like this. It appeared that:
irqlat=$irqlat
Wasn't assigning the old $irqlat to the new irqlat variable, but
instead was assigning the $irqts to it.
The issue is that assigning the old $irqlat to the new irqlat variable
creates a variable reference alias, but the alias creation code
forgets to make sure the alias uses the same var_ref_idx to access the
reference.
Link: http://lkml.kernel.org/r/1567375321.5282.12.camel@kernel.org
Cc: Linux Trace Devel <linux-trace-devel@vger.kernel.org>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7e8b88a30b ("tracing: Add hist trigger support for variable reference aliases")
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Function ftrace_lookup_ip() will check empty hash table. So we don't
need extra check outside.
Link: http://lkml.kernel.org/r/20190910143336.13472-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull scheduler updates from Ingo Molnar:
- MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.
As perf and the scheduler is getting bigger and more complex,
document the status quo of current responsibilities and interests,
and spread the review pain^H^H^H^H fun via an increase in the Cc:
linecount generated by scripts/get_maintainer.pl. :-)
- Add another series of patches that brings the -rt (PREEMPT_RT) tree
closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
into a new CONFIG_PREEMPTION category that will allow the eventual
introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
to go though.
- Extend the CPU cgroup controller with uclamp.min and uclamp.max to
allow the finer shaping of CPU bandwidth usage.
- Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).
- Improve the behavior of high CPU count, high thread count
applications running under cpu.cfs_quota_us constraints.
- Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.
- Improve CPU isolation housekeeping CPU allocation NUMA locality.
- Fix deadline scheduler bandwidth calculations and logic when cpusets
rebuilds the topology, or when it gets deadline-throttled while it's
being offlined.
- Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
setscheduler() system calls without creating global serialization.
Add new synchronization between cpuset topology-changing events and
the deadline acceptance tests in setscheduler(), which were broken
before.
- Rework the active_mm state machine to be less confusing and more
optimal.
- Rework (simplify) the pick_next_task() slowpath.
- Improve load-balancing on AMD EPYC systems.
- ... and misc cleanups, smaller fixes and improvements - please see
the Git log for more details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
sched/psi: Correct overly pessimistic size calculation
sched/fair: Speed-up energy-aware wake-ups
sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
sched/uclamp: Update CPU's refcount on TG's clamp changes
sched/uclamp: Use TG's clamps to restrict TASK's clamps
sched/uclamp: Propagate system defaults to the root group
sched/uclamp: Propagate parent clamps
sched/uclamp: Extend CPU's cgroup controller
sched/topology: Improve load balancing on AMD EPYC systems
arch, ia64: Make NUMA select SMP
sched, perf: MAINTAINERS update, add submaintainers and reviewers
sched/fair: Use rq_lock/unlock in online_fair_sched_group
cpufreq: schedutil: fix equation in comment
sched: Rework pick_next_task() slow-path
sched: Allow put_prev_task() to drop rq->lock
sched/fair: Expose newidle_balance()
sched: Add task_struct pointer to sched_class::set_curr_task
sched: Rework CPU hotplug task selection
sched/{rt,deadline}: Fix set_next_task vs pick_next_task
sched: Fix kerneldoc comment for ia64_set_curr_task
...
Pull RCU updates from Ingo Molnar:
"This cycle's RCU changes were:
- A few more RCU flavor consolidation cleanups.
- Updates to RCU's list-traversal macros improving lockdep usability.
- Forward-progress improvements for no-CBs CPUs: Avoid ignoring
incoming callbacks during grace-period waits.
- Forward-progress improvements for no-CBs CPUs: Use ->cblist
structure to take advantage of others' grace periods.
- Also added a small commit that avoids needlessly inflicting
scheduler-clock ticks on callback-offloaded CPUs.
- Forward-progress improvements for no-CBs CPUs: Reduce contention on
->nocb_lock guarding ->cblist.
- Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
list to further reduce contention on ->nocb_lock guarding ->cblist.
- Miscellaneous fixes.
- Torture-test updates.
- minor LKMM updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
MAINTAINERS: Update from paulmck@linux.ibm.com to paulmck@kernel.org
rcu: Don't include <linux/ktime.h> in rcutiny.h
rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
rcu/nocb: Add bypass callback queueing
rcu/nocb: Atomic ->len field in rcu_segcblist structure
rcu/nocb: Unconditionally advance and wake for excessive CBs
rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
rcu/nocb: Reduce contention at no-CBs invocation-done time
rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
rcu/nocb: Round down for number of no-CBs grace-period kthreads
rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
...
Daniel Borkmann says:
====================
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add the ability to use unaligned chunks in the AF_XDP umem. By
relaxing where the chunks can be placed, it allows to use an
arbitrary buffer size and place whenever there is a free
address in the umem. Helps more seamless DPDK AF_XDP driver
integration. Support for i40e, ixgbe and mlx5e, from Kevin and
Maxim.
2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
application can wake up the kernel for rx/tx processing which
avoids busy-spinning of the latter, useful when app and driver
is located on the same core. Support for i40e, ixgbe and mlx5e,
from Magnus and Maxim.
3) bpftool fixes for printf()-like functions so compiler can actually
enforce checks, bpftool build system improvements for custom output
directories, and addition of 'bpftool map freeze' command, from Quentin.
4) Support attaching/detaching XDP programs from 'bpftool net' command,
from Daniel.
5) Automatic xskmap cleanup when AF_XDP socket is released, and several
barrier/{read,write}_once fixes in AF_XDP code, from Björn.
6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
inclusion as well as libbpf versioning improvements, from Andrii.
7) Several new BPF kselftests for verifier precision tracking, from Alexei.
8) Several BPF kselftest fixes wrt endianess to run on s390x, from Ilya.
9) And more BPF kselftest improvements all over the place, from Stanislav.
10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.
11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.
12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.
13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.
14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.
15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
Peter, Wei, Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add "gfp_t" support in synthetic_events, then the "gfp_t" type
parameter in some functions can be traced.
Prints the gfp flags as hex in addition to the human-readable flag
string. Example output:
whoopsie-630 [000] ...1 78.969452: testevent: bar=b20 (GFP_ATOMIC|__GFP_ZERO)
rcuc/0-11 [000] ...1 81.097555: testevent: bar=a20 (GFP_ATOMIC)
rcuc/0-11 [000] ...1 81.583123: testevent: bar=a20 (GFP_ATOMIC)
Link: http://lkml.kernel.org/r/20190712015308.9908-1-zhengjun.xing@linux.intel.com
Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
[ Added printing of flag names ]
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The name tracing_reset() was a misnomer, as it really only reset a single
CPU buffer. Rename it to tracing_reset_cpu() and also make it static and
remove the prototype from trace.h, as it is only used in a single function.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As the max stack tracer algorithm is not that easy to understand from the
code, add comments that explain the algorithm and mentions how
ARCH_FTRACE_SHIFT_STACK_TRACER affects it.
Link: http://lkml.kernel.org/r/20190806123455.487ac02b@gandalf.local.home
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Most archs (well at least x86) store the function call return address on the
stack before storing the local variables for the function. The max stack
tracer depends on this in its algorithm to display the stack size of each
function it finds in the back trace.
Some archs (arm64), may store the return address (from its link register)
just before calling a nested function. There's no reason to save the link
register on leaf functions, as it wont be updated. This breaks the algorithm
of the max stack tracer.
Add a new define ARCH_FTRACE_SHIFT_STACK_TRACER that an architecture may set
if it stores the return address (link register) after it stores the
function's local variables, and have the stack trace shift the values of the
mapped stack size to the appropriate functions.
Link: 20190802094103.163576-1-jiping.ma2@windriver.com
Reported-by: Jiping Ma <jiping.ma2@windriver.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add immediate string parameter (\"string") support to
probe events. This allows you to specify an immediate
(or dummy) parameter instead of fetching a string from
memory.
This feature looks odd, but imagine that you put a probe
on a code to trace some string data. If the code is
compiled into 2 instructions and 1 instruction has a
string on memory but other has no string since it is
optimized out. In that case, you can not fold those into
one event, even if ftrace supported multiple probes on
one event. With this feature, you can set a dummy string
like foo=\"(optimized)":string instead of something
like foo=+0(+0(%bp)):string.
Link: http://lkml.kernel.org/r/156095691687.28024.13372712423865047991.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add immediate value parameter (\1234) support to
probe events. This allows you to specify an immediate
(or dummy) parameter instead of fetching from memory
or register.
This feature looks odd, but imagine when you put a probe
on a code to trace some data. If the code is compiled into
2 instructions and 1 instruction has a value but other has
nothing since it is optimized out.
In that case, you can not fold those into one event, even
if ftrace supported multiple probes on one event.
With this feature, you can set a dummy value like
foo=\deadbeef instead of something like foo=%di.
Link: http://lkml.kernel.org/r/156095690733.28024.13258186548822649469.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add per-probe delete method from one event passing the head of
definition. In other words, the events which match the head
N parameters are deleted.
Link: http://lkml.kernel.org/r/156095689811.28024.221706761151739433.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Allow user to delete a probe from event. This is done by head
match. For example, if we have 2 probes on an event
$ cat kprobe_events
p:kprobes/testprobe _do_fork r1=%ax r2=%dx
p:kprobes/testprobe idle_fork r1=%ax r2=%cx
Then you can remove one of them by passing the head of definition
which identify the probe.
$ echo "-:kprobes/testprobe idle_fork" >> kprobe_events
Link: http://lkml.kernel.org/r/156095688848.28024.15798690082378432435.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Allow user to define several probes on one uprobe event.
Note that this only support appending method. So deleting
event will delete all probes on the event.
Link: http://lkml.kernel.org/r/156095687876.28024.13840331032234992863.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add multi-probe per one event support to kprobe events.
User can define several different probes on one trace event
if those events have same "event signature",
e.g.
# echo p:testevent _do_fork > kprobe_events
# echo p:testevent fork_idle >> kprobe_events
# kprobe_events
p:kprobes/testevent _do_fork
p:kprobes/testevent fork_idle
The event signature is defined by kprobe type (retprobe or not),
the number of args, argument names, and argument types.
Note that this only support appending method. Delete event
operation will delete all probes on the event.
Link: http://lkml.kernel.org/r/156095686913.28024.9357292202316540742.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pass extra arguments to match operation for checking
exact match. If the event doesn't support exact match,
it will be ignored.
Link: http://lkml.kernel.org/r/156095685930.28024.10405547027475590975.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When user gives an event name to delete, delete all
matched events instead of the first one.
This means if there are several events which have same
name but different group (subsystem) name, those are
removed if user passed only the event name, e.g.
# cat kprobe_events
p:group1/testevent _do_fork
p:group2/testevent fork_idle
# echo -:testevent >> kprobe_events
# cat kprobe_events
#
Link: http://lkml.kernel.org/r/156095684958.28024.16597826267117453638.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Split the trace_event related data from trace_probe data structure
and introduce trace_probe_event data structure for its folder.
This trace_probe_event data structure can have multiple trace_probe.
Link: http://lkml.kernel.org/r/156095683995.28024.7552150340561557873.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix the following kdoc warnings:
kernel/trace/trace.c:1579: warning: Function parameter or member 'tr' not described in 'update_max_tr_single'
kernel/trace/trace.c:1579: warning: Function parameter or member 'tsk' not described in 'update_max_tr_single'
kernel/trace/trace.c:1579: warning: Function parameter or member 'cpu' not described in 'update_max_tr_single'
kernel/trace/trace.c:1776: warning: Function parameter or member 'type' not described in 'register_tracer'
kernel/trace/trace.c:2239: warning: Function parameter or member 'task' not described in 'tracing_record_taskinfo'
kernel/trace/trace.c:2239: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo'
kernel/trace/trace.c:2269: warning: Function parameter or member 'prev' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:2269: warning: Function parameter or member 'next' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:2269: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:3078: warning: Function parameter or member 'ip' not described in 'trace_vbprintk'
kernel/trace/trace.c:3078: warning: Function parameter or member 'fmt' not described in 'trace_vbprintk'
kernel/trace/trace.c:3078: warning: Function parameter or member 'args' not described in 'trace_vbprintk'
Link: http://lkml.kernel.org/r/20190828052549.2472-2-jakub.kicinski@netronome.com
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The function ftrace_set_clr_event is declared static and marked
EXPORT_SYMBOL_GPL(), which is at best an odd combination. Because the
function was decided to be a part of API, this commit removes the static
attribute and adds the declaration to the header.
Link: http://lkml.kernel.org/r/20190704172110.27041-1-efremov@linux.com
Fixes: f45d1225ad ("tracing: Kernel access to Ftrace instances")
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In register_ftrace_function_probe(), we are not checking the return
value of alloc_and_copy_ftrace_hash(). The subsequent call to
ftrace_match_records() may end up dereferencing the same. Add a check to
ensure this doesn't happen.
Link: http://lkml.kernel.org/r/26e92574f25ad23e7cafa3cf5f7a819de1832cbe.1562249521.git.naveen.n.rao@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Fixes: 1ec3a81a0c ("ftrace: Have each function probe use its own ftrace_ops")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The race between adding a function probe and reading the probes that exist
is very subtle. It needs a comment. Also, the issue can also happen if the
probe has has the EMPTY_HASH as its func_hash.
Cc: stable@vger.kernel.org
Fixes: 7b60f3d876 ("ftrace: Dynamically create the probe ftrace_ops for the trace_array")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
LTP testsuite on powerpc results in the below crash:
Unable to handle kernel paging request for data at address 0x00000000
Faulting instruction address: 0xc00000000029d800
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA PowerNV
...
CPU: 68 PID: 96584 Comm: cat Kdump: loaded Tainted: G W
NIP: c00000000029d800 LR: c00000000029dac4 CTR: c0000000001e6ad0
REGS: c0002017fae8ba10 TRAP: 0300 Tainted: G W
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28022422 XER: 20040000
CFAR: c00000000029d90c DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
...
NIP [c00000000029d800] t_probe_next+0x60/0x180
LR [c00000000029dac4] t_mod_start+0x1a4/0x1f0
Call Trace:
[c0002017fae8bc90] [c000000000cdbc40] _cond_resched+0x10/0xb0 (unreliable)
[c0002017fae8bce0] [c0000000002a15b0] t_start+0xf0/0x1c0
[c0002017fae8bd30] [c0000000004ec2b4] seq_read+0x184/0x640
[c0002017fae8bdd0] [c0000000004a57bc] sys_read+0x10c/0x300
[c0002017fae8be30] [c00000000000b388] system_call+0x5c/0x70
The test (ftrace_set_ftrace_filter.sh) is part of ftrace stress tests
and the crash happens when the test does 'cat
$TRACING_PATH/set_ftrace_filter'.
The address points to the second line below, in t_probe_next(), where
filter_hash is dereferenced:
hash = iter->probe->ops.func_hash->filter_hash;
size = 1 << hash->size_bits;
This happens due to a race with register_ftrace_function_probe(). A new
ftrace_func_probe is created and added into the func_probes list in
trace_array under ftrace_lock. However, before initializing the filter,
we drop ftrace_lock, and re-acquire it after acquiring regex_lock. If
another process is trying to read set_ftrace_filter, it will be able to
acquire ftrace_lock during this window and it will end up seeing a NULL
filter_hash.
Fix this by just checking for a NULL filter_hash in t_probe_next(). If
the filter_hash is NULL, then this probe is just being added and we can
simply return from here.
Link: http://lkml.kernel.org/r/05e021f757625cbbb006fad41380323dbe4e3b43.1562249521.git.naveen.n.rao@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Fixes: 7b60f3d876 ("ftrace: Dynamically create the probe ftrace_ops for the trace_array")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull RCU and LKMM changes from Paul E. McKenney:
- A few more RCU flavor consolidation cleanups.
- Miscellaneous fixes.
- Updates to RCU's list-traversal macros improving lockdep usability.
- Torture-test updates.
- Forward-progress improvements for no-CBs CPUs: Avoid ignoring
incoming callbacks during grace-period waits.
- Forward-progress improvements for no-CBs CPUs: Use ->cblist
structure to take advantage of others' grace periods.
- Also added a small commit that avoids needlessly inflicting
scheduler-clock ticks on callback-offloaded CPUs.
- Forward-progress improvements for no-CBs CPUs: Reduce contention
on ->nocb_lock guarding ->cblist.
- Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
list to further reduce contention on ->nocb_lock guarding ->cblist.
- LKMM updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PERF_EVENT_IOC_SET_BPF supports uprobes since v4.3, and tracepoints
since v4.7 via commit 04a22fae4c ("tracing, perf: Implement BPF
programs attached to uprobes"), and commit 98b5c2c65c ("perf, bpf:
allow bpf programs attach to tracepoints") respectively.
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
private keys in kernel memory to be leaked. Disable them if the kernel
has been locked down in confidentiality mode.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: netdev@vger.kernel.org
cc: Chun-Yi Lee <jlee@suse.com>
cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: James Morris <jmorris@namei.org>
Disallow the creation of perf and ftrace kprobes when the kernel is
locked down in confidentiality mode by preventing their registration.
This prevents kprobes from being used to access kernel memory to steal
crypto data, but continues to allow the use of kprobes from signed
modules.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: davem@davemloft.net
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
The rcu_dereference_raw_notrace() API name is confusing. It is equivalent
to rcu_dereference_raw() except that it also does sparse pointer checking.
There are only a few users of rcu_dereference_raw_notrace(). This patches
renames all of them to be rcu_dereference_raw_check() with the "_check()"
indicating sparse checking.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Fix checkpatch warnings about parentheses. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.
Switch the conditionals in the tracer over to CONFIG_PREEMPTION.
This is the first step to make the tracer work on RT. The other small
tweaks are submitted separately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.409766323@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.
Switch the conditionals in RCU to use CONFIG_PREEMPTION.
That's the first step towards RCU on RT. The further tweaks are work in
progress. This neither touches the selftest bits which need a closer look
by Paul.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.210156346@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We already have tested it before. The second one should be removed.
With this change, the performance should have little improvement.
Link: http://lkml.kernel.org/r/20190730140850.7927-1-changbin.du@gmail.com
Cc: stable@vger.kernel.org
Fixes: 9cd2992f2d ("fgraph: Have set_graph_notrace only affect function_graph tracer")
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Returning the pointer that was passed in allows us to write
slightly more idiomatic code. Convert a few users.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Removing ULONG_MAX as the marker for the user stack trace end,
made the tracing code not know where the end is. The end is now
marked with a zero (NULL) pointer. Eiichi fixed this in the tracing
code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXTHsuRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgETAQDqRtu1KhJM6ujNlPY1aw6e9ncDAqWn
6GaumMgAdBqEcAEAxJSjr5UlzXuJsCjUjwE0txLfTscyNwljKW77h4/WNwA=
=bwtH
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Eiichi Tsukata found a small bug from the fixup of the stack code
Removing ULONG_MAX as the marker for the user stack trace end, made
the tracing code not know where the end is. The end is now marked with
a zero (NULL) pointer. Eiichi fixed this in the tracing code"
* tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix user stack trace "??" output
Commit c5c27a0a58 ("x86/stacktrace: Remove the pointless ULONG_MAX
marker") removes ULONG_MAX marker from user stack trace entries but
trace_user_stack_print() still uses the marker and it outputs unnecessary
"??".
For example:
less-1911 [001] d..2 34.758944: <user stack trace>
=> <00007f16f2295910>
=> ??
=> ??
=> ??
=> ??
=> ??
=> ??
=> ??
The user stack trace code zeroes the storage before saving the stack, so if
the trace is shorter than the maximum number of entries it can terminate
the print loop if a zero entry is detected.
Link: http://lkml.kernel.org/r/20190630085438.25545-1-devel@etsukata.com
Cc: stable@vger.kernel.org
Fixes: 4285f2fcef ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Add user space specific memory reading for kprobes
- Allow kprobes to be executed earlier in boot
The rest are mostly just various clean ups and small fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXS88txQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhaPAQDHaAmu6wXtZjZE6GU4ZP61UNgDECmZ
4wlGrNc1AAlqAQD/QC8339p37aDCp9n27VY1wmJwF3nca+jAHfQLqWkkYgw=
=n/tz
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The main changes in this release include:
- Add user space specific memory reading for kprobes
- Allow kprobes to be executed earlier in boot
The rest are mostly just various clean ups and small fixes"
* tag 'trace-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
tracing: Make trace_get_fields() global
tracing: Let filter_assign_type() detect FILTER_PTR_STRING
tracing: Pass type into tracing_generic_entry_update()
ftrace/selftest: Test if set_event/ftrace_pid exists before writing
ftrace/selftests: Return the skip code when tracing directory not configured in kernel
tracing/kprobe: Check registered state using kprobe
tracing/probe: Add trace_event_call accesses APIs
tracing/probe: Add probe event name and group name accesses APIs
tracing/probe: Add trace flag access APIs for trace_probe
tracing/probe: Add trace_event_file access APIs for trace_probe
tracing/probe: Add trace_event_call register API for trace_probe
tracing/probe: Add trace_probe init and free functions
tracing/uprobe: Set print format when parsing command
tracing/kprobe: Set print format right after parsed command
kprobes: Fix to init kprobes in subsys_initcall
tracepoint: Use struct_size() in kmalloc()
ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS
ftrace: Enable trampoline when rec count returns back to one
tracing/kprobe: Do not run kprobe boot tests if kprobe_event is on cmdline
tracing: Make a separate config for trace event self tests
...
trace_get_fields() is the only way to read tracepoint fields at
run time, as their fields are defined at compile-time with macros.
Make this function visible to all users and it will be used by
trace event injection code to calculate the size of a tracepoint
entry.
Link: http://lkml.kernel.org/r/20190525165802.25944-4-xiyou.wangcong@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
filter_assign_type() could detect dynamic string and static
string, but not string pointers. Teach filter_assign_type()
to detect string pointers, and this will be needed by trace
event injection code.
BTW, trace event hist uses FILTER_PTR_STRING too.
Link: http://lkml.kernel.org/r/20190525165802.25944-3-xiyou.wangcong@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
All callers of tracing_generic_entry_update() have to initialize
entry->type, so let's just simply move it inside.
Link: http://lkml.kernel.org/r/20190525165802.25944-2-xiyou.wangcong@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Change registered check only by trace_kprobe and remove
TP_FLAG_REGISTERED from trace_probe, since this feature
is only used for trace_kprobe.
Link: http://lkml.kernel.org/r/155931588704.28323.4952266828256245833.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add trace_event_call access APIs for trace_probe.
Instead of accessing trace_probe.call directly, use those
accesses by trace_probe_event_call() method. This hides
the relationship of trace_event_call and trace_probe from
trace_kprobe and trace_uprobe.
Link: http://lkml.kernel.org/r/155931587711.28323.8335129014686133120.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add trace_probe_name() and trace_probe_group_name() functions
for accessing probe name and group name of trace_probe.
Link: http://lkml.kernel.org/r/155931586717.28323.8738615064952254761.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add trace_probe_test/set/clear_flag() functions for accessing
trace_probe.flag field.
This flags field should not be accessed directly.
Link: http://lkml.kernel.org/r/155931585683.28323.314290023236905988.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add trace_event_file access APIs for trace_probe data structure.
This simplifies enabling/disabling operations in uprobe and kprobe
events so that those don't touch deep inside the trace_probe.
This also removing a redundant synchronization when the
kprobe event is used from perf, since the perf itself uses
tracepoint_synchronize_unregister() after disabling (ftrace-
defined) event, thus we don't have to synchronize in that
path. Also we don't need to identify local trace_kprobe too
anymore.
Link: http://lkml.kernel.org/r/155931584587.28323.372301976283354629.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since trace_event_call is a field of trace_probe, these
operations should be done in trace_probe.c. trace_kprobe
and trace_uprobe use new functions to register/unregister
trace_event_call.
Link: http://lkml.kernel.org/r/155931583643.28323.14828411185591538876.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add common trace_probe init and cleanup function in
trace_probe.c, and use it from trace_kprobe.c and trace_uprobe.c
Link: http://lkml.kernel.org/r/155931582664.28323.5934870189034740822.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Here is the "big" driver core and debugfs changes for 5.3-rc1
It's a lot of different patches, all across the tree due to some api
changes and lots of debugfs cleanups. Because of this, there is going
to be some merge issues with your tree at the moment, I'll follow up
with the expected resolutions to make it easier for you.
Other than the debugfs cleanups, in this set of changes we have:
- bus iteration function cleanups (will cause build warnings
with s390 and coresight drivers in your tree)
- scripts/get_abi.pl tool to display and parse Documentation/ABI
entries in a simple way
- cleanups to Documenatation/ABI/ entries to make them parse
easier due to typos and other minor things
- default_attrs use for some ktype users
- driver model documentation file conversions to .rst
- compressed firmware file loading
- deferred probe fixes
All of these have been in linux-next for a while, with a bunch of merge
issues that Stephen has been patient with me for. Other than the merge
issues, functionality is working properly in linux-next :)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSgpnQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykcwgCfS30OR4JmwZydWGJ7zK/cHqk+KjsAnjOxjC1K
LpRyb3zX29oChFaZkc5a
=XrEZ
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core and debugfs updates from Greg KH:
"Here is the "big" driver core and debugfs changes for 5.3-rc1
It's a lot of different patches, all across the tree due to some api
changes and lots of debugfs cleanups.
Other than the debugfs cleanups, in this set of changes we have:
- bus iteration function cleanups
- scripts/get_abi.pl tool to display and parse Documentation/ABI
entries in a simple way
- cleanups to Documenatation/ABI/ entries to make them parse easier
due to typos and other minor things
- default_attrs use for some ktype users
- driver model documentation file conversions to .rst
- compressed firmware file loading
- deferred probe fixes
All of these have been in linux-next for a while, with a bunch of
merge issues that Stephen has been patient with me for"
* tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
debugfs: make error message a bit more verbose
orangefs: fix build warning from debugfs cleanup patch
ubifs: fix build warning after debugfs cleanup patch
driver: core: Allow subsystems to continue deferring probe
drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
arch_topology: Remove error messages on out-of-memory conditions
lib: notifier-error-inject: no need to check return value of debugfs_create functions
swiotlb: no need to check return value of debugfs_create functions
ceph: no need to check return value of debugfs_create functions
sunrpc: no need to check return value of debugfs_create functions
ubifs: no need to check return value of debugfs_create functions
orangefs: no need to check return value of debugfs_create functions
nfsd: no need to check return value of debugfs_create functions
lib: 842: no need to check return value of debugfs_create functions
debugfs: provide pr_fmt() macro
debugfs: log errors when something goes wrong
drivers: s390/cio: Fix compilation warning about const qualifiers
drivers: Add generic helper to match by of_node
driver_find_device: Unify the match function with class_find_device()
bus_find_device: Unify the match callback with class_find_device
...
Pull networking updates from David Miller:
"Some highlights from this development cycle:
1) Big refactoring of ipv6 route and neigh handling to support
nexthop objects configurable as units from userspace. From David
Ahern.
2) Convert explored_states in BPF verifier into a hash table,
significantly decreased state held for programs with bpf2bpf
calls, from Alexei Starovoitov.
3) Implement bpf_send_signal() helper, from Yonghong Song.
4) Various classifier enhancements to mvpp2 driver, from Maxime
Chevallier.
5) Add aRFS support to hns3 driver, from Jian Shen.
6) Fix use after free in inet frags by allocating fqdirs dynamically
and reworking how rhashtable dismantle occurs, from Eric Dumazet.
7) Add act_ctinfo packet classifier action, from Kevin
Darbyshire-Bryant.
8) Add TFO key backup infrastructure, from Jason Baron.
9) Remove several old and unused ISDN drivers, from Arnd Bergmann.
10) Add devlink notifications for flash update status to mlxsw driver,
from Jiri Pirko.
11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.
12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.
13) Various enhancements to ipv6 flow label handling, from Eric
Dumazet and Willem de Bruijn.
14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
der Merwe, and others.
15) Various improvements to axienet driver including converting it to
phylink, from Robert Hancock.
16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.
17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
Radulescu.
18) Add devlink health reporting to mlx5, from Moshe Shemesh.
19) Convert stmmac over to phylink, from Jose Abreu.
20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
Shalom Toledo.
21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.
22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.
23) Track spill/fill of constants in BPF verifier, from Alexei
Starovoitov.
24) Support bounded loops in BPF, from Alexei Starovoitov.
25) Various page_pool API fixes and improvements, from Jesper Dangaard
Brouer.
26) Just like ipv4, support ref-countless ipv6 route handling. From
Wei Wang.
27) Support VLAN offloading in aquantia driver, from Igor Russkikh.
28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.
29) Add flower GRE encap/decap support to nfp driver, from Pieter
Jansen van Vuuren.
30) Protect against stack overflow when using act_mirred, from John
Hurley.
31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.
32) Use page_pool API in netsec driver, Ilias Apalodimas.
33) Add Google gve network driver, from Catherine Sullivan.
34) More indirect call avoidance, from Paolo Abeni.
35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.
36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.
37) Add MPLS manipulation actions to TC, from John Hurley.
38) Add sending a packet to connection tracking from TC actions, and
then allow flower classifier matching on conntrack state. From
Paul Blakey.
39) Netfilter hw offload support, from Pablo Neira Ayuso"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
net/mlx5e: Return in default case statement in tx_post_resync_params
mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
net: dsa: add support for BRIDGE_MROUTER attribute
pkt_sched: Include const.h
net: netsec: remove static declaration for netsec_set_tx_de()
net: netsec: remove superfluous if statement
netfilter: nf_tables: add hardware offload support
net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
net: flow_offload: add flow_block_cb_is_busy() and use it
net: sched: remove tcf block API
drivers: net: use flow block API
net: sched: use flow block API
net: flow_offload: add flow_block_cb_{priv, incref, decref}()
net: flow_offload: add list handling functions
net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
net: flow_offload: add flow_block_cb_setup_simple()
net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
...
Pull perf updates from Ingo Molnar:
"The main changes in this cycle on the kernel side were:
- CPU PMU and uncore driver updates to Intel Snow Ridge, IceLake,
KabyLake, AmberLake and WhiskeyLake CPUs.
- Rework the MSR probing infrastructure to make it more robust, make
it work better on virtualized systems and to better expose it on
sysfs.
- Rework PMU attributes group support based on the feedback from
Greg. The core sysfs patch that adds sysfs_update_groups() was
acked by Greg.
There's a lot of perf tooling changes as well, all around the place:
- vendor updates to Intel, cs-etm (ARM), ARM64, s390,
- various enhancements to Intel PT tooling support:
- Improve CBR (Core to Bus Ratio) packets support.
- Export power and ptwrite events to sqlite and postgresql.
- Add support for decoding PEBS via PT packets.
- Add support for samples to contain IPC ratio, collecting cycles
information from CYC packets, showing the IPC info periodically
- Allow using time ranges
- lots of updates to perf pmu, perf stat, perf trace, eBPF support,
perf record, perf diff, etc. - please see the shortlog and Git log
for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (252 commits)
tools arch x86: Sync asm/cpufeatures.h with the with the kernel
tools build: Check if gettid() is available before providing helper
perf jvmti: Address gcc string overflow warning for strncpy()
perf python: Remove -fstack-protector-strong if clang doesn't have it
perf annotate TUI browser: Do not use member from variable within its own initialization
perf tests: Fix record+probe_libc_inet_pton.sh for powerpc64
perf evsel: Do not rely on errno values for precise_ip fallback
perf thread: Allow references to thread objects after machine__exit()
perf header: Assign proper ff->ph in perf_event__synthesize_features()
tools arch kvm: Sync kvm headers with the kernel sources
perf script: Allow specifying the files to process guest samples
perf tools metric: Don't include duration_time in group
perf list: Avoid extra : for --raw metrics
perf vendor events intel: Metric fixes for SKX/CLX
perf tools: Fix typos / broken sentences
perf jevents: Add support for Hisi hip08 L3C PMU aliasing
perf jevents: Add support for Hisi hip08 HHA PMU aliasing
perf jevents: Add support for Hisi hip08 DDRC PMU aliasing
perf pmu: Support more complex PMU event aliasing
perf diff: Documentation -c cycles option
...
Pull scheduler updates from Ingo Molnar:
- Remove the unused per rq load array and all its infrastructure, by
Dietmar Eggemann.
- Add utilization clamping support by Patrick Bellasi. This is a
refinement of the energy aware scheduling framework with support for
boosting of interactive and capping of background workloads: to make
sure critical GUI threads get maximum frequency ASAP, and to make
sure background processing doesn't unnecessarily move to cpufreq
governor to higher frequencies and less energy efficient CPU modes.
- Add the bare minimum of tracepoints required for LISA EAS regression
testing, by Qais Yousef - which allows automated testing of various
power management features, including energy aware scheduling.
- Restructure the former tsk_nr_cpus_allowed() facility that the -rt
kernel used to modify the scheduler's CPU affinity logic such as
migrate_disable() - introduce the task->cpus_ptr value instead of
taking the address of &task->cpus_allowed directly - by Sebastian
Andrzej Siewior.
- Misc optimizations, fixes, cleanups and small enhancements - see the
Git log for details.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
sched/uclamp: Add uclamp support to energy_compute()
sched/uclamp: Add uclamp_util_with()
sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
sched/uclamp: Set default clamps for RT tasks
sched/uclamp: Reset uclamp values on RESET_ON_FORK
sched/uclamp: Extend sched_setattr() to support utilization clamping
sched/core: Allow sched_setattr() to use the current policy
sched/uclamp: Add system default clamps
sched/uclamp: Enforce last task's UCLAMP_MAX
sched/uclamp: Add bucket local max tracking
sched/uclamp: Add CPU's clamp buckets refcounting
sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
sched/debug: Export the newly added tracepoints
sched/debug: Add sched_overutilized tracepoint
sched/debug: Add new tracepoint to track PELT at se level
sched/debug: Add new tracepoints to track PELT at rq level
sched/debug: Add a new sched_trace_*() helper functions
sched/autogroup: Make autogroup_path() always available
sched/wait: Deduplicate code with do-while
sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
...
- Fixes a deadlock from a previous fix to keep module loading
and function tracing text modifications from stepping on each other.
(this has a few patches to help document the issue in comments)
- Fix a crash when the snapshot buffer gets out of sync with the
main ring buffer.
- Fix a memory leak when reading the memory logs
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXRzBCBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnDaAP9qTFBOFtgIGCT5wVP8xjQeESxh1b8R
tbaT7/U2oPpeiwEAvp1mYo5UYcc8KauBqVaLSLJVN4pv07xiZF5Qgh9C1QE=
=m2IT
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This includes three fixes:
- Fix a deadlock from a previous fix to keep module loading and
function tracing text modifications from stepping on each other
(this has a few patches to help document the issue in comments)
- Fix a crash when the snapshot buffer gets out of sync with the main
ring buffer
- Fix a memory leak when reading the memory logs"
* tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/x86: Anotate text_mutex split between ftrace_arch_code_modify_post_process() and ftrace_arch_code_modify_prepare()
tracing/snapshot: Resize spare buffer if size changed
tracing: Fix memory leak in tracing_err_log_open()
ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare()
ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()
The commit 9f255b632b ("module: Fix livepatch/ftrace module text
permissions race") causes a possible deadlock between register_kprobe()
and ftrace_run_update_code() when ftrace is using stop_machine().
The existing dependency chain (in reverse order) is:
-> #1 (text_mutex){+.+.}:
validate_chain.isra.21+0xb32/0xd70
__lock_acquire+0x4b8/0x928
lock_acquire+0x102/0x230
__mutex_lock+0x88/0x908
mutex_lock_nested+0x32/0x40
register_kprobe+0x254/0x658
init_kprobes+0x11a/0x168
do_one_initcall+0x70/0x318
kernel_init_freeable+0x456/0x508
kernel_init+0x22/0x150
ret_from_fork+0x30/0x34
kernel_thread_starter+0x0/0xc
-> #0 (cpu_hotplug_lock.rw_sem){++++}:
check_prev_add+0x90c/0xde0
validate_chain.isra.21+0xb32/0xd70
__lock_acquire+0x4b8/0x928
lock_acquire+0x102/0x230
cpus_read_lock+0x62/0xd0
stop_machine+0x2e/0x60
arch_ftrace_update_code+0x2e/0x40
ftrace_run_update_code+0x40/0xa0
ftrace_startup+0xb2/0x168
register_ftrace_function+0x64/0x88
klp_patch_object+0x1a2/0x290
klp_enable_patch+0x554/0x980
do_one_initcall+0x70/0x318
do_init_module+0x6e/0x250
load_module+0x1782/0x1990
__s390x_sys_finit_module+0xaa/0xf0
system_call+0xd8/0x2d0
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(text_mutex);
lock(cpu_hotplug_lock.rw_sem);
lock(text_mutex);
lock(cpu_hotplug_lock.rw_sem);
It is similar problem that has been solved by the commit 2d1e38f566
("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
To be on the safe side, text_mutex must become a low level lock taken
after cpu_hotplug_lock.rw_sem.
This can't be achieved easily with the current ftrace design.
For example, arm calls set_all_modules_text_rw() already in
ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
This functions is called:
+ outside stop_machine() from ftrace_run_update_code()
+ without stop_machine() from ftrace_module_enable()
Fortunately, the problematic fix is needed only on x86_64. It is
the only architecture that calls set_all_modules_text_rw()
in ftrace path and supports livepatching at the same time.
Therefore it is enough to move text_mutex handling from the generic
kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:
ftrace_arch_code_modify_prepare()
ftrace_arch_code_modify_post_process()
This patch basically reverts the ftrace part of the problematic
commit 9f255b632b ("module: Fix livepatch/ftrace module
text permissions race"). And provides x86_64 specific-fix.
Some refactoring of the ftrace code will be needed when livepatching
is implemented for arm or nds32. These architectures call
set_all_modules_text_rw() and use stop_machine() at the same time.
Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com
Fixes: 9f255b632b ("module: Fix livepatch/ftrace module text permissions race")
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
[
As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
ftrace_run_update_code() as it is a void function.
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With CONFIG_MODULES=n, the following compiler warning occurs:
/data/users/yhs/work/net-next/kernel/trace/bpf_trace.c:605:13: warning:
‘do_bpf_send_signal’ defined but not used [-Wunused-function]
static void do_bpf_send_signal(struct irq_work *entry)
The __init function send_signal_irq_work_init(), which calls
do_bpf_send_signal(), is defined under CONFIG_MODULES. Hence,
when CONFIG_MODULES=n, nobody calls static function do_bpf_send_signal(),
hence the warning.
The init function send_signal_irq_work_init() should work without
CONFIG_MODULES. Moving it out of CONFIG_MODULES
code section fixed the compiler warning, and also make bpf_send_signal()
helper work without CONFIG_MODULES.
Fixes: 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
Reported-By: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull networking fixes from David Miller:
"Lots of bug fixes here:
1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.
2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
Crispin.
3) Use after free in psock backlog workqueue, from John Fastabend.
4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
Salem.
5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.
6) Network header needs to be set for packet redirect in nfp, from
John Hurley.
7) Fix udp zerocopy refcnt, from Willem de Bruijn.
8) Don't assume linear buffers in vxlan and geneve error handlers,
from Stefano Brivio.
9) Fix TOS matching in mlxsw, from Jiri Pirko.
10) More SCTP cookie memory leak fixes, from Neil Horman.
11) Fix VLAN filtering in rtl8366, from Linus Walluij.
12) Various TCP SACK payload size and fragmentation memory limit fixes
from Eric Dumazet.
13) Use after free in pneigh_get_next(), also from Eric Dumazet.
14) LAPB control block leak fix from Jeremy Sowden"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
lapb: fixed leak of control-blocks.
tipc: purge deferredq list for each grp member in tipc_group_delete
ax25: fix inconsistent lock state in ax25_destroy_timer
neigh: fix use-after-free read in pneigh_get_next
tcp: fix compile error if !CONFIG_SYSCTL
hv_sock: Suppress bogus "may be used uninitialized" warnings
be2net: Fix number of Rx queues used for flow hashing
net: handle 802.1P vlan 0 packets properly
tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
tcp: add tcp_min_snd_mss sysctl
tcp: tcp_fragment() should apply sane memory limits
tcp: limit payload size of sacked skbs
Revert "net: phylink: set the autoneg state in phylink_phy_change"
bpf: fix nested bpf tracepoints with per-cpu data
bpf: Fix out of bounds memory access in bpf_sk_storage
vsock/virtio: set SOCK_DONE on peer shutdown
net: dsa: rtl8366: Fix up VLAN filtering
net: phylink: set the autoneg state in phylink_phy_change
net: add high_order_alloc_disable sysctl/static key
tcp: add tcp_tx_skb_cache sysctl
...
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.
This enables three levels of nesting, to support
- a kprobe or raw tp or perf event,
- another one of the above that irq context happens to call, and
- another one in nmi context
(at most one of which may be a kprobe or perf event).
Fixes: 20b9d7ac48 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Out of range read of stack trace output
- Fix for NULL pointer dereference in trace_uprobe_create()
- Fix to a livepatching / ftrace permission race in the module code
- Fix for NULL pointer dereference in free_ftrace_func_mapper()
- A couple of build warning clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXQToxhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qusmAP4/mmJPgsDchnu5ui0wB8BByzJlsPn8
luXFDuqI4f34zgD+JCmeYbj5LLh98D9XkaaEgP4yz3yKsWeSdwWPCU0vTgo=
=M/+E
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Out of range read of stack trace output
- Fix for NULL pointer dereference in trace_uprobe_create()
- Fix to a livepatching / ftrace permission race in the module code
- Fix for NULL pointer dereference in free_ftrace_func_mapper()
- A couple of build warning clean ups
* tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
module: Fix livepatch/ftrace module text permissions race
tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
tracing: Make two symbols static
tracing: avoid build warning with HAVE_NOP_MCOUNT
tracing: Fix out-of-range read in trace_stack_print()
Commit 0597c49c69 ("tracing/uprobes: Use dyn_event framework for
uprobe events") cleaned up the usage of trace_uprobe_create(), and the
function has been no longer used for removing uprobe/uretprobe.
Link: http://lkml.kernel.org/r/20190614074026.8045-2-devel@etsukata.com
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix sparse warnings:
kernel/trace/trace.c:6927:24: warning:
symbol 'get_tracing_log_err' was not declared. Should it be static?
kernel/trace/trace.c:8196:15: warning:
symbol 'trace_instance_dir' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20190614153210.24424-1-yuehaibing@huawei.com
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Selecting HAVE_NOP_MCOUNT enables -mnop-mcount (if gcc supports it)
and sets CC_USING_NOP_MCOUNT. Reuse __is_defined (which is suitable for
testing CC_USING_* defines) to avoid conditional compilation and fix
the following gcc 9 warning on s390:
kernel/trace/ftrace.c:2514:1: warning: ‘ftrace_code_disable’ defined
but not used [-Wunused-function]
Link: http://lkml.kernel.org/r/patch.git-1a82d13f33ac.your-ad-here.call-01559732716-ext-6629@work.hours
Fixes: 2f4df0017b ("tracing: Add -mcount-nop option support")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, non-privileged user could only use uprobe with
kernel.perf_event_paranoid = -1
However, setting perf_event_paranoid to -1 leaks other users' processes to
non-privileged uprobes.
To introduce proper permission control of uprobes, we are building the
following system:
A daemon with CAP_SYS_ADMIN is in charge to create uprobes via tracefs;
Users asks the daemon to create uprobes;
Then user can attach uprobe only to processes owned by the user.
This patch allows non-privileged user to attach uprobe to processes owned
by the user.
The following example shows how to use uprobe with non-privileged user.
This is based on Brendan's blog post [1]
1. Create uprobe with root:
sudo perf probe -x 'readline%return +0($retval):string'
2. Then non-root user can use the uprobe as:
perf record -vvv -e probe_bash:readline__return -p <pid> sleep 20
perf script
[1] http://www.brendangregg.com/blog/2015-06-28/linux-ftrace-uprobe.html
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190507161545.788381-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In commit:
4b53a3412d ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper")
the tsk_nr_cpus_allowed() wrapper was removed. There was not
much difference in !RT but in RT we used this to implement
migrate_disable(). Within a migrate_disable() section the CPU mask is
restricted to single CPU while the "normal" CPU mask remains untouched.
As an alternative implementation Ingo suggested to use:
struct task_struct {
const cpumask_t *cpus_ptr;
cpumask_t cpus_mask;
};
with
t->cpus_ptr = &t->cpus_mask;
In -RT we then can switch the cpus_ptr to:
t->cpus_ptr = &cpumask_of(task_cpu(p));
in a migration disabled region. The rules are simple:
- Code that 'uses' ->cpus_allowed would use the pointer.
- Code that 'modifies' ->cpus_allowed would use the direct mask.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2019-05-31
The following pull-request contains BPF updates for your *net-next* tree.
Lots of exciting new features in the first PR of this developement cycle!
The main changes are:
1) misc verifier improvements, from Alexei.
2) bpftool can now convert btf to valid C, from Andrii.
3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
This feature greatly improves BPF speed on 32-bit architectures. From Jiong.
4) cgroups will now auto-detach bpf programs. This fixes issue of thousands
bpf programs got stuck in dying cgroups. From Roman.
5) new bpf_send_signal() helper, from Yonghong.
6) cgroup inet skb programs can signal CN to the stack, from Lawrence.
7) miscellaneous cleanups, from many developers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXO303xQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlb7AP4179SWPy9RFxvxyjTmpPFPL9oR0Q26
sOTyIBN99MUTsgEA0FNWz7/FWtFDa1wbh0tEVreaTQlKEeoIYF96dkN0iwE=
=BJgg
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"This fixes a memory leak from the error path in the event filter
logic"
* tag 'trace-v5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Avoid memory leak in predicate_parse()
Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In case of errors, predicate_parse() goes to the out_free label
to free memory and to return an error code.
However, predicate_parse() does not free the predicates of the
temporary prog_stack array, thence leaking them.
Link: http://lkml.kernel.org/r/20190528154338.29976-1-tomasbortoli@gmail.com
Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: syzbot+6b8e0fb820e570c59e19@syzkaller.appspotmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
[ Added protection around freeing prog_stack[i].pred ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit c19fa94a8f ("Add HAVE_64BIT_ALIGNED_ACCESS") added the config for
architectures that required 64bit aligned access for all 64bit words. As
the ftrace ring buffer stores data on 4 byte alignment, this config option
was used to force it to store data on 8 byte alignment to make sure the data
being stored and written directly into the ring buffer was 8 byte aligned as
it would cause issues trying to write an 8 byte word on a 4 not 8 byte
aligned memory location.
But with the removal of the metag architecture, which was the only
architecture to use this, there is no architecture supported by Linux that
requires 8 byte aligne access for all 8 byte words (4 byte alignment is good
enough). Removing this config can simplify the code a bit.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
introduced bpf_send_signal() helper. If the context is nmi,
the sending signal work needs to be deferred to irq_work.
If the signal is invalid, the error will appear in irq_work
and it won't be propagated to user.
This patch did an early check in the helper itself to notify
user invalid signal, as suggested by Daniel.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
GCC 9 now warns about calling memset() on partial structures when it
goes across multiple fields. This adds a helper for the place in
tracing that does this type of clearing of a structure.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXOrlfhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoDhAP4mogBm0JjJ1LWr8RX2/X7qFm0x1zLz
5Mk0QKfeRP3MYgEAl2mV/HeFp7aMxEY2CKy0LslmaXPhamPx1r0LlfMgIws=
=drP3
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing warning fix from Steven Rostedt:
"Make the GCC 9 warning for sub struct memset go away.
GCC 9 now warns about calling memset() on partial structures when it
goes across multiple fields. This adds a helper for the place in
tracing that does this type of clearing of a structure"
* tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Silence GCC 9 array bounds warning
Custom trampolines can only be enabled if there is only a single ops
attached to it. If there's only a single callback registered to a function,
and the ops has a trampoline registered for it, then we can call the
trampoline directly. This is very useful for improving the performance of
ftrace and livepatch.
If more than one callback is registered to a function, the general
trampoline is used, and the custom trampoline is not restored back to the
direct call even if all the other callbacks were unregistered and we are
back to one callback for the function.
To fix this, set FTRACE_FL_TRAMP flag if rec count is decremented
to one, and the ops that left has a trampoline.
Testing After this patch :
insmod livepatch_unshare_files.ko
cat /sys/kernel/debug/tracing/enabled_functions
unshare_files (1) R I tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0
echo unshare_files > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/enabled_functions
unshare_files (2) R I ->ftrace_ops_list_func+0x0/0x150
echo nop > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/enabled_functions
unshare_files (1) R I tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0
Link: http://lkml.kernel.org/r/1556969979-111047-1-git-send-email-cj.chengjian@huawei.com
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The trace event self tests enable loop through *all* events, enables each
one, one at a time, runs some code to trigger various events (not
necessarily the same events), and checks if anything went wrong. The issue
is that trace events are usually the least likely start up test to cause a
problem, but they take the longest to run (because there are so many
events). When one of the other tests trigger a bug, the trace event start up
tests causes the bisect to take much longer, because it takes 10s of seconds
to get through the trace event tests.
By making them a separate config (even though they are enabled by default if
start up tests are set), it is possible to turn them off and still run the
other tracing start up tests much quicker.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add kprobe_event= boot parameter to define kprobe events
at boot time.
The definition syntax is similar to tracefs/kprobe_events
interface, but use ',' and ';' instead of ' ' and '\n'
respectively. e.g.
kprobe_event=p,vfs_read,$arg1,$arg2
This puts a probe on vfs_read with argument1 and 2, and
enable the new event.
Link: http://lkml.kernel.org/r/155851395498.15728.830529496248543583.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The comment of trace_filter_add_remove_task() refers to the function as
'trace_pid_filter_add_remove_task', use the correct name.
Link: http://lkml.kernel.org/r/20190523192628.134406-1-mka@chromium.org
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Support user-space dereference syntax for probe event arguments
to dereference the data-structure or array in user-space.
The syntax is just adding 'u' before an offset value.
+|-u<OFFSET>(<FETCHARG>)
e.g. +u8(%ax), +u0(+0(%si))
For example, if you probe do_sched_setscheduler(pid, policy,
param) and record param->sched_priority, you can add new
probe as below;
p do_sched_setscheduler priority=+u0($arg3)
Note that kprobe event provides this and it doesn't change the
dereference method automatically because we do not know whether
the given address is in userspace or kernel on some archs.
So as same as "ustring", this is an option for user, who has to
carefully choose the dereference method.
Link: http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add "ustring" type for fetching user-space string from kprobe event.
User can specify ustring type at uprobe event, and it is same as
"string" for uprobe.
Note that probe-event provides this option but it doesn't choose the
correct type automatically since we have not way to decide the address
is in user-space or not on some arch (and on some other arch, you can
fetch the string by "string" type). So user must carefully check the
target code (e.g. if you see __user on the target variable) and
use this new type.
Link: http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The code modification functions have "enable" and "update" variables that
are sometimes "int" but used as "bool". Remove the ambiguity and make them
"bool" when they are only used for true or false values.
Link: http://lkml.kernel.org/r/e1429923d9eda92a3cf5ee9e33c7eacce539781d.1558115654.git.naveen.n.rao@linux.vnet.ibm.com
Reported-by: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Starting with GCC 9, -Warray-bounds detects cases when memset is called
starting on a member of a struct but the size to be cleared ends up
writing over further members.
Such a call happens in the trace code to clear, at once, all members
after and including `seq` on struct trace_iterator:
In function 'memset',
inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
./include/linux/string.h:344:9: warning: '__builtin_memset' offset
[8505, 8560] from the object at 'iter' is out of the bounds of
referenced subobject 'seq' with type 'struct trace_seq' at offset
4368 [-Warray-bounds]
344 | return __builtin_memset(p, c, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to avoid GCC complaining about it, we compute the address
ourselves by adding the offsetof distance instead of referring
directly to the member.
Since there are two places doing this clear (trace.c and trace_kdb.c),
take the chance to move the workaround into a single place in
the internal header.
Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.com
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
[ Removed unnecessary parenthesis around "iter" ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
code and I forgot to incorporate them.
I also added a small clean up patch that was sent to me a while ago
and I just noticed it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXOixqRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlPsAQCmNzno+SJMXLLIojZ80pqs9PqLqIrW
iBopFNihRBAkxgEAle8Pvr53S4R/Hjn6j++kxBm/uWaEfICOzyqcSyqxlQA=
=Sin+
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Tom Zanussi sent me some small fixes and cleanups to the histogram
code and I forgot to incorporate them.
I also added a small clean up patch that was sent to me a while ago
and I just noticed it"
* tag 'trace-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
kernel/trace/trace.h: Remove duplicate header of trace_seq.h
tracing: Add a check_val() check before updating cond_snapshot() track_val
tracing: Check keys for variable references in expressions too
tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts
This patch tries to solve the following specific use case.
Currently, bpf program can already collect stack traces
through kernel function get_perf_callchain()
when certain events happens (e.g., cache miss counter or
cpu clock counter overflows). But such stack traces are
not enough for jitted programs, e.g., hhvm (jited php).
To get real stack trace, jit engine internal data structures
need to be traversed in order to get the real user functions.
bpf program itself may not be the best place to traverse
the jit engine as the traversing logic could be complex and
it is not a stable interface either.
Instead, hhvm implements a signal handler,
e.g. for SIGALARM, and a set of program locations which
it can dump stack traces. When it receives a signal, it will
dump the stack in next such program location.
Such a mechanism can be implemented in the following way:
. a perf ring buffer is created between bpf program
and tracing app.
. once a particular event happens, bpf program writes
to the ring buffer and the tracing app gets notified.
. the tracing app sends a signal SIGALARM to the hhvm.
But this method could have large delays and causing profiling
results skewed.
This patch implements bpf_send_signal() helper to send
a signal to hhvm in real time, resulting in intended stack traces.
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Here are series of patches that add SPDX tags to different kernel files,
based on two different things:
- SPDX entries are added to a bunch of files that we missed a year ago
that do not have any license information at all.
These were either missed because the tool saw the MODULE_LICENSE()
tag, or some EXPORT_SYMBOL tags, and got confused and thought the
file had a real license, or the files have been added since the last
big sweep, or they were Makefile/Kconfig files, which we didn't
touch last time.
- Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
tools can determine the license text in the file itself. Where this
happens, the license text is removed, in order to cut down on the
700+ different ways we have in the kernel today, in a quest to get
rid of all of these.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on the
patches are reviewers.
The reason for these "large" patches is if we were to continue to
progress at the current rate of change in the kernel, adding license
tags to individual files in different subsystems, we would be finished
in about 10 years at the earliest.
There will be more series of these types of patches coming over the next
few weeks as the tools and reviewers crunch through the more "odd"
variants of how to say "GPLv2" that developers have come up with over
the years, combined with other fun oddities (GPL + a BSD disclaimer?)
that are being unearthed, with the goal for the whole kernel to be
cleaned up.
These diffstats are not small, 3840 files are touched, over 10k lines
removed in just 24 patches.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOP8uw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynmGQCgy3evqzleuOITDpuWaxewFdHqiJYAnA7KRw4H
1KwtfRnMtG6dk/XaS7H7
=O9lH
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull SPDX update from Greg KH:
"Here is a series of patches that add SPDX tags to different kernel
files, based on two different things:
- SPDX entries are added to a bunch of files that we missed a year
ago that do not have any license information at all.
These were either missed because the tool saw the MODULE_LICENSE()
tag, or some EXPORT_SYMBOL tags, and got confused and thought the
file had a real license, or the files have been added since the
last big sweep, or they were Makefile/Kconfig files, which we
didn't touch last time.
- Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
tools can determine the license text in the file itself. Where this
happens, the license text is removed, in order to cut down on the
700+ different ways we have in the kernel today, in a quest to get
rid of all of these.
These patches have been out for review on the linux-spdx@vger mailing
list, and while they were created by automatic tools, they were
hand-verified by a bunch of different people, all whom names are on
the patches are reviewers.
The reason for these "large" patches is if we were to continue to
progress at the current rate of change in the kernel, adding license
tags to individual files in different subsystems, we would be finished
in about 10 years at the earliest.
There will be more series of these types of patches coming over the
next few weeks as the tools and reviewers crunch through the more
"odd" variants of how to say "GPLv2" that developers have come up with
over the years, combined with other fun oddities (GPL + a BSD
disclaimer?) that are being unearthed, with the goal for the whole
kernel to be cleaned up.
These diffstats are not small, 3840 files are touched, over 10k lines
removed in just 24 patches"
* tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (24 commits)
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 23
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 22
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 21
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 20
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 19
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 17
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 15
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 14
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 12
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 10
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 7
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 4
treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3
...
Without this check a snapshot is taken whenever a bucket's max is hit,
rather than only when the global max is hit, as it should be.
Before:
In this example, we do a first run of the workload (cyclictest),
examine the output, note the max ('triggering value') (347), then do
a second run and note the max again.
In this case, the max in the second run (39) is below the max in the
first run, but since we haven't cleared the histogram, the first max
is still in the histogram and is higher than any other max, so it
should still be the max for the snapshot. It isn't however - the
value should still be 347 after the second run.
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_prio,next_comm,prev_pid,prev_prio,prev_comm):onmax($wakeup_lat).snapshot() if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# cyclictest -p 80 -n -s -t 2 -D 2
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
{ next_pid: 2143 } hitcount: 199
max: 44 next_prio: 120 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/4
{ next_pid: 2145 } hitcount: 1325
max: 38 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/2
{ next_pid: 2144 } hitcount: 1982
max: 347 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/6
Snapshot taken (see tracing/snapshot). Details:
triggering value { onmax($wakeup_lat) }: 347
triggered by event with key: { next_pid: 2144 }
# cyclictest -p 80 -n -s -t 2 -D 2
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
{ next_pid: 2143 } hitcount: 199
max: 44 next_prio: 120 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/4
{ next_pid: 2148 } hitcount: 199
max: 16 next_prio: 120 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/1
{ next_pid: 2145 } hitcount: 1325
max: 38 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/2
{ next_pid: 2150 } hitcount: 1326
max: 39 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/4
{ next_pid: 2144 } hitcount: 1982
max: 347 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/6
{ next_pid: 2149 } hitcount: 1983
max: 130 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/0
Snapshot taken (see tracing/snapshot). Details:
triggering value { onmax($wakeup_lat) }: 39
triggered by event with key: { next_pid: 2150 }
After:
In this example, we do a first run of the workload (cyclictest),
examine the output, note the max ('triggering value') (375), then do
a second run and note the max again.
In this case, the max in the second run is still 375, the highest in
any bucket, as it should be.
# cyclictest -p 80 -n -s -t 2 -D 2
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
{ next_pid: 2072 } hitcount: 200
max: 28 next_prio: 120 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/5
{ next_pid: 2074 } hitcount: 1323
max: 375 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/2
{ next_pid: 2073 } hitcount: 1980
max: 153 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/6
Snapshot taken (see tracing/snapshot). Details:
triggering value { onmax($wakeup_lat) }: 375
triggered by event with key: { next_pid: 2074 }
# cyclictest -p 80 -n -s -t 2 -D 2
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
{ next_pid: 2101 } hitcount: 199
max: 49 next_prio: 120 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/6
{ next_pid: 2072 } hitcount: 200
max: 28 next_prio: 120 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/5
{ next_pid: 2074 } hitcount: 1323
max: 375 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/2
{ next_pid: 2103 } hitcount: 1325
max: 74 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/4
{ next_pid: 2073 } hitcount: 1980
max: 153 next_prio: 19 next_comm: cyclictest
prev_pid: 0 prev_prio: 120 prev_comm: swapper/6
{ next_pid: 2102 } hitcount: 1981
max: 84 next_prio: 19 next_comm: cyclictest
prev_pid: 12 prev_prio: 120 prev_comm: kworker/0:1
Snapshot taken (see tracing/snapshot). Details:
triggering value { onmax($wakeup_lat) }: 375
triggered by event with key: { next_pid: 2074 }
Link: http://lkml.kernel.org/r/95958351329f129c07504b4d1769c47a97b70d65.1555597045.git.tom.zanussi@linux.intel.com
Cc: stable@vger.kernel.org
Fixes: a3785b7eca ("tracing: Add hist trigger snapshot() action")
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There's an existing check for variable references in keys, but it
doesn't go far enough. It checks whether a key field is a variable
reference but doesn't check whether it's an expression containing
variable references, which can cause the same problems for callers.
Use the existing field_has_hist_vars() function rather than a direct
top-level flag check to catch all possible variable references.
Link: http://lkml.kernel.org/r/e8c3d3d53db5ca90ceea5a46e5413103a6902fc7.1555597045.git.tom.zanussi@linux.intel.com
Cc: stable@vger.kernel.org
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Reported-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
hist_field_var_ref() is an implementation of hist_field_fn_t(), which
can be called with a null tracing_map_elt elt param when assembling a
key in event_hist_trigger().
In the case of hist_field_var_ref() this doesn't make sense, because a
variable can only be resolved by looking it up using an already
assembled key i.e. a variable can't be used to assemble a key since
the key is required in order to access the variable.
Upper layers should prevent the user from constructing a key using a
variable in the first place, but in case one slips through, it
shouldn't cause a NULL pointer dereference. Also if one does slip
through, we want to know about it, so emit a one-time warning in that
case.
Link: http://lkml.kernel.org/r/64ec8dc15c14d305295b64cdfcc6b2b9dd14753f.1555597045.git.tom.zanussi@linux.intel.com
Reported-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull networking fixes from David Miller:1) Use after free in __dev_map_entry_free(), from Eric Dumazet.
1) Use after free in __dev_map_entry_free(), from Eric Dumazet.
2) Fix TCP retransmission timestamps on passive Fast Open, from Yuchung
Cheng.
3) Orphan NFC, we'll take the patches directly into my tree. From
Johannes Berg.
4) We can't recycle cloned TCP skbs, from Eric Dumazet.
5) Some flow dissector bpf test fixes, from Stanislav Fomichev.
6) Fix RCU marking and warnings in rhashtable, from Herbert Xu.
7) Fix some potential fib6 leaks, from Eric Dumazet.
8) Fix a _decode_session4 uninitialized memory read bug fix that got
lost in a merge. From Florian Westphal.
9) Fix ipv6 source address routing wrt. exception route entries, from
Wei Wang.
10) The netdev_xmit_more() conversion was not done %100 properly in mlx5
driver, fix from Tariq Toukan.
11) Clean up botched merge on netfilter kselftest, from Florian
Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (74 commits)
of_net: fix of_get_mac_address retval if compiled without CONFIG_OF
net: fix kernel-doc warnings for socket.c
net: Treat sock->sk_drops as an unsigned int when printing
kselftests: netfilter: fix leftover net/net-next merge conflict
mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM
mlxsw: core: Prevent QSFP module initialization for old hardware
vsock/virtio: Initialize core virtio vsock before registering the driver
net/mlx5e: Fix possible modify header actions memory leak
net/mlx5e: Fix no rewrite fields with the same match
net/mlx5e: Additional check for flow destination comparison
net/mlx5e: Add missing ethtool driver info for representors
net/mlx5e: Fix number of vports for ingress ACL configuration
net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
net/mlx5e: Fix wrong xmit_more application
net/mlx5: Fix peer pf disable hca command
net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
net/mlx5: Add meaningful return codes to status_to_err function
net/mlx5: Imply MLXFW in mlx5_core
Revert "tipc: fix modprobe tipc failed after switch order of device registration"
vsock/virtio: free packets during the socket release
...
Daniel Borkmann says:
====================
pull-request: bpf 2019-05-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a use after free in __dev_map_entry_free(), from Eric.
2) Several sockmap related bug fixes: a splat in strparser if
it was never initialized, remove duplicate ingress msg list
purging which can race, fix msg->sg.size accounting upon
skb to msg conversion, and last but not least fix a timeout
bug in tcp_bpf_wait_data(), from John.
3) Fix LRU map to avoid messing with eviction heuristics upon
syscall lookup, e.g. map walks from user space side will
then lead to eviction of just recently created entries on
updates as it would mark all map entries, from Daniel.
4) Don't bail out when libbpf feature probing fails. Also
various smaller fixes to flow_dissector test, from Stanislav.
5) Fix missing brackets for BTF_INT_OFFSET() in UAPI, from Gary.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- Removing of non-DYNAMIC_FTRACE from 32bit x86
- Removing of mcount support from x86
- Emulating a call from int3 on x86_64, fixes live kernel patching
- Consolidated Tracing Error logs file
Minor updates:
- Removal of klp_check_compiler_support()
- kdb ftrace dumping output changes
- Accessing and creating ftrace instances from inside the kernel
- Clean up of #define if macro
- Introduction of TRACE_EVENT_NOP() to disable trace events based on config
options
And other minor fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXNxMZxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qq4PAP44kP6VbwL8CHyI2A3xuJ6Hwxd+2Z2r
ip66RtzyJ+2iCgEA2QCuWUlEt2bLpF9a8IQ4N9tWenSeW2i7gunPb+tioQw=
=RVQo
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The major changes in this tracing update includes:
- Removal of non-DYNAMIC_FTRACE from 32bit x86
- Removal of mcount support from x86
- Emulating a call from int3 on x86_64, fixes live kernel patching
- Consolidated Tracing Error logs file
Minor updates:
- Removal of klp_check_compiler_support()
- kdb ftrace dumping output changes
- Accessing and creating ftrace instances from inside the kernel
- Clean up of #define if macro
- Introduction of TRACE_EVENT_NOP() to disable trace events based on
config options
And other minor fixes and clean ups"
* tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
x86: Hide the int3_emulate_call/jmp functions from UML
livepatch: Remove klp_check_compiler_support()
ftrace/x86: Remove mcount support
ftrace/x86_32: Remove support for non DYNAMIC_FTRACE
tracing: Simplify "if" macro code
tracing: Fix documentation about disabling options using trace_options
tracing: Replace kzalloc with kcalloc
tracing: Fix partial reading of trace event's id file
tracing: Allow RCU to run between postponed startup tests
tracing: Fix white space issues in parse_pred() function
tracing: Eliminate const char[] auto variables
ring-buffer: Fix mispelling of Calculate
tracing: probeevent: Fix to make the type of $comm string
tracing: probeevent: Do not accumulate on ret variable
tracing: uprobes: Re-enable $comm support for uprobe events
ftrace/x86_64: Emulate call function while updating in breakpoint handler
x86_64: Allow breakpoints to emulate call instructions
x86_64: Add gap to int3 to allow for call emulation
tracing: kdb: Allow ftdump to skip all but the last few entries
tracing: Add trace_total_entries() / trace_total_entries_cpu()
...
Both of them are not declared in the headers and not used outside
of bpf_trace.c file.
Fixes: a38d1107f9 ("bpf: support raw tracepoints in modules")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Core changes:
- The gpiolib MMIO driver has been enhanced to handle two direction
registers, i.e. one register to set lines as input and one register
to set lines as output. It turns out some silicon engineer thinks
the ability to configure a line as input and output at the same
time makes sense, this can be debated but includes a lot of analog
electronics reasoning, and the registers are there and need to
be handled consistently. Unsurprisingly, we enforce the lines to
be either inputs or outputs in such schemes.
- Send in the proper argument value to .set_config() dispatched to
the pin control subsystem. Nobody used it before, now someone
does, so fix it to work as expected.
- The ACPI gpiolib portions can now handle pin bias setting (pull up
or pull down). This has been in the ACPI spec for years and we
finally have it properly integrated with Linux GPIOs. It was based
on an observation from Andy Schevchenko that Thomas Petazzoni's
changes to the core for biasing the PCA950x GPIO expander actually
happen to fit hand-in-glove with what the ACPI core needed.
Such nice synergies happen sometimes.
New drivers:
- A new driver for the Mellanox BlueField GPIO controller. This is
using 64bit MMIO registers and can configure lines as inputs
and outputs at the same time and after improving the MMIO library
we handle it just fine. Interesting.
- A new IXP4xx proper gpiochip driver with hierarchical interrupts
should be coming in from the ARM SoC tree as well.
Driver enhancements:
- The PCA053x driver handles the CAT9554 GPIO expander.
- The PCA053x driver handles the NXP PCAL6416 GPIO expander.
- Wake-up support on PCA053x GPIO lines.
- OMAP now does a nice asynchronous IRQ handling on wake-ups by
letting everything wake up on edges, and this makes runtime PM
work as expected too.
Misc:
- Several cleanups such as devres fixes.
- Get rid of some languager comstructs that cause problems when
compiling with LLVMs clang.
- Documentation review and update.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJc1olZAAoJEEEQszewGV1zEU4P/RmTf3hG8xmNPS3MDTmR6gAy
/YJOXjXBf3CD/dmEAyyaNLnUQismrtRNvHSoEGbno7gkU+htzp9UfUJkj6+HIXs2
RpF+Hi78HzZNDxGWuBLu6OZolpmBtx+sRKOhHk/XfNS45qd1FgXWDuulzsYa9Xsr
hYMXdtdv9wY/vcc68q1rtKAbzlu5ZNCa3Zj1iNOr/XQt3Nl2BW66hGLgjK4mOvgx
fJy4rFXuDIMfDvo69U1Opz2b39sfE7XMhfZS/MOgg4yEV9zGRgDoI1tyMcTqGb8Q
8LQbp5dXkP+3dJQB8tgbu3Vk4WC1Rd/pmIli5sMgsk0HYQ6XegfT6HJKozSmwN9r
0s8jKlrocWZvdPo1aJwQgtRS56t2rFWcrcRye8bLqxkkW5cYIq9CwkE8USwB31Kv
PFpoOwRuCtj0gkCxf7WIEcC5NAkYPow3K1KPdk3E0Si6I3pj0NqqlaAD0JAlkC2V
aPq3xbTuFCAdmcADEt2Z+dUJ7WIs5Y9oQgosMAx+A2AD4K3QDBMu3pZsT6SCu4XZ
mK0eWJi9/CvOj/s7bA0BEJVxQA+p8KYsNRBOULg/8aAOqGcLnSydQjqrxDTE8YrL
xmmRG7i7ht0B9CchZuIB5hqdvjbCgvcVa5OnCUDfLxE0GdCx8iJ9y9OrsMXbabYq
8FcPDo1N38cTYLnLqvKI
=rhto
-----END PGP SIGNATURE-----
Merge tag 'gpio-v5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull gpio updates from Linus Walleij:
"This is the bulk of the GPIO changes for the v5.2 kernel cycle. A bit
later than usual because I was ironing out my own mistakes. I'm
holding some stuff back for the next kernel as a result, and this
should be a healthy and well tested batch.
Core changes:
- The gpiolib MMIO driver has been enhanced to handle two direction
registers, i.e. one register to set lines as input and one register
to set lines as output. It turns out some silicon engineer thinks
the ability to configure a line as input and output at the same
time makes sense, this can be debated but includes a lot of analog
electronics reasoning, and the registers are there and need to be
handled consistently. Unsurprisingly, we enforce the lines to be
either inputs or outputs in such schemes.
- Send in the proper argument value to .set_config() dispatched to
the pin control subsystem. Nobody used it before, now someone does,
so fix it to work as expected.
- The ACPI gpiolib portions can now handle pin bias setting (pull up
or pull down). This has been in the ACPI spec for years and we
finally have it properly integrated with Linux GPIOs. It was based
on an observation from Andy Schevchenko that Thomas Petazzoni's
changes to the core for biasing the PCA950x GPIO expander actually
happen to fit hand-in-glove with what the ACPI core needed. Such
nice synergies happen sometimes.
New drivers:
- A new driver for the Mellanox BlueField GPIO controller. This is
using 64bit MMIO registers and can configure lines as inputs and
outputs at the same time and after improving the MMIO library we
handle it just fine. Interesting.
- A new IXP4xx proper gpiochip driver with hierarchical interrupts
should be coming in from the ARM SoC tree as well.
Driver enhancements:
- The PCA053x driver handles the CAT9554 GPIO expander.
- The PCA053x driver handles the NXP PCAL6416 GPIO expander.
- Wake-up support on PCA053x GPIO lines.
- OMAP now does a nice asynchronous IRQ handling on wake-ups by
letting everything wake up on edges, and this makes runtime PM work
as expected too.
Misc:
- Several cleanups such as devres fixes.
- Get rid of some languager comstructs that cause problems when
compiling with LLVMs clang.
- Documentation review and update"
* tag 'gpio-v5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (85 commits)
gpio: Update documentation
docs: gpio: convert docs to ReST and rename to *.rst
gpio: sch: Remove write-only core_base
gpio: pxa: Make two symbols static
gpiolib: acpi: Respect pin bias setting
gpiolib: acpi: Add acpi_gpio_update_gpiod_lookup_flags() helper
gpiolib: acpi: Set pin value, based on bias, more accurately
gpiolib: acpi: Change type of dflags
gpiolib: Introduce GPIO_LOOKUP_FLAGS_DEFAULT
gpiolib: Make use of enum gpio_lookup_flags consistent
gpiolib: Indent entry values of enum gpio_lookup_flags
gpio: pca953x: add support for pca6416
dt-bindings: gpio: pca953x: document the nxp,pca6416
gpio: pca953x: add pcal6416 to the of_device_id table
gpio: gpio-omap: Remove conditional pm_runtime handling for GPIO interrupts
gpio: gpio-omap: configure edge detection for level IRQs for idle wakeup
tracing: stop making gpio tracing configurable
gpio: pca953x: Configure wake-up path when wake-up is enabled
gpio: of: Optimize quirk checks
gpio: mmio: Drop bgpio_dir_inverted
...
To disable a tracing option using the trace_options file, the option
name needs to be prefixed with 'no', and not suffixed, as the README
states. Fix it.
Link: http://lkml.kernel.org/r/154872690031.47356.5739053380942044586.stgit@srivatsa-ubuntu
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Replace kzalloc() function with its 2-factor argument form, kcalloc().
This patch replaces cases of:
kzalloc(a * b, gfp)
with:
kcalloc(a, b, gfp)
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20190115043408.GA23456@embeddedor
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When reading only part of the id file, the ppos isn't tracked correctly.
This is taken care by simple_read_from_buffer.
Reading a single byte, and then the next byte would result EOF.
While this seems like not a big deal, this breaks abstractions that
reads information from files unbuffered. See for example
https://github.com/golang/go/issues/29399
This code was mentioned as problematic in
commit cd458ba9d5
("tracing: Do not (ab)use trace_seq in event_id_read()")
An example C code that show this bug is:
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
int main(int argc, char **argv) {
if (argc < 2)
return 1;
int fd = open(argv[1], O_RDONLY);
char c;
read(fd, &c, 1);
printf("First %c\n", c);
read(fd, &c, 1);
printf("Second %c\n", c);
}
Then run with, e.g.
sudo ./a.out /sys/kernel/debug/tracing/events/tcp/tcp_set_state/id
You'll notice you're getting the first character twice, instead of the
first two characters in the id file.
Link: http://lkml.kernel.org/r/20181231115837.4932-1-elazar@lightbitslabs.com
Cc: Orit Wasserman <orit.was@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 23725aeeab ("ftrace: provide an id file for each event")
Signed-off-by: Elazar Leibovich <elazar@lightbitslabs.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When building a allmodconfig kernel for arm64 and boot that in qemu,
CONFIG_FTRACE_STARTUP_TEST gets enabled and that takes time so the
watchdog expires and prints out a message like this:
'watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]'
Depending on what the what test gets called from init_trace_selftests()
it stays minutes in the loop.
Rework so that function cond_resched() gets called in the
init_trace_selftests loop.
Link: http://lkml.kernel.org/r/20181130145622.26334-1-anders.roxell@linaro.org
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Trivial fix to clean up an indentation issue, a whole chunk of code
has an extra space in the indentation.
Link: http://lkml.kernel.org/r/20181109132312.20994-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Automatic const char[] variables cause unnecessary code
generation. For example, the this_mod variable leads to
3f04: 48 b8 5f 5f 74 68 69 73 5f 6d movabs $0x6d5f736968745f5f,%rax # __this_m
3f0e: 4c 8d 44 24 02 lea 0x2(%rsp),%r8
3f13: 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
3f18: 48 89 44 24 02 mov %rax,0x2(%rsp)
3f1d: 4c 89 e9 mov %r13,%rcx
3f20: b8 65 00 00 00 mov $0x65,%eax # e
3f25: 48 c7 c2 00 00 00 00 mov $0x0,%rdx
3f28: R_X86_64_32S .rodata.str1.1+0x18d
3f2c: be 48 00 00 00 mov $0x48,%esi
3f31: c7 44 24 0a 6f 64 75 6c movl $0x6c75646f,0xa(%rsp) # odul
3f39: 66 89 44 24 0e mov %ax,0xe(%rsp)
i.e., the string gets built on the stack at runtime. Similar code can be
found for the other instances I'm replacing here. Putting the string
in .rodata reduces the combined .text+.rodata size and saves time and
stack space at runtime.
The simplest fix, and what I've done for the this_mod case, is to just
make the variable static.
However, for the "<faulted>" case where the same string is used twice,
that prevents the linker from merging those two literals, so instead use
a macro - that also keeps the two instances automatically in
sync (instead of only the compile-time strlen expression).
Finally, for the two runs of spaces, it turns out that the "build
these strings on the stack" is not the worst part of what gcc does -
it turns print_func_help_header_irq() into "if (tgid) { /*
print_event_info + five seq_printf calls */ } else { /* print
event_info + another five seq_printf */}". Taking inspiration from a
suggestion from Al Viro, use %.*s to make snprintf either stop after
the first two spaces or print the whole string. As a bonus, the
seq_printfs now fit on single lines (at least, they are not longer
than the existing ones in the function just above), making it easier
to see that the ascii art lines up.
x86-64 defconfig + CONFIG_FUNCTION_TRACER:
$ scripts/stackdelta /tmp/stackusage.{0,1}
./kernel/trace/ftrace.c ftrace_mod_callback 152 136 -16
./kernel/trace/trace.c trace_default_header 56 32 -24
./kernel/trace/trace.c tracing_mark_raw_write 96 72 -24
./kernel/trace/trace.c tracing_mark_write 104 80 -24
bloat-o-meter
add/remove: 1/0 grow/shrink: 0/4 up/down: 14/-375 (-361)
Function old new delta
this_mod - 14 +14
ftrace_mod_callback 577 542 -35
tracing_mark_raw_write 444 374 -70
tracing_mark_write 616 540 -76
trace_default_header 600 406 -194
Link: http://lkml.kernel.org/r/20190320081757.6037-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix to make the type of $comm "string". If we set the other type to $comm
argument, it shows meaningless value or wrong data. Currently probe events
allow us to set string array type (e.g. ":string[2]"), or other digit types
like x8 on $comm. But since clearly $comm is just a string data, it should
not be fetched by other types including array.
Link: http://lkml.kernel.org/r/155723736241.9149.14582064184468574539.stgit@devnote2
Cc: Andreas Ziegler <andreas.ziegler@fau.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 533059281e ("tracing: probeevent: Introduce new argument fetching code")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Do not accumulate strlen result on "ret" local variable, because
it is accumulated on "total" local variable for array case.
Link: http://lkml.kernel.org/r/155723735237.9149.3192150444705457531.stgit@devnote2
Fixes: 40b53b7718 ("tracing: probeevent: Add array type support")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since commit 533059281e ("tracing: probeevent: Introduce new
argument fetching code") dropped the $comm support from uprobe
events, this re-enables it.
For $comm support, uses strlcpy() instead of strncpy_from_user()
to copy current task's comm. Because it is in the kernel space,
strncpy_from_user() always fails to copy the comm.
This also uses strlen() instead of strnlen_user() to measure the
length of the comm.
Note that this uses -ECOMM as a token value to fetch the comm
string. If the user-space pointer points -ECOMM, it will be
translated to task->comm.
Link: http://lkml.kernel.org/r/155723734162.9149.4042756162201097965.stgit@devnote2
Fixes: 533059281e ("tracing: probeevent: Introduce new argument fetching code")
Reported-by: Andreas Ziegler <andreas.ziegler@fau.de>
Acked-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull networking updates from David Miller:
"Highlights:
1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.
2) Add fib_sync_mem to control the amount of dirty memory we allow to
queue up between synchronize RCU calls, from David Ahern.
3) Make flow classifier more lockless, from Vlad Buslov.
4) Add PHY downshift support to aquantia driver, from Heiner
Kallweit.
5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
contention on SLAB spinlocks in heavy RPC workloads.
6) Partial GSO offload support in XFRM, from Boris Pismenny.
7) Add fast link down support to ethtool, from Heiner Kallweit.
8) Use siphash for IP ID generator, from Eric Dumazet.
9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
entries, from David Ahern.
10) Move skb->xmit_more into a per-cpu variable, from Florian
Westphal.
11) Improve eBPF verifier speed and increase maximum program size,
from Alexei Starovoitov.
12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
spinlocks. From Neil Brown.
13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.
14) Improve link partner cap detection in generic PHY code, from
Heiner Kallweit.
15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
Maguire.
16) Remove SKB list implementation assumptions in SCTP, your's truly.
17) Various cleanups, optimizations, and simplifications in r8169
driver. From Heiner Kallweit.
18) Add memory accounting on TX and RX path of SCTP, from Xin Long.
19) Switch PHY drivers over to use dynamic featue detection, from
Heiner Kallweit.
20) Support flow steering without masking in dpaa2-eth, from Ioana
Ciocoi.
21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
Pirko.
22) Increase the strict parsing of current and future netlink
attributes, also export such policies to userspace. From Johannes
Berg.
23) Allow DSA tag drivers to be modular, from Andrew Lunn.
24) Remove legacy DSA probing support, also from Andrew Lunn.
25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
Haabendal.
26) Add a generic tracepoint for TX queue timeouts to ease debugging,
from Cong Wang.
27) More indirect call optimizations, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
cxgb4: Fix error path in cxgb4_init_module
net: phy: improve pause mode reporting in phy_print_status
dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
net: macb: Change interrupt and napi enable order in open
net: ll_temac: Improve error message on error IRQ
net/sched: remove block pointer from common offload structure
net: ethernet: support of_get_mac_address new ERR_PTR error
net: usb: smsc: fix warning reported by kbuild test robot
staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
net: dsa: support of_get_mac_address new ERR_PTR error
net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
vrf: sit mtu should not be updated when vrf netdev is the link
net: dsa: Fix error cleanup path in dsa_init_module
l2tp: Fix possible NULL pointer dereference
taprio: add null check on sched_nest to avoid potential null pointer dereference
net: mvpp2: cls: fix less than zero check on a u32 variable
net_sched: sch_fq: handle non connected flows
net_sched: sch_fq: do not assume EDT packets are ordered
net: hns3: use devm_kcalloc when allocating desc_cb
net: hns3: some cleanup for struct hns3_enet_ring
...
Pull x86 mm updates from Ingo Molnar:
"The changes in here are:
- text_poke() fixes and an extensive set of executability lockdowns,
to (hopefully) eliminate the last residual circumstances under
which we are using W|X mappings even temporarily on x86 kernels.
This required a broad range of surgery in text patching facilities,
module loading, trampoline handling and other bits.
- tweak page fault messages to be more informative and more
structured.
- remove DISCONTIGMEM support on x86-32 and make SPARSEMEM the
default.
- reduce KASLR granularity on 5-level paging kernels from 512 GB to
1 GB.
- misc other changes and updates"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/mm: Initialize PGD cache during mm initialization
x86/alternatives: Add comment about module removal races
x86/kprobes: Use vmalloc special flag
x86/ftrace: Use vmalloc special flag
bpf: Use vmalloc special flag
modules: Use vmalloc special flag
mm/vmalloc: Add flag for freeing of special permsissions
mm/hibernation: Make hibernation handle unmapped pages
x86/mm/cpa: Add set_direct_map_*() functions
x86/alternatives: Remove the return value of text_poke_*()
x86/jump-label: Remove support for custom text poker
x86/modules: Avoid breaking W^X while loading modules
x86/kprobes: Set instruction page as executable
x86/ftrace: Set trampoline pages as executable
x86/kgdb: Avoid redundant comparison of patched code
x86/alternatives: Use temporary mm for text poking
x86/alternatives: Initialize temporary mm for patching
fork: Provide a function for copying init_mm
uprobes: Initialize uprobes earlier
x86/mm: Save debug registers when loading a temporary mm
...
Pull stack trace updates from Ingo Molnar:
"So Thomas looked at the stacktrace code recently and noticed a few
weirdnesses, and we all know how such stories of crummy kernel code
meeting German engineering perfection end: a 45-patch series to clean
it all up! :-)
Here's the changes in Thomas's words:
'Struct stack_trace is a sinkhole for input and output parameters
which is largely pointless for most usage sites. In fact if embedded
into other data structures it creates indirections and extra storage
overhead for no benefit.
Looking at all usage sites makes it clear that they just require an
interface which is based on a storage array. That array is either on
stack, global or embedded into some other data structure.
Some of the stack depot usage sites are outright wrong, but
fortunately the wrongness just causes more stack being used for
nothing and does not have functional impact.
Another oddity is the inconsistent termination of the stack trace
with ULONG_MAX. It's pointless as the number of entries is what
determines the length of the stored trace. In fact quite some call
sites remove the ULONG_MAX marker afterwards with or without nasty
comments about it. Not all architectures do that and those which do,
do it inconsistenly either conditional on nr_entries == 0 or
unconditionally.
The following series cleans that up by:
1) Removing the ULONG_MAX termination in the architecture code
2) Removing the ULONG_MAX fixups at the call sites
3) Providing plain storage array based interfaces for stacktrace
and stackdepot.
4) Cleaning up the mess at the callsites including some related
cleanups.
5) Removing the struct stack_trace based interfaces
This is not changing the struct stack_trace interfaces at the
architecture level, but it removes the exposure to the generic
code'"
* 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
x86/stacktrace: Use common infrastructure
stacktrace: Provide common infrastructure
lib/stackdepot: Remove obsolete functions
stacktrace: Remove obsolete functions
livepatch: Simplify stack trace retrieval
tracing: Remove the last struct stack_trace usage
tracing: Simplify stack trace retrieval
tracing: Make ftrace_trace_userstack() static and conditional
tracing: Use percpu stack trace buffer more intelligently
tracing: Simplify stacktrace retrieval in histograms
lockdep: Simplify stack trace handling
lockdep: Remove save argument from check_prev_add()
lockdep: Remove unused trace argument from print_circular_bug()
drm: Simplify stacktrace handling
dm persistent data: Simplify stack trace handling
dm bufio: Simplify stack trace retrieval
btrfs: ref-verify: Simplify stack trace retrieval
dma/debug: Simplify stracktrace retrieval
fault-inject: Simplify stacktrace retrieval
mm/page_owner: Simplify stack trace handling
...
Pull objtool updates from Ingo Molnar:
"This is a series from Peter Zijlstra that adds x86 build-time uaccess
validation of SMAP to objtool, which will detect and warn about the
following uaccess API usage bugs and weirdnesses:
- call to %s() with UACCESS enabled
- return with UACCESS enabled
- return with UACCESS disabled from a UACCESS-safe function
- recursive UACCESS enable
- redundant UACCESS disable
- UACCESS-safe disables UACCESS
As it turns out not leaking uaccess permissions outside the intended
uaccess functionality is hard when the interfaces are complex and when
such bugs are mostly dormant.
As a bonus we now also check the DF flag. We had at least one
high-profile bug in that area in the early days of Linux, and the
checking is fairly simple. The checks performed and warnings emitted
are:
- call to %s() with DF set
- return with DF set
- return with modified stack frame
- recursive STD
- redundant CLD
It's all x86-only for now, but later on this can also be used for PAN
on ARM and objtool is fairly cross-platform in principle.
While all warnings emitted by this new checking facility that got
reported to us were fixed, there might be GCC version dependent
warnings that were not reported yet - which we'll address, should they
trigger.
The warnings are non-fatal build warnings"
* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
mm/uaccess: Use 'unsigned long' to placate UBSAN warnings on older GCC versions
x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation
sched/x86_64: Don't save flags on context switch
objtool: Add Direction Flag validation
objtool: Add UACCESS validation
objtool: Fix sibling call detection
objtool: Rewrite alt->skip_orig
objtool: Add --backtrace support
objtool: Rewrite add_ignores()
objtool: Handle function aliases
objtool: Set insn->func for alternatives
x86/uaccess, kcov: Disable stack protector
x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP
x86/uaccess, ubsan: Fix UBSAN vs. SMAP
x86/uaccess, kasan: Fix KASAN vs SMAP
x86/smap: Ditch __stringify()
x86/uaccess: Introduce user_access_{save,restore}()
x86/uaccess, signal: Fix AC=1 bloat
x86/uaccess: Always inline user_access_begin()
x86/uaccess, xen: Suppress SMAP warnings
...
The 'ftdump' command in kdb is currently a bit of a last resort, at
least if you have lots of traces turned on. It's going to print a
whole boatload of data out your serial port which is probably running
at 115200. This could easily take many, many minutes.
Usually you're most interested in what's at the _end_ of the ftrace
buffer, AKA what happened most recently. That means you've got to
wait the full time for the dump. The 'ftdump' command does attempt to
help you a little bit by allowing you to skip a fixed number of
entries. Unfortunately it provides no way for you to know how many
entries you should skip.
Let's do similar to python and allow you to use a negative number to
indicate that you want to skip all entries except the last few. This
allows you to quickly see what you want.
Note that we also change the printout in ftdump to print the
(positive) number of entries actually skipped since that could be
helpful to know when you've specified a negative skip count.
Link: http://lkml.kernel.org/r/20190319171206.97107-3-dianders@chromium.org
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
These two new exported functions will be used in a future patch by
kdb_ftdump() to quickly skip all but the last few trace entries.
Link: http://lkml.kernel.org/r/20190319171206.97107-2-dianders@chromium.org
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The things skipped by kdb's "ftdump" command when you pass it a
parameter has always been entries, not lines. The difference usually
doesn't matter but when the trace buffer has multi-line entries (like
a stack dump) it can matter.
Let's fix this both in the help text for ftdump and also in the local
variable names.
Link: http://lkml.kernel.org/r/20190319171206.97107-1-dianders@chromium.org
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When using a temporary mm, bpf_probe_write_user() should not be able to
write to user memory, since user memory addresses may be used to map
kernel memory. Detect these cases and fail bpf_probe_write_user() in
such cases.
Suggested-by: Jann Horn <jannh@google.com>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-24-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's only used in trace.c and there is absolutely no point in compiling it
in when user space stack traces are not supported.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094803.162400595@linutronix.de
The per cpu stack trace buffer usage pattern is odd at best. The buffer has
place for 512 stack trace entries on 64-bit and 1024 on 32-bit. When
interrupts or exceptions nest after the per cpu buffer was acquired the
stacktrace length is hardcoded to 8 entries. 512/1024 stack trace entries
in kernel stacks are unrealistic so the buffer is a complete waste.
Split the buffer into 4 nest levels, which are 128/256 entries per
level. This allows nesting contexts (interrupts, exceptions) to utilize the
cpu buffer for stack retrieval and avoids the fixed length allocation along
with the conditional execution pathes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094803.066064076@linutronix.de
- Remove the extra array member of stack_dump_trace[] along with the
ARRAY_SIZE - 1 initialization for struct stack_trace :: max_entries.
Both are historical leftovers of no value. The stack tracer never exceeds
the array and there is no extra storage requirement either.
- Make variables which are only used in trace_stack.c static.
- Simplify the enable/disable logic.
- Rename stack_trace_print() as it's using the stack_trace_ namespace. Free
the name up for stack trace related functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094801.230654524@linutronix.de
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-04-28
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Introduce BPF socket local storage map so that BPF programs can store
private data they associate with a socket (instead of e.g. separate hash
table), from Martin.
2) Add support for bpftool to dump BTF types. This is done through a new
`bpftool btf dump` sub-command, from Andrii.
3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which
was currently not supported since skb was used to lookup netns, from Stanislav.
4) Add an opt-in interface for tracepoints to expose a writable context
for attached BPF programs, used here for NBD sockets, from Matt.
5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel.
6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to
support tunnels such as sit. Add selftests as well, from Willem.
7) Various smaller misc fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an opt-in interface that allows a tracepoint to provide a safe
buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
The size of the buffer must be a compile-time constant, and is checked
before allowing a BPF program to attach to a tracepoint that uses this
feature.
The pointer to this buffer will be the first argument of tracepoints
that opt in; the pointer is valid and can be bpf_probe_read() by both
BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
programs that attach to such a tracepoint, but the buffer to which it
points may only be written by the latter.
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Use "nosteal" for ring buffer splice pages
- Memory leak fix in error path of trace_pid_write()
- Fix preempt_enable_no_resched() (use preempt_enable()) in ring buffer code
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXMMoghQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qmB1AQDfpVxYxcmxibBBAM6fZyILYpKqDWmy
ut6gHZ+GHhQT4AEAwSRsC6V4yO3d5dJFpkcQXUj1v+Ip9XU+dv//s8O6tAI=
=LsG/
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Three tracing fixes:
- Use "nosteal" for ring buffer splice pages
- Memory leak fix in error path of trace_pid_write()
- Fix preempt_enable_no_resched() (use preempt_enable()) in ring
buffer code"
* tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
trace: Fix preempt_enable_no_resched() abuse
tracing: Fix a memory leak by early error exit in trace_pid_write()
tracing: Fix buffer_ref pipe ops
Unless the very next line is schedule(), or implies it, one must not use
preempt_enable_no_resched(). It can cause a preemption to go missing and
thereby cause arbitrary delays, breaking the PREEMPT=y invariant.
Link: http://lkml.kernel.org/r/20190423200318.GY14281@hirez.programming.kicks-ass.net
Cc: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: the arch/x86 maintainers <x86@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: stable@vger.kernel.org
Fixes: 2c2d7329d8 ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In trace_pid_write(), the buffer for trace parser is allocated through
kmalloc() in trace_parser_get_init(). Later on, after the buffer is used,
it is then freed through kfree() in trace_parser_put(). However, it is
possible that trace_pid_write() is terminated due to unexpected errors,
e.g., ENOMEM. In that case, the allocated buffer will not be freed, which
is a memory leak bug.
To fix this issue, free the allocated buffer when an error is encountered.
Link: http://lkml.kernel.org/r/1555726979-15633-1-git-send-email-wang6495@umn.edu
Fixes: f4d34a87e9 ("tracing: Use pid bitmap instead of a pid array for set_event_pid")
Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This fixes multiple issues in buffer_pipe_buf_ops:
- The ->steal() handler must not return zero unless the pipe buffer has
the only reference to the page. But generic_pipe_buf_steal() assumes
that every reference to the pipe is tracked by the page's refcount,
which isn't true for these buffers - buffer_pipe_buf_get(), which
duplicates a buffer, doesn't touch the page's refcount.
Fix it by using generic_pipe_buf_nosteal(), which refuses every
attempted theft. It should be easy to actually support ->steal, but the
only current users of pipe_buf_steal() are the virtio console and FUSE,
and they also only use it as an optimization. So it's probably not worth
the effort.
- The ->get() and ->release() handlers can be invoked concurrently on pipe
buffers backed by the same struct buffer_ref. Make them safe against
concurrency by using refcount_t.
- The pointers stored in ->private were only zeroed out when the last
reference to the buffer_ref was dropped. As far as I know, this
shouldn't be necessary anyway, but if we do it, let's always do it.
Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2019-04-22
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) allow stack/queue helpers from more bpf program types, from Alban.
2) allow parallel verification of root bpf programs, from Alexei.
3) introduce bpf sysctl hook for trusted root cases, from Andrey.
4) recognize var/datasec in btf deduplication, from Andrii.
5) cpumap performance optimizations, from Jesper.
6) verifier prep for alu32 optimization, from Jiong.
7) libbpf xsk cleanup, from Magnus.
8) other various fixes and cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The function_graph boot up self test emulates the tr->init() function in
order to add a wrapper around the function graph tracer entry code to test
for lock ups and such. But it does not emulate the tr->reset(), and just
calls the function_graph tracer tr->reset() function which will use its own
fgraph_ops to unregister function tracing with. As the fgraph_ops is
becoming more meaningful with the register_ftrace_graph() and
unregister_ftrace_graph() functions, the two need to be the same. The
emulated tr->init() uses its own fgraph_ops descriptor, which means the
unregister_ftrace_graph() must use the same ftrace_ops, which the selftest
currently does not do. By emulating the tr->reset() as the selftest does
with the tr->init() it will be able to pass the same fgraph_ops descriptor
to the unregister_ftrace_graph() as it did with the register_ftrace_graph().
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
commit f1a2e44a3a ("bpf: add queue and stack maps") introduced new BPF
helper functions:
- BPF_FUNC_map_push_elem
- BPF_FUNC_map_pop_elem
- BPF_FUNC_map_peek_elem
but they were made available only for network BPF programs. This patch
makes them available for tracepoint, cgroup and lirc programs.
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Merge page ref overflow branch.
Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).
Admittedly it's not exactly easy. To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers. Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).
Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication. So let's just do that.
* branch page-refs:
fs: prevent page refcount overflow in pipe_buf_get
mm: prevent get_user_pages() from overflowing page refcount
mm: add 'try_get_page()' helper function
mm: make page ref count overflow check tighter and more explicit
No architecture terminates the stack trace with ULONG_MAX anymore. As the
code checks the number of entries stored anyway there is no point in
keeping all that ULONG_MAX magic around.
The histogram code zeroes the storage before saving the stack, so if the
trace is shorter than the maximum number of entries it can terminate the
print loop if a zero entry is detected.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Link: https://lkml.kernel.org/r/20190410103645.048761764@linutronix.de
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function_graph tracer has a stub function and its ops flag has the
FTRACE_OPS_FL_STUB set. As the function graph does not use the
ftrace_ops->func pointer but instead is called by a separate part of the
ftrace trampoline. The function_graph tracer still requires to pass in a
ftrace_ops that may also hold the hash of the functions to call. But there's
no reason to test that hash in the function tracing portion. Instead of
testing to see if we should call the stub function, just test if the ops has
FTRACE_OPS_FL_STUB set, and just skip it.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ASSIGN_OPS_HASH() macro was moved to fgraph.c where it was used, but for
some reason it wasn't removed from ftrace.c, as it is no longer referenced
there.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As each instance has their own error_log file, it makes more sense that the
instances show the errors of their own instead of all error_logs having the
same data. Make it that the errors show up in the instance error_log file
that the error happens in. If no instance trace_array is available, then
NULL can be passed in which will create the error in the top level instance
(the one at the top of the tracefs directory).
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Have the trace_array that associates the trace instance of the histogram
passed around to functions so that error handling can display the error
message in the proper instance.
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pass in the trace_array that represents the instance the filter being
changed is in to create_event_filter(). This will allow for error messages
that happen when writing to the filter can be displayed in the proper
instance "error_log" file.
Note, for calls to create_filter() (that was also modified to support
create_event_filter()), that changes filters that do not exist in a instance
(for perf for example), NULL may be passed in, which means that there will
not be any message to log for that filter.
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
gpio tracing was made configurable in 4.4-rc1 (commit ddd70280bf
("tracing: gpio: Add Kconfig option for enabling/disabling trace
events")). Since then it is the only event type that can be compiled
conditionally. Given that there is only little overhead I don't
understand the reasoning and I was annoyed more than once that gpio
events were not available without recompiling.
So drop the Kconfig symbol and make gpio events available
unconditionally.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
The only users that calls syscall_get_arguments() with a variable and not a
hard coded '6' is ftrace_syscall_enter(). syscall_get_arguments() can be
optimized by removing a variable input, and always grabbing 6 arguments
regardless of what the system call actually uses.
Change ftrace_syscall_enter() to pass the 6 args into a local stack array
and copy the necessary arguments into the trace event as needed.
This is needed to remove two parameters from syscall_get_arguments().
Link: http://lkml.kernel.org/r/20161107213233.627583542@goodmis.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
For CONFIG_TRACE_BRANCH_PROFILING=y the likely/unlikely things get
overloaded and generate callouts to this code, and thus also when
AC=1.
Make it safe.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use tracing error_log with probe events for logging error more
precisely. This also makes all parse error returns -EINVAL
(except for -ENOMEM), because user can see better error message
in error_log file now.
Link: http://lkml.kernel.org/r/6a4d90e141d138040ea61f4776b991597077451e.1554072478.git.tom.zanussi@linux.intel.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Use tracing_log_err() from the new tracing error_log mechanism to send
filter parse errors to tracing/error_log.
With this change, users will be able to see filter errors by looking
at tracing/error_log.
The same errors will also be available in the filter file, as
expected.
Link: http://lkml.kernel.org/r/1d942c419941539a11d78a6810fc5740a99b2974.1554072478.git.tom.zanussi@linux.intel.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Replace hist_err() and hist_err_event() with tracing_log_err() from
the new tracing error_log mechanism.
Also add a couple related helper functions and remove most of the old
hist_err()-related code.
With this change, users no longer read the hist files for hist trigger
error information, but instead look at tracing/error_log for the same
information.
Link: http://lkml.kernel.org/r/c98f77a97c9715d18b623eeb5741057b330d5ac0.1554072478.git.tom.zanussi@linux.intel.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation for making use of the new trace error log, save the
subsystem and event name associated with the last hist command - it
will be passed as the location param in the event_log_err() calls.
Link: http://lkml.kernel.org/r/eb0fd1362be8f39facb86c83eecf441b7a5876f8.1554072478.git.tom.zanussi@linux.intel.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Introduce a new ftrace file, tracing/error_log, for ftrace commands to
log errors. This is useful for allowing more complex commands such as
hist trigger and kprobe_event commands to point out specifically where
something may have gone wrong without forcing them to resort to more
ad hoc methods such as tacking error messages onto existing output
files.
To log a tracing error, call the event_log_err() function, passing it
a location string describing where it came from e.g. kprobe_events or
system:event, the command that caused the error, an array of static
error strings describing errors and an index within that array which
describes the specific error, along with the position to place the
error caret.
Reading the log displays the last (currently) 8 errors logged in the
following format:
[timestamp] <loc>: error: <static error text>
Command: <command that caused the error>
^
Memory for the error log isn't allocated unless there has been a trace
event error, and the error log can be cleared and have its memory
freed by writing the empty string in truncation mode to it:
# echo > tracing/error_log.
Link: http://lkml.kernel.org/r/0c2c82571fd38c5f3a88ca823627edff250e9416.1554072478.git.tom.zanussi@linux.intel.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Improvements-suggested-by: Steve Rostedt <rostedt@goodmis.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Ftrace provides the feature “instances” that provides the capability to
create multiple Ftrace ring buffers. However, currently these buffers
are created/accessed via userspace only. The kernel APIs providing these
features are not exported, hence cannot be used by other kernel
components.
This patch aims to extend this infrastructure to provide the
flexibility to create/log/remove/ enable-disable existing trace events
to these buffers from within the kernel.
Link: http://lkml.kernel.org/r/1553106531-3281-2-git-send-email-divya.indi@oracle.com
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix compile warning in create_dyn_event(): 'ret' may be used uninitialized
in this function [-Wuninitialized].
Link: http://lkml.kernel.org/r/1553237900-8555-1-git-send-email-frowand.list@gmail.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Fixes: 5448d44c38 ("tracing: Add unified dynamic event framework")
Signed-off-by: Frank Rowand <frank.rowand@sony.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyL124QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptsxD/42slmoE5TC3vwXcgMBEilrjIHCns6O4Leo
0r8Awdwil8QkVDphfAWsgkTBjRPUNKv4cCg2kG4VEzAy62YSutUWPeqJZwLOpGDI
kji9XI6WLqwQ/VhDFwEln9G+xWDUQxds5PZDomlzLpjiNqkFArwwsPFnJbshH4fB
U6kZrhVSLfvJHIJmC9H4RIWuTEwUH1yFSvzzMqDOOyvRon2g/A2YlHb2KhSCaJPq
1b0jbhyR0GVP0EH1FdeKvNYFZfvXXSPAbxDN1CEtW/Lq8WxXeoaCj390tC+gL7yQ
WWHntvUoVU/weWudbT3tVsYgpI91KfPM5OuWTDGod6lFwHrI5X91Pao3KYUGPb9d
cwvNBOlkNqR1ENZOGTgxLeKwiwV7G1DIjvsaijRQJhGy4Uw4RkM/YEct9JHxWBIF
x4ZuSVUVZ5Y3zNPC945iJ6Z5feOz/UO9bQL00oimu0c0JhAp++3pHWAFJEMQ8q1a
0IRifkeUyhf0p9CIVPDnUzmNgSBglFkAVTPVAWySBVDU+v0/GoNcYwTzPq4cgPrF
UJEIlx+RdDpKKmCqBvKjtx4w7BC1lCebL/1ZJrbARNO42djt8xeuyvKw0t+MYVTZ
UsvLX72tXwUIbj0IZZGuz+8uSGD4ddDs8+x486FN4oaCPf36FUnnkOZZkhjV/KQA
vsZNrNNZpw==
=qBae
-----END PGP SIGNATURE-----
Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block
Pull more block layer changes from Jens Axboe:
"This is a collection of both stragglers, and fixes that came in after
I finalized the initial pull. This contains:
- An MD pull request from Song, with a few minor fixes
- Set of NVMe patches via Christoph
- Pull request from Konrad, with a few fixes for xen/blkback
- pblk fix IO calculation fix (Javier)
- Segment calculation fix for pass-through (Ming)
- Fallthrough annotation for blkcg (Mathieu)"
* tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits)
blkcg: annotate implicit fall through
nvme-tcp: support C2HData with SUCCESS flag
nvmet: ignore EOPNOTSUPP for discard
nvme: add proper write zeroes setup for the multipath device
nvme: add proper discard setup for the multipath device
nvme: remove nvme_ns_config_oncs
nvme: disable Write Zeroes for qemu controllers
nvmet-fc: bring Disconnect into compliance with FC-NVME spec
nvmet-fc: fix issues with targetport assoc_list list walking
nvme-fc: reject reconnect if io queue count is reduced to zero
nvme-fc: fix numa_node when dev is null
nvme-fc: use nr_phys_segments to determine existence of sgl
nvme-loop: init nvmet_ctrl fatal_err_work when allocate
nvme: update comment to make the code easier to read
nvme: put ns_head ref if namespace fails allocation
nvme-trace: fix cdw10 buffer overrun
nvme: don't warn on block content change effects
nvme: add get-feature to admin cmds tracer
md: Fix failed allocation of md_register_thread
It's wrong to add len to sector_nr in raid10 reshape twice
...
error checks.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXIukmxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qi+PAQCKf6Yz7LZ3oBtjKy7jJRhkAn3Ie5Ls
n7KOXnOGntO1cgD/RdynJcFtpwgDxuj7L/c3iInel0B/rdU5VLbglXy+2AA=
=y/ft
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes and cleanups from Steven Rostedt:
"This contains a series of last minute clean ups, small fixes and error
checks"
* tag 'trace-v5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/probe: Verify alloc_trace_*probe() result
tracing/probe: Check event/group naming rule at parsing
tracing/probe: Check the size of argument name and body
tracing/probe: Check event name length correctly
tracing/probe: Check maxactive error cases
tracing: kdb: Fix ftdump to not sleep
trace/probes: Remove kernel doc style from non kernel doc comment
tracing/probes: Make reserved_field_names static
Since alloc_trace_*probe() returns -EINVAL only if !event && !group,
it should not happen in trace_*probe_create(). If we catch that case
there is a bug. So use WARN_ON_ONCE() instead of pr_info().
Link: http://lkml.kernel.org/r/155253785078.14922.16902223633734601469.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Ensure given name of event is not too long when parsing it,
and fix to update event name offset correctly when the group
name is given. For example, this makes probe event to check
the "p:foo/" error case correctly.
Link: http://lkml.kernel.org/r/155253782046.14922.14724124823730168629.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Check maxactive on kprobe error case, because maxactive
is only for kretprobe, not for kprobe. Also, maxactive
should not be 0, it should be at least 1.
Link: http://lkml.kernel.org/r/155253780952.14922.15784129810238750331.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).
This commit remove the following warning:
kernel/trace/blktrace.c:725:9: warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
BUG for "sleeping function called from invalid context".
kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
atomic context. A very simple solution for this is to add allocation
flags to ring_buffer_read_prepare() so kdb can call it without
triggering the allocation error. This patch does that.
Note that in the original email thread about this, it was suggested
that perhaps the solution for kdb was to either preallocate the buffer
ahead of time or create our own iterator. I'm hoping that this
alternative of adding allocation flags to ring_buffer_read_prepare()
can be considered since it means I don't need to duplicate more of the
core trace code into "trace_kdb.c" (for either creating my own
iterator or re-preparing a ring allocator whose memory was already
allocated).
NOTE: another option for kdb is to actually figure out how to make it
reuse the existing ftrace_dump() function and totally eliminate the
duplication. This sounds very appealing and actually works (the "sr
z" command can be seen to properly dump the ftrace buffer). The
downside here is that ftrace_dump() fully consumes the trace buffer.
Unless that is changed I'd rather not use it because it means "ftdump
| grep xyz" won't be very useful to search the ftrace buffer since it
will throw away the whole trace on the first grep. A future patch to
dump only the last few lines of the buffer will also be hard to
implement.
[1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com
Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org
Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull misc vfs updates from Al Viro:
"Assorted fixes (really no common topic here)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Make __vfs_write() static
vfs: fix preadv64v2 and pwritev64v2 compat syscalls with offset == -1
pipe: stop using ->can_merge
splice: don't merge into linked buffers
fs: move generic stat response attr handling to vfs_getattr_nosec
orangefs: don't reinitialize result_mask in ->getattr
fs/devpts: always delete dcache dentry-s in dput()
CC kernel/trace/trace_kprobe.o
kernel/trace/trace_kprobe.c:41: warning: cannot understand function prototype: 'struct trace_kprobe '
The real problem is that a comment looked like kerneldoc when it shouldn't be...
Link: http://lkml.kernel.org/r/2812.1552381112@turing-police
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
sparse complains:
CHECK kernel/trace/trace_probe.c
kernel/trace/trace_probe.c:16:12: warning: symbol 'reserved_field_names' was not declared. Should it be static?
Yes, it should be static.
Link: http://lkml.kernel.org/r/2478.1552380778@turing-police
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Add "onchange(var)" histogram handler that executes a action when $var
changes.
- Add new "snapshot()" action for histogram handlers, that causes a
snapshot of the ring buffer when triggered.
ie. onchange(var).snapshot() will trigger a snapshot if var changes.
- Add alternative for "trace()" action.
Currently, to trigger a synthetic event, the name of that event is used
as the handler name, which is inconsistent with the other actions.
onchange(var).synthetic(param) where it can now be
onchange(var).trace(synthetic, param). The older method will still be
allowed, as long as the synthetic events do not overlap with other
handler names.
- The histogram documentation at testcases were updated for the new
changes.
Added a quicker way to enable set_ftrace_filter files, that will make
it much quicker to bisect tracing a function that shouldn't be traced and
crashes the kernel. (You can echo in numbers to set_ftrace_filter, and it
will select the corresponding function that is in
available_filter_functions).
Some better displaying of the tracing data (and more information was added).
The rest are small fixes and more clean ups to the code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXIXXjRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrSJAQCbGXAvZE+shfKRhbU1cu1C1nwRMHhH
eeRecJs1RChGFgD/TwatD4FzARQPjfk7snQD5KWPpoRc9grUACC2cZcaWwQ=
=LVBI
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The biggest change for this release is in the histogram code:
- Add "onchange(var)" histogram handler that executes a action when
$var changes.
- Add new "snapshot()" action for histogram handlers, that causes a
snapshot of the ring buffer when triggered. ie.
onchange(var).snapshot() will trigger a snapshot if var changes.
- Add alternative for "trace()" action. Currently, to trigger a
synthetic event, the name of that event is used as the handler
name, which is inconsistent with the other actions.
onchange(var).synthetic(param) where it can now be
onchange(var).trace(synthetic, param). The older method will still
be allowed, as long as the synthetic events do not overlap with
other handler names.
- The histogram documentation at testcases were updated for the new
changes.
Outside of the histogram code, we have:
- Added a quicker way to enable set_ftrace_filter files, that will
make it much quicker to bisect tracing a function that shouldn't be
traced and crashes the kernel. (You can echo in numbers to
set_ftrace_filter, and it will select the corresponding function
that is in available_filter_functions).
- Some better displaying of the tracing data (and more information
was added).
The rest are small fixes and more clean ups to the code"
* tag 'trace-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (37 commits)
tracing: Use strncpy instead of memcpy when copying comm in trace.c
tracing: Use strncpy instead of memcpy when copying comm for hist triggers
tracing: Use strncpy instead of memcpy for string keys in hist triggers
tracing: Use str_has_prefix() in synth_event_create()
x86/ftrace: Fix warning and considate ftrace_jmp_replace() and ftrace_call_replace()
tracing/perf: Use strndup_user() instead of buggy open-coded version
doc: trace: Fix documentation for uprobe_profile
tracing: Fix spelling mistake: "analagous" -> "analogous"
tracing: Comment why cond_snapshot is checked outside of max_lock protection
tracing: Add hist trigger action 'expected fail' test case
tracing: Add alternative synthetic event trace action test case
tracing: Add hist trigger onchange() handler test case
tracing: Add hist trigger snapshot() action test case
tracing: Add SPDX license GPL-2.0 license identifier to inter-event testcases
tracing: Add alternative synthetic event trace action syntax
tracing: Add hist trigger onchange() handler Documentation
tracing: Add hist trigger onchange() handler
tracing: Add hist trigger snapshot() action Documentation
tracing: Add hist trigger snapshot() action
tracing: Add conditional snapshot
...
- do not generate unneeded top-level built-in.a
- let git ignore O= directory entirely
- optimize scripts/kallsyms slightly
- exclude DWARF info from *.s regardless of config options
- fix GCC toolchain search path for Clang to prepare ld.lld support
- do not generate modules.order when CONFIG_MODULES is disabled
- simplify single target rules and remove VPATH for external module build
- allow to add optional flags to dpkg-buildpackage when building deb-pkg
- move some compiler option tests from Makefile to Kconfig
- various Makefile cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJcgxYUAAoJED2LAQed4NsGr7YQAJq4LmN/aZDI9Mt0YAQjEyyA
PCpm8J2HI9HO1sMoY7J/ksWmV0BU25G+uspKD7dXAQo3l9fmahQM5e4dsyZ4Xqs8
DyyYSGtJJnMJaWmupIZNA4UKDCVtwPoVW8YeuK9rwADVokCux9avogof9O1OoA/E
Pylo+I4UCM82kbpZSd+UxnCx6B0v8XGtW+d31Q4yZXCkw5nw14chrlaprcqB3UgB
+7C3xOnDWCi7gyxaTqmD7dLay2DM8KCDlznEvBL733Y/cK3to1fywzEPzp0JQCLX
BLgmmpW13NF++q5BCoTW6sFjZAhBVbiYZwesMrCi75Y32T8zt4G5l4pkvGkSuGF/
UQh5aoCxaMIp70VPj/loZ0lh78nwVGTok9zRb0rfztM0X4DbmiPi5MNiHRzRpIeE
1jjEa/GK1t0TDnXc/MuDFK8cWwdhttIqUL5yWfAxjXbtP27eLtsopQUdW7EPHs7d
sMnfuSUuhOC28yByVxIkBcwawLyYrcWRphJ3ixCO70CoJWt2DT6aOKxcFJefoJix
Pto6Oo3oQ4iypMM5M9/0Uo+AK2TKRejWIqtZdbo+ir70tNxVH3WDZq++fG0drXOB
r2I/GY6nRjuzLOe2jzEqywFTFd2xpk4Qo84LGb1R3U6aU5qS2gA0W/q00JS5c2qU
R8uReJ7bvmLmrVNZ/NI4
=y9YG
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- do not generate unneeded top-level built-in.a
- let git ignore O= directory entirely
- optimize scripts/kallsyms slightly
- exclude DWARF info from *.s regardless of config options
- fix GCC toolchain search path for Clang to prepare ld.lld support
- do not generate modules.order when CONFIG_MODULES is disabled
- simplify single target rules and remove VPATH for external module
build
- allow to add optional flags to dpkg-buildpackage when building
deb-pkg
- move some compiler option tests from Makefile to Kconfig
- various Makefile cleanups
* tag 'kbuild-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (40 commits)
kbuild: remove scripts/basic/% build target
kbuild: use -Werror=implicit-... instead of -Werror-implicit-...
kbuild: clean up scripts/gcc-version.sh
kbuild: remove cc-version macro
kbuild: update comment block of scripts/clang-version.sh
kbuild: remove commented-out INITRD_COMPRESS
kbuild: move -gsplit-dwarf, -gdwarf-4 option tests to Kconfig
kbuild: [bin]deb-pkg: add DPKG_FLAGS variable
kbuild: move ".config not found!" message from Kconfig to Makefile
kbuild: invoke syncconfig if include/config/auto.conf.cmd is missing
kbuild: simplify single target rules
kbuild: remove empty rules for makefiles
kbuild: make -r/-R effective in top Makefile for old Make versions
kbuild: move tools_silent to a more relevant place
kbuild: compute false-positive -Wmaybe-uninitialized cases in Kconfig
kbuild: refactor cc-cross-prefix implementation
kbuild: hardcode genksyms path and remove GENKSYMS variable
scripts/gdb: refactor rules for symlink creation
kbuild: create symlink to vmlinux-gdb.py in scripts_gdb target
scripts/gdb: do not descend into scripts/gdb from scripts
...
A small fix Pavel sent me back in august was accidentally lost due to it
being placed with some other patches that failed some tests, and was rebased
out of my local tree. Which was a regression that caused event filters
not to handle negative numbers.
The clean up is from Masami that realized that the code in kprobes that
calls probe_mem_read() wrapper, which is to be used in code used by both
kprobes and uprobes, was only in code for kprobes. It should not use the
wrapper there, but instead call probe_kernel_read() directly.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXH2gZRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrTyAQC+dfxfyS32BWpGOwZbUIHGYYHtnpEB
7bIQzXy2q8r4YQD/UGJ0qoHur1gSMEIVjhIM0Qow4VepqqhthIAV/1iVbgU=
=F9yc
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.0-pre' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix/cleanup from Steven Rostedt:
"This is a "pre-pull". It's only one small fix and one small clean up.
I'm testing a few small patches for my real pull request which will
come at a later time. The second patch depends on your tree anyway so
I included it along with the urgent fix.
A small fix Pavel sent me back in august was accidentally lost due to
it being placed with some other patches that failed some tests, and
was rebased out of my local tree. Which was a regression that caused
event filters not to handle negative numbers.
The clean up is from Masami that realized that the code in kprobes
that calls probe_mem_read() wrapper, which is to be used in code used
by both kprobes and uprobes, was only in code for kprobes. It should
not use the wrapper there, but instead call probe_kernel_read()
directly"
* tag 'trace-v5.0-pre' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Use probe_kernel_read instead of probe_mem_read
tracing: Fix event filters and triggers to handle negative numbers
Because there may be random garbage beyond a string's null terminator,
code that might use the entire comm array e.g. histogram keys, can
give unexpected results if that garbage is copied in too, so avoid
that possibility by using strncpy instead of memcpy.
Link: http://lkml.kernel.org/r/1d6ebac26570c2a29ce9fb575379f17ef5c8b81b.1551802084.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Because there may be random garbage beyond a string's null terminator,
code that might use the entire comm array e.g. histogram keys, can
give unexpected results if that garbage is copied in too, so avoid
that possibility by using strncpy instead of memcpy.
Link: http://lkml.kernel.org/r/1eb9f096a8086c3c82c7fc087c900005143cec54.1551802084.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Because there may be random garbage beyond a string's null terminator,
it's not correct to copy the the complete character array for use as a
hist trigger key. This results in multiple histogram entries for the
'same' string key.
So, in the case of a string key, use strncpy instead of memcpy to
avoid copying in the extra bytes.
Before, using the gdbus entries in the following hist trigger as an
example:
# echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist
...
{ comm: ImgDecoder #4 } hitcount: 203
{ comm: gmain } hitcount: 213
{ comm: gmain } hitcount: 216
{ comm: StreamTrans #73 } hitcount: 221
{ comm: mozStorage #3 } hitcount: 230
{ comm: gdbus } hitcount: 233
{ comm: StyleThread#5 } hitcount: 253
{ comm: gdbus } hitcount: 256
{ comm: gdbus } hitcount: 260
{ comm: StyleThread#4 } hitcount: 271
...
# cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
51
After:
# cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
1
Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 79e577cbce ("tracing: Support string type key properly")
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Use probe_kernel_read() instead of probe_mem_read() because
probe_mem_read() is a kind of wrapper for switching memory
read function between uprobes and kprobes.
Link: http://lkml.kernel.org/r/20190222011643.3e19ade84a3db3e83518648f@kernel.org
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Then tracing syscall exit event it is extremely useful to filter exit
codes equal to some negative value, to react only to required errors.
But negative numbers does not work:
[root@snorch sys_exit_read]# echo "ret == -1" > filter
bash: echo: write error: Invalid argument
[root@snorch sys_exit_read]# cat filter
ret == -1
^
parse_error: Invalid value (did you forget quotes)?
Similar thing happens when setting triggers.
These is a regression in v4.17 introduced by the commit mentioned below,
testing without these commit shows no problem with negative numbers.
Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com
Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since -Wmaybe-uninitialized was introduced by GCC 4.7, we have patched
various false positives:
- commit e74fc973b6 ("Turn off -Wmaybe-uninitialized when building
with -Os") turned off this option for -Os.
- commit 815eb71e71 ("Kbuild: disable 'maybe-uninitialized' warning
for CONFIG_PROFILE_ALL_BRANCHES") turned off this option for
CONFIG_PROFILE_ALL_BRANCHES
- commit a76bcf557e ("Kbuild: enable -Wmaybe-uninitialized warning
for "make W=1"") turned off this option for GCC < 4.9
Arnd provided more explanation in https://lkml.org/lkml/2017/3/14/903
I think this looks better by shifting the logic from Makefile to Kconfig.
Link: https://github.com/ClangBuiltLinux/linux/issues/350
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
The first version of this method was missing the check for
`ret == PATH_MAX`; then such a check was added, but it didn't call kfree()
on error, so there was still a small memory leak in the error case.
Fix it by using strndup_user() instead of open-coding it.
Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 0eadcc7a7b ("perf/core: Fix perf_uprobe_init()")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Before setting tr->cond_snapshot, it must be NULL before it can be updated.
It can go to NULL when a trace event hist trigger is created or removed, and
can only be modified under the max_lock spin lock. But because it can only
be set to something other than NULL under both the max_lock spin lock as
well as the trace_types_lock, we can perform the check if it is not NULL
only under the trace_types_lock and fail out without having to grab the
max_lock spin lock.
This is very subtle, and deserves a comment.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a 'trace(synthetic_event_name, params)' alternative to
synthetic_event_name(params).
Currently, the syntax used for generating synthetic events is to
invoke synthetic_event_name(params) i.e. use the synthetic event name
as a function call.
Users requested a new form that more explicitly shows that the
synthetic event is in effect being traced. In this version, a new
'trace()' keyword is used, and the synthetic event name is passed in
as the first argument.
In addition, for the sake of consistency with other actions, change
the documention to emphasize the trace() form over the function-call
form, which remains documented as equivalent.
Link: http://lkml.kernel.org/r/d082773e50232a001480cf837679a1e01c1a2eb7.1550100284.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add support for hist:handlerXXX($var).snapshot(), which will take a
snapshot of the current trace buffer whenever handlerXXX is hit.
As a first user, this also adds snapshot() action support for the
onmax() handler i.e. hist:onmax($var).snapshot().
Also, the hist trigger key printing is moved into a separate function
so the snapshot() action can print a histogram key outside the
histogram display - add and use hist_trigger_print_key() for that
purpose.
Link: http://lkml.kernel.org/r/2f1a952c0dcd8aca8702ce81269581a692396d45.1550100284.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, tracing snapshots are context-free - they capture the ring
buffer contents at the time the tracing_snapshot() function was
invoked, and nothing else. Additionally, they're always taken
unconditionally - the calling code can decide whether or not to take a
snapshot, but the data used to make that decision is kept separately
from the snapshot itself.
This change adds the ability to associate with each trace instance
some user data, along with an 'update' function that can use that data
to determine whether or not to actually take a snapshot. The update
function can then update that data along with any other state (as part
of the data presumably), if warranted.
Because snapshots are 'global' per-instance, only one user can enable
and use a conditional snapshot for any given trace instance. To
enable a conditional snapshot (see details in the function and data
structure comments), the user calls tracing_snapshot_cond_enable().
Similarly, to disable a conditional snapshot and free it up for other
users, tracing_snapshot_cond_disable() should be called.
To actually initiate a conditional snapshot, tracing_snapshot_cond()
should be called. tracing_snapshot_cond() will invoke the update()
callback, allowing the user to decide whether or not to actually take
the snapshot and update the user-defined data associated with the
snapshot. If the callback returns 'true', tracing_snapshot_cond()
will then actually take the snapshot and return.
This scheme allows for flexibility in snapshot implementations - for
example, by implementing slightly different update() callbacks,
snapshots can be taken in situations where the user is only interested
in taking a snapshot when a new maximum in hit versus when a value
changes in any way at all. Future patches will demonstrate both
cases.
Link: http://lkml.kernel.org/r/1bea07828d5fd6864a585f83b1eed47ce097eb45.1550100284.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The action refactor code allowed actions and handlers to be separated,
but the existing onmax handler and save action code is still not
flexible enough to handle arbitrary coupling. This change generalizes
them and in the process makes additional handlers and actions easier
to implement.
The onmax action can be broken up and thought of as two separate
components - a variable to be tracked (the parameter given to the
onmax($var_to_track) function) and an invisible variable created to
save the ongoing result of doing something with that variable, such as
saving the max value of that variable so far seen.
Separating it out like this and renaming it appropriately allows us to
use the same code for similar tracking functions such as
onchange($var_to_track), which would just track the last value seen
rather than the max seen so far, which is useful in some situations.
Additionally, because different handlers and actions may want to save
and access data differently e.g. save and retrieve tracking values as
local variables vs something more global, save_val() and get_val()
interface functions are introduced and max-specific implementations
are used instead.
The same goes for the code that checks whether a maximum has been hit
- a generic check_val() interface and max-checking implementation is
used instead, which allows future patches to make use of he same code
using their own implemetations of similar functionality.
Link: http://lkml.kernel.org/r/980ea73dd8e3f36db3d646f99652f8fed42b77d4.1550100284.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, the onmatch action data binds the onmatch action to data
related to synthetic event generation. Since we want to allow the
onmatch handler to potentially invoke a different action, and because
we expect other handlers to generate synthetic events, we need to
separate the data related to these two functions.
Also rename the onmatch data to something more descriptive, and create
and use common action data destroy function.
Link: http://lkml.kernel.org/r/b9abbf9aae69fe3920cdc8ddbcaad544dd258d78.1550100284.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The hist trigger action code currently implements two essentially
hard-coded pairs of 'actions' - onmax(), which tracks a variable and
saves some event fields when a max is hit, and onmatch(), which is
hard-coded to generate a synthetic event.
These hardcoded pairs (track max/save fields and detect match/generate
synthetic event) should really be decoupled into separate components
that can then be arbitrarily combined. The first component of each
pair (track max/detect match) is called a 'handler' in the new code,
while the second component (save fields/generate synthetic event) is
called an 'action' in this scheme.
This change refactors the action code to reflect this split by adding
two handlers, HANDLER_ONMATCH and HANDLER_ONMAX, along with two
actions, ACTION_SAVE and ACTION_TRACE.
The new code combines them to produce the existing ONMATCH/TRACE and
ONMAX/SAVE functionality, but doesn't implement the other combinations
now possible. Future patches will expand these to further useful
cases, such as ONMAX/TRACE, as well as add additional handlers and
actions such as ONCHANGE and SNAPSHOT.
Also, add abbreviated documentation for handlers and actions to
README.
Link: http://lkml.kernel.org/r/98bfdd48c1b4ff29fc5766442f99f5bc3c34b76b.1550100284.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit d716ff71dd ("tracing: Remove taking of trace_types_lock in
pipe files") use the current tracer instead of the copy in
tracing_open_pipe(), but it forget to remove the freeing sentence in
the error path.
There's an error path that can call kfree(iter->trace) after the iter->trace
was assigned to tr->current_trace, which would be bad to free.
Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com
Cc: stable@vger.kernel.org
Fixes: d716ff71dd ("tracing: Remove taking of trace_types_lock in pipe files")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Have kprobes not use copy_from_user() to access kernel addresses,
because kprobes can legitimately poke at bad kernel memory, which
will fault. Copy from user code should never fault in kernel space.
Using probe_mem_read() can handle kernel address space faulting.
- Put back the entries counter in the tracing output that was accidentally
removed.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXGb7BxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqvaAQC66gQ79frSW7xPjJ4Y+qLIm0YDV18i
aCHowAXxDeK3qAEA3sDeELAPVupacPrzZc6zejI+bf0HArPe08n3vlHwAQw=
=uUQj
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.0-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two more tracing fixes
- Have kprobes not use copy_from_user() to access kernel addresses,
because kprobes can legitimately poke at bad kernel memory, which
will fault. Copy from user code should never fault in kernel space.
Using probe_mem_read() can handle kernel address space faulting.
- Put back the entries counter in the tracing output that was
accidentally removed"
* tag 'trace-v5.0-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix number of entries in trace header
kprobe: Do not use uaccess functions to access kernel memory that can fault
Enabling of large number of functions by echoing in a large subset of the
functions in available_filter_functions can take a very long time. The
process requires testing all functions registered by the function tracer
(which is in the 10s of thousands), and doing a kallsyms lookup to convert
the ip address into a name, then comparing that name with the string passed
in.
When a function causes the function tracer to crash the system, a binary
bisect of the available_filter_functions can be done to find the culprit.
But this requires passing in half of the functions in
available_filter_functions over and over again, which makes it basically a
O(n^2) operation. With 40,000 functions, that ends up bing 1,600,000,000
opertions! And enabling this can take over 20 minutes.
As a quick speed up, if a number is passed into one of the filter files,
instead of doing a search, it just enables the function at the corresponding
line of the available_filter_functions file. That is:
# echo 50 > set_ftrace_filter
# cat set_ftrace_filter
x86_pmu_commit_txn
# head -50 available_filter_functions | tail -1
x86_pmu_commit_txn
This allows setting of half the available_filter_functions to take place in
less than a second!
# time seq 20000 > set_ftrace_filter
real 0m0.042s
user 0m0.005s
sys 0m0.015s
# wc -l set_ftrace_filter
20000 set_ftrace_filter
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The following commit
441dae8f2f ("tracing: Add support for display of tgid in trace output")
removed the call to print_event_info() from print_func_help_header_irq()
which results in the ftrace header not reporting the number of entries
written in the buffer. As this wasn't the original intent of the patch,
re-introduce the call to print_event_info() to restore the orginal
behaviour.
Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com
Acked-by: Joel Fernandes <joelaf@google.com>
Cc: stable@vger.kernel.org
Fixes: 441dae8f2f ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
it processes the args but does not update the remaining length
of the buffer that the string arguments will be placed in. It
constantly passes in the total size of buffer used instead of
passing in the remaining size of the buffer used. This could cause
issues if the strings are larger than the max size of an event
which could cause the strings to be written beyond what was reserved
on the buffer.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXGN7BRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qoa1AQD7+6O0DncGwk5aWqRHESXKlmOWteW6
eMFbEw3KDcvs2gEAvNLB1i2yVH6Enn50M0KpmYJMbyZK/LVn2QsPZfU/LgQ=
=KMBZ
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"This fixes kprobes/uprobes dynamic processing of strings, where it
processes the args but does not update the remaining length of the
buffer that the string arguments will be placed in. It constantly
passes in the total size of buffer used instead of passing in the
remaining size of the buffer used.
This could cause issues if the strings are larger than the max size of
an event which could cause the strings to be written beyond what was
reserved on the buffer"
* tag 'trace-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: probeevent: Correctly update remaining space in dynamic area
Commit 9178412ddf ("tracing: probeevent: Return consumed
bytes of dynamic area") improved the string fetching
mechanism by returning the number of required bytes after
copying the argument to the dynamic area. However, this
return value is now only used to increment the pointer
inside the dynamic area but misses updating the 'maxlen'
variable which indicates the remaining space in the dynamic
area.
This means that fetch_store_string() always reads the *total*
size of the dynamic area from the data_loc pointer instead of
the *remaining* size (and passes it along to
strncpy_from_{user,unsafe}) even if we're already about to
copy data into the middle of the dynamic area.
Link: http://lkml.kernel.org/r/20190206190013.16405-1-andreas.ziegler@fau.de
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 9178412ddf ("tracing: probeevent: Return consumed bytes of dynamic area")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull networking fixes from David Miller:
"This pull request is dedicated to the upcoming snowpocalypse parts 2
and 3 in the Pacific Northwest:
1) Drop profiles are broken because some drivers use dev_kfree_skb*
instead of dev_consume_skb*, from Yang Wei.
2) Fix IWLWIFI kconfig deps, from Luca Coelho.
3) Fix percpu maps updating in bpftool, from Paolo Abeni.
4) Missing station release in batman-adv, from Felix Fietkau.
5) Fix some networking compat ioctl bugs, from Johannes Berg.
6) ucc_geth must reset the BQL queue state when stopping the device,
from Mathias Thore.
7) Several XDP bug fixes in virtio_net from Toshiaki Makita.
8) TSO packets must be sent always on queue 0 in stmmac, from Jose
Abreu.
9) Fix socket refcounting bug in RDS, from Eric Dumazet.
10) Handle sparse cpu allocations in bpf selftests, from Martynas
Pumputis.
11) Make sure mgmt frames have enough tailroom in mac80211, from Felix
Feitkau.
12) Use safe list walking in sctp_sendmsg() asoc list traversal, from
Greg Kroah-Hartman.
13) Make DCCP's ccid_hc_[rt]x_parse_options always check for NULL
ccid, from Eric Dumazet.
14) Need to reload WoL password into bcmsysport device after deep
sleeps, from Florian Fainelli.
15) Remove filter from mask before freeing in cls_flower, from Petr
Machata.
16) Missing release and use after free in error paths of s390 qeth
code, from Julian Wiedmann.
17) Fix lockdep false positive in dsa code, from Marc Zyngier.
18) Fix counting of ATU violations in mv88e6xxx, from Andrew Lunn.
19) Fix EQ firmware assert in qed driver, from Manish Chopra.
20) Don't default Caivum PTP to Y in kconfig, from Bjorn Helgaas"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
net: dsa: b53: Fix for failure when irq is not defined in dt
sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
geneve: should not call rt6_lookup() when ipv6 was disabled
net: Don't default Cavium PTP driver to 'y'
net: broadcom: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: via-velocity: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: tehuti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: sun: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: fsl_ucc_hdlc: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: fec_mpc52xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: smsc: epic100: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: dscc4: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: tulip: de2104x: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
net/mlx5e: Don't overwrite pedit action when multiple pedit used
net/mlx5e: Update hw flows when encap source mac changed
qed*: Advance drivers version to 8.37.0.20
qed: Change verbosity for coalescing message.
qede: Fix system crash on configuring channels.
qed: Consider TX tcs while deriving the max num_queues for PF.
...
Commit 6b7e633fe9 ("tracing: Remove extra zeroing out of the ring
buffer page") removed the only caller of ring_buffer_page_len(). The
function is now unused and may be removed.
Link: http://lkml.kernel.org/r/20181228133847.106177-1-mbenes@suse.cz
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This align the behavior of wakeup tracers with irqsoff latency tracer
that we record stacktrace at the beginning and end of waking up. The
stacktrace shows us what is happening in the kernel.
Link: http://lkml.kernel.org/r/20190116160249.7554-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add these info fields to funcgraph wakeup tracers:
o Show CPU info since the waker could be on a different CPU.
o Show function duration and overhead.
o Show IRQ markers.
Link: http://lkml.kernel.org/r/20190101154614.8887-3-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As the predicat_parse() code is rather complex, commenting subtleties is
important. The switch case statement should be commented to describe that it
is only looking for two '&' or '|' together, which is why the fall through
to an error is after the check.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).
This commit remove the following warning:
kernel/trace/trace_events_filter.c:494:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
Link: http://lkml.kernel.org/r/20190114203039.16535-2-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There is a plan to build the kernel with -Wimplicit-fallthrough and
this place in the code produced a warning (W=1).
This commit remove the following warning:
kernel/trace/trace_probe.c:302:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
Link: http://lkml.kernel.org/r/20190114203039.16535-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Al Viro pointed out that since there is only one pipe buffer type to which
new data can be appended, it isn't necessary to have a ->can_merge field in
struct pipe_buf_operations, we can just check for a magic type.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For the original mode of operation it isn't needed, since we report back
errors via PERF_RECORD_LOST records in the ring buffer, but for use in
bpf_perf_event_output() it is convenient to return the errors, basically
-ENOSPC.
Currently bpf_perf_event_output() returns an error indication, the last
thing it does, which is to push it to the ring buffer is that can fail
and if so, this failure won't be reported back to its users, fix it.
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/20190118150938.GN5823@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When printing multiple uprobe arguments as strings the output for the
earlier arguments would also include all later string arguments.
This is best explained in an example:
Consider adding a uprobe to a function receiving two strings as
parameters which is at offset 0xa0 in strlib.so and we want to print
both parameters when the uprobe is hit (on x86_64):
$ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
/sys/kernel/debug/tracing/uprobe_events
When the function is called as func("foo", "bar") and we hit the probe,
the trace file shows a line like the following:
[...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"
Note the extra "bar" printed as part of arg1. This behaviour stacks up
for additional string arguments.
The strings are stored in a dynamically growing part of the uprobe
buffer by fetch_store_string() after copying them from userspace via
strncpy_from_user(). The return value of strncpy_from_user() is then
directly used as the required size for the string. However, this does
not take the terminating null byte into account as the documentation
for strncpy_from_user() cleary states that it "[...] returns the
length of the string (not including the trailing NUL)" even though the
null byte will be copied to the destination.
Therefore, subsequent calls to fetch_store_string() will overwrite
the terminating null byte of the most recently fetched string with
the first character of the current string, leading to the
"accumulation" of strings in earlier arguments in the output.
Fix this by incrementing the return value of strncpy_from_user() by
one if we did not hit the maximum buffer size.
Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 5baaa59ef0 ("tracing/probes: Implement 'memory' fetch method for uprobes")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The subsystem-specific message prefix for uprobes was also
"trace_kprobe: " instead of "trace_uprobe: " as described in
the original commit message.
Link: http://lkml.kernel.org/r/20190117133023.19292-1-andreas.ziegler@fau.de
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 7257634135 ("tracing/probe: Show subsystem name in messages")
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Rework of the kprobe/uprobe and synthetic events to consolidate all
the dynamic event code. This will make changes in the future easier.
- Partial rewrite of the function graph tracing infrastructure.
This will allow for multiple users of hooking onto functions
to get the callback (return) of the function. This is the ground
work for having kprobes and function graph tracer using one code base.
- Clean up of the histogram code that will facilitate adding more
features to the histograms in the future.
- Addition of str_has_prefix() and a few use cases. There currently
is a similar function strstart() that is used in a few places, but
only returns a bool and not a length. These instances will be
removed in the future to use str_has_prefix() instead.
- A few other various clean ups as well.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXCawlBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhbcAQCFeT0fWWTUxofBQz5jqsHaRnVg21+9
X4sTldYRYEn4YgEAmWOyiwq7zvrsAu4ZwkNBMeqxn3tVymYHiGOGe3Y4BAw=
=u96o
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Rework of the kprobe/uprobe and synthetic events to consolidate all
the dynamic event code. This will make changes in the future easier.
- Partial rewrite of the function graph tracing infrastructure. This
will allow for multiple users of hooking onto functions to get the
callback (return) of the function. This is the ground work for having
kprobes and function graph tracer using one code base.
- Clean up of the histogram code that will facilitate adding more
features to the histograms in the future.
- Addition of str_has_prefix() and a few use cases. There currently is
a similar function strstart() that is used in a few places, but only
returns a bool and not a length. These instances will be removed in
the future to use str_has_prefix() instead.
- A few other various clean ups as well.
* tag 'trace-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (57 commits)
tracing: Use the return of str_has_prefix() to remove open coded numbers
tracing: Have the historgram use the result of str_has_prefix() for len of prefix
tracing: Use str_has_prefix() instead of using fixed sizes
tracing: Use str_has_prefix() helper for histogram code
string.h: Add str_has_prefix() helper function
tracing: Make function ‘ftrace_exports’ static
tracing: Simplify printf'ing in seq_print_sym
tracing: Avoid -Wformat-nonliteral warning
tracing: Merge seq_print_sym_short() and seq_print_sym_offset()
tracing: Add hist trigger comments for variable-related fields
tracing: Remove hist trigger synth_var_refs
tracing: Use hist trigger's var_ref array to destroy var_refs
tracing: Remove open-coding of hist trigger var_ref management
tracing: Use var_refs[] for hist trigger reference checking
tracing: Change strlen to sizeof for hist trigger static strings
tracing: Remove unnecessary hist trigger struct field
tracing: Fix ftrace_graph_get_ret_stack() to use task and not current
seq_buf: Use size_t for len in seq_buf_puts()
seq_buf: Make seq_buf_puts() null-terminate the buffer
arm64: Use ftrace_graph_get_ret_stack() instead of curr_ret_stack
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlwb7R8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjiID/97oDjMhNT7rwpuMbHw855h62j1hEN/m+N3
FI0uxivYoYZLD+eJRnMcBwHlKjrCX8iJQAcv9ffI3ThtFW7dnZT3atUacaZVR/Dt
IrxdymdBP3qsmuaId5NYBug7rJ+AiqFJKjEvCcSPu5X397J4I3SEbzhfvYLJ/aZX
16o0HJlVVIrcbmq1IP4HwiIIOaKXvPaw04L4z4fpeynRSWG7EAi8NLSnhlR4Rxbb
BTiMkCTsjRCFdyO6da4fvNQKWmPGPa3bJkYy3qR99cvJCeIbQjRyCloQlWNJRRgi
3eJpCHVxqFmN0/+DNTJVQEEr4H8o0AVucrLVct1Jc4pessenkpoUniP8vELqwlng
Z2VHLkhTfCEmvFlk82grrYdNvGATRsrbswt/PlP4T7rBfr1IpDk8kXDWF59EL2dy
ly35Sk3wJGHBl8qa+vEPXOAnaWdqJXuVGpwB4ifOIatOls8mOxwfZjiRc7x05/fC
1O4rR2IfLwRqwoYHs0AJ+h6ohOSn1mkGezl2Tch1VSFcJUOHmuYvraTaUi6hblpA
SslaAoEhO39hRBL0HsvsMeqVWM9uzqvFkLDCfNPdiA81H1258CIbo4vF8z6czCIS
eeXnTJxVhPVbZgb3a1a93SPwM6KIDZFoIijyd+NqjpU94thlnhYD0QEcKJIKH7os
2p4aHs6ktw==
=TRdW
-----END PGP SIGNATURE-----
Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"This is the main pull request for block/storage for 4.21.
Larger than usual, it was a busy round with lots of goodies queued up.
Most notable is the removal of the old IO stack, which has been a long
time coming. No new features for a while, everything coming in this
week has all been fixes for things that were previously merged.
This contains:
- Use atomic counters instead of semaphores for mtip32xx (Arnd)
- Cleanup of the mtip32xx request setup (Christoph)
- Fix for circular locking dependency in loop (Jan, Tetsuo)
- bcache (Coly, Guoju, Shenghui)
* Optimizations for writeback caching
* Various fixes and improvements
- nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
* host and target support for NVMe over TCP
* Error log page support
* Support for separate read/write/poll queues
* Much improved polling
* discard OOM fallback
* Tracepoint improvements
- lightnvm (Hans, Hua, Igor, Matias, Javier)
* Igor added packed metadata to pblk. Now drives without metadata
per LBA can be used as well.
* Fix from Geert on uninitialized value on chunk metadata reads.
* Fixes from Hans and Javier to pblk recovery and write path.
* Fix from Hua Su to fix a race condition in the pblk recovery
code.
* Scan optimization added to pblk recovery from Zhoujie.
* Small geometry cleanup from me.
- Conversion of the last few drivers that used the legacy path to
blk-mq (me)
- Removal of legacy IO path in SCSI (me, Christoph)
- Removal of legacy IO stack and schedulers (me)
- Support for much better polling, now without interrupts at all.
blk-mq adds support for multiple queue maps, which enables us to
have a map per type. This in turn enables nvme to have separate
completion queues for polling, which can then be interrupt-less.
Also means we're ready for async polled IO, which is hopefully
coming in the next release.
- Killing of (now) unused block exports (Christoph)
- Unification of the blk-rq-qos and blk-wbt wait handling (Josef)
- Support for zoned testing with null_blk (Masato)
- sx8 conversion to per-host tag sets (Christoph)
- IO priority improvements (Damien)
- mq-deadline zoned fix (Damien)
- Ref count blkcg series (Dennis)
- Lots of blk-mq improvements and speedups (me)
- sbitmap scalability improvements (me)
- Make core inflight IO accounting per-cpu (Mikulas)
- Export timeout setting in sysfs (Weiping)
- Cleanup the direct issue path (Jianchao)
- Export blk-wbt internals in block debugfs for easier debugging
(Ming)
- Lots of other fixes and improvements"
* tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
kyber: use sbitmap add_wait_queue/list_del wait helpers
sbitmap: add helpers for add/del wait queue handling
block: save irq state in blkg_lookup_create()
dm: don't reuse bio for flushes
nvme-pci: trace SQ status on completions
nvme-rdma: implement polling queue map
nvme-fabrics: allow user to pass in nr_poll_queues
nvme-fabrics: allow nvmf_connect_io_queue to poll
nvme-core: optionally poll sync commands
block: make request_to_qc_t public
nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
nvme-tcp: fix endianess annotations
nvmet-tcp: fix endianess annotations
nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
nvme-pci: only set nr_maps to 2 if poll queues are supported
nvmet: use a macro for default error location
nvmet: fix comparison of a u16 with -1
blk-mq: enable IO poll if .nr_queues of type poll > 0
blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
blk-mq: skip zero-queue maps in blk_mq_map_swqueue
...
Pull networking updates from David Miller:
1) New ipset extensions for matching on destination MAC addresses, from
Stefano Brivio.
2) Add ipv4 ttl and tos, plus ipv6 flow label and hop limit offloads to
nfp driver. From Stefano Brivio.
3) Implement GRO for plain UDP sockets, from Paolo Abeni.
4) Lots of work from Michał Mirosław to eliminate the VLAN_TAG_PRESENT
bit so that we could support the entire vlan_tci value.
5) Rework the IPSEC policy lookups to better optimize more usecases,
from Florian Westphal.
6) Infrastructure changes eliminating direct manipulation of SKB lists
wherever possible, and to always use the appropriate SKB list
helpers. This work is still ongoing...
7) Lots of PHY driver and state machine improvements and
simplifications, from Heiner Kallweit.
8) Various TSO deferral refinements, from Eric Dumazet.
9) Add ntuple filter support to aquantia driver, from Dmitry Bogdanov.
10) Batch dropping of XDP packets in tuntap, from Jason Wang.
11) Lots of cleanups and improvements to the r8169 driver from Heiner
Kallweit, including support for ->xmit_more. This driver has been
getting some much needed love since he started working on it.
12) Lots of new forwarding selftests from Petr Machata.
13) Enable VXLAN learning in mlxsw driver, from Ido Schimmel.
14) Packed ring support for virtio, from Tiwei Bie.
15) Add new Aquantia AQtion USB driver, from Dmitry Bezrukov.
16) Add XDP support to dpaa2-eth driver, from Ioana Ciocoi Radulescu.
17) Implement coalescing on TCP backlog queue, from Eric Dumazet.
18) Implement carrier change in tun driver, from Nicolas Dichtel.
19) Support msg_zerocopy in UDP, from Willem de Bruijn.
20) Significantly improve garbage collection of neighbor objects when
the table has many PERMANENT entries, from David Ahern.
21) Remove egdev usage from nfp and mlx5, and remove the facility
completely from the tree as it no longer has any users. From Oz
Shlomo and others.
22) Add a NETDEV_PRE_CHANGEADDR so that drivers can veto the change and
therefore abort the operation before the commit phase (which is the
NETDEV_CHANGEADDR event). From Petr Machata.
23) Add indirect call wrappers to avoid retpoline overhead, and use them
in the GRO code paths. From Paolo Abeni.
24) Add support for netlink FDB get operations, from Roopa Prabhu.
25) Support bloom filter in mlxsw driver, from Nir Dotan.
26) Add SKB extension infrastructure. This consolidates the handling of
the auxiliary SKB data used by IPSEC and bridge netfilter, and is
designed to support the needs to MPTCP which could be integrated in
the future.
27) Lots of XDP TX optimizations in mlx5 from Tariq Toukan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1845 commits)
net: dccp: fix kernel crash on module load
drivers/net: appletalk/cops: remove redundant if statement and mask
bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
net/net_namespace: Check the return value of register_pernet_subsys()
net/netlink_compat: Fix a missing check of nla_parse_nested
ieee802154: lowpan_header_create check must check daddr
net/mlx4_core: drop useless LIST_HEAD
mlxsw: spectrum: drop useless LIST_HEAD
net/mlx5e: drop useless LIST_HEAD
iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
net/mlx5e: fix semicolon.cocci warnings
staging: octeon: fix build failure with XFRM enabled
net: Revert recent Spectre-v1 patches.
can: af_can: Fix Spectre v1 vulnerability
packet: validate address length if non-zero
nfc: af_nfc: Fix Spectre v1 vulnerability
phonet: af_phonet: Fix Spectre v1 vulnerability
net: core: Fix Spectre v1 vulnerability
net: minor cleanup in skb_ext_add()
net: drop the unused helper skb_ext_get()
...
Pull RCU updates from Ingo Molnar:
"The biggest RCU changes in this cycle were:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
- Replace calls of RCU-bh and RCU-sched update-side functions to
their vanilla RCU counterparts. This series is a step towards
complete removal of the RCU-bh and RCU-sched update-side functions.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.
- Miscellaneous fixes.
- Automate generation of the initrd filesystem used for rcutorture
testing.
- Convert spin_is_locked() assertions to instead use lockdep.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- SRCU updates, especially including a fix from Dennis Krein for a
bag-on-head-class bug.
- RCU torture-test updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
rcutorture: Don't do busted forward-progress testing
rcutorture: Use 100ms buckets for forward-progress callback histograms
rcutorture: Recover from OOM during forward-progress tests
rcutorture: Print forward-progress test age upon failure
rcutorture: Print time since GP end upon forward-progress failure
rcutorture: Print histogram of CB invocation at OOM time
rcutorture: Print GP age upon forward-progress failure
rcu: Print per-CPU callback counts for forward-progress failures
rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
rcutorture: Dump grace-period diagnostics upon forward-progress OOM
rcutorture: Prepare for asynchronous access to rcu_fwd_startat
torture: Remove unnecessary "ret" variables
rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
rcutorture: Break up too-long rcu_torture_fwd_prog() function
rcutorture: Remove cbflood facility
torture: Bring any extra CPUs online during kernel startup
rcutorture: Add call_rcu() flooding forward-progress tests
rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
...
There are several locations that compare constants to the beginning of
string variables to determine what commands should be done, then the
constant length is used to index into the string. This is error prone as the
hard coded numbers have to match the size of the constants. Instead, use the
len returned from str_has_prefix() and remove the open coded string length
sizes.
Cc: Joe Perches <joe@perches.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org> (for trace_probe part)
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As str_has_prefix() returns the length on match, we can use that for the
updating of the string pointer instead of recalculating the prefix size.
Cc: Tom Zanussi <zanussi@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There are several instances of strncmp(str, "const", 123), where 123 is the
strlen of the const string to check if "const" is the prefix of str. But
this can be error prone. Use str_has_prefix() instead.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The tracing histogram code contains a lot of instances of the construct:
strncmp(str, "const", sizeof("const") - 1)
This can be prone to bugs due to typos or bad cut and paste. Use the
str_has_prefix() helper macro instead that removes the need for having two
copies of the constant string.
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In commit 478409dd68 ("tracing: Add hook to function tracing for other
subsystems to use"), a new function ‘ftrace_exports’ was added. Since
this function can be made static, make it so.
Silence the following warning triggered using W=1:
kernel/trace/trace.c:2451:6: warning: no previous prototype for ‘ftrace_exports’ [-Wmissing-prototypes]
Link: http://lkml.kernel.org/r/20180516193012.25390-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
trace_seq_printf(..., "%s", ...) can be done with trace_seq_puts()
instead, avoiding printf overhead. In the second instance, the string
we're copying was just created from an snprintf() to a stack buffer, so
we might as well do that printf directly. This naturally leads to moving
the declaration of the str buffer inside the CONFIG_KALLSYMS guard,
which in turn will make gcc inline the function for !CONFIG_KALLSYMS (it
only has a single caller, but the huge stack frame seems to make gcc not
inline it for CONFIG_KALLSYMS).
Link: http://lkml.kernel.org/r/20181029223542.26175-4-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Building with -Wformat-nonliteral, gcc complains
kernel/trace/trace_output.c: In function ‘seq_print_sym’:
kernel/trace/trace_output.c:356:3: warning: format not a string literal, argument types not checked [-Wformat-nonliteral]
trace_seq_printf(s, fmt, name);
But seq_print_sym only has a single caller which passes "%s" as fmt, so
we might as well just use that directly. That also paves the way for
further cleanups that will actually make that format string go away
entirely.
Link: http://lkml.kernel.org/r/20181029223542.26175-3-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
These two functions are nearly identical, so we can avoid some code
duplication by moving the conditional into a common implementation.
Link: http://lkml.kernel.org/r/20181029223542.26175-2-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
All var_refs are now handled uniformly and there's no reason to treat
the synth_refs in a special way now, so remove them and associated
functions.
Link: http://lkml.kernel.org/r/b4d3470526b8f0426dcec125399dad9ad9b8589d.1545161087.git.tom.zanussi@linux.intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since every var ref for a trigger has an entry in the var_ref[] array,
use that to destroy the var_refs, instead of piecemeal via the field
expressions.
This allows us to avoid having to keep and treat differently separate
lists for the action-related references, which future patches will
remove.
Link: http://lkml.kernel.org/r/fad1a164f0e257c158e70d6eadbf6c586e04b2a2.1545161087.git.tom.zanussi@linux.intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Have create_var_ref() manage the hist trigger's var_ref list, rather
than having similar code doing it in multiple places. This cleans up
the code and makes sure var_refs are always accounted properly.
Also, document the var_ref-related functions to make what their
purpose clearer.
Link: http://lkml.kernel.org/r/05ddae93ff514e66fc03897d6665231892939913.1545161087.git.tom.zanussi@linux.intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since all the variable reference hist_fields are collected into
hist_data->var_refs[] array, there's no need to go through all the
fields looking for them, or in separate arrays like synth_var_refs[],
which will be going away soon anyway.
This also allows us to get rid of some unnecessary code and functions
currently used for the same purpose.
Link: http://lkml.kernel.org/r/1545246556.4239.7.camel@gmail.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There's no need to use strlen() for static strings when the length is
already known, so update trace_events_hist.c with sizeof() for those
cases.
Link: http://lkml.kernel.org/r/e3e754f2bd18e56eaa8baf79bee619316ebf4cfc.1545161087.git.tom.zanussi@linux.intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The function ftrace_graph_get_ret_stack() takes a task struct descriptor but
uses current as the task to perform the operations on. In pretty much all
cases the task decriptor is the same as current, so this wasn't an issue.
But there is a case in the ARM architecture that passes in a task that is
not current, and expects a result from that task, and this code breaks it.
Fixes: 51584396cff5 ("arm64: Use ftrace_graph_get_ret_stack() instead of curr_ret_stack")
Reported-by: James Morse <james.morse@arm.com>
Tested-by: James Morse <james.morse@arm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-12-21
The following pull-request contains BPF updates for your *net-next* tree.
There is a merge conflict in test_verifier.c. Result looks as follows:
[...]
},
{
"calls: cross frame pruning",
.insns = {
[...]
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.errstr_unpriv = "function calls to other bpf functions are allowed for root only",
.result_unpriv = REJECT,
.errstr = "!read_ok",
.result = REJECT,
},
{
"jset: functional",
.insns = {
[...]
{
"jset: unknown const compare not taken",
.insns = {
BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
BPF_FUNC_get_prandom_u32),
BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1),
BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0),
BPF_EXIT_INSN(),
},
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.errstr_unpriv = "!read_ok",
.result_unpriv = REJECT,
.errstr = "!read_ok",
.result = REJECT,
},
[...]
{
"jset: range",
.insns = {
[...]
},
.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
.result_unpriv = ACCEPT,
.result = ACCEPT,
},
The main changes are:
1) Various BTF related improvements in order to get line info
working. Meaning, verifier will now annotate the corresponding
BPF C code to the error log, from Martin and Yonghong.
2) Implement support for raw BPF tracepoints in modules, from Matt.
3) Add several improvements to verifier state logic, namely speeding
up stacksafe check, optimizations for stack state equivalence
test and safety checks for liveness analysis, from Alexei.
4) Teach verifier to make use of BPF_JSET instruction, add several
test cases to kselftests and remove nfp specific JSET optimization
now that verifier has awareness, from Jakub.
5) Improve BPF verifier's slot_type marking logic in order to
allow more stack slot sharing, from Jiong.
6) Add sk_msg->size member for context access and add set of fixes
and improvements to make sock_map with kTLS usable with openssl
based applications, from John.
7) Several cleanups and documentation updates in bpftool as well as
auto-mount of tracefs for "bpftool prog tracelog" command,
from Quentin.
8) Include sub-program tags from now on in bpf_prog_info in order to
have a reliable way for user space to get all tags of the program
e.g. needed for kallsyms correlation, from Song.
9) Add BTF annotations for cgroup_local_storage BPF maps and
implement bpf fs pretty print support, from Roman.
10) Fix bpftool in order to allow for cross-compilation, from Ivan.
11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order
to be compatible with libbfd and allow for Debian packaging,
from Jakub.
12) Remove an obsolete prog->aux sanitation in dump and get rid of
version check for prog load, from Daniel.
13) Fix a memory leak in libbpf's line info handling, from Prashant.
14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info
does not get unaligned, from Jesper.
15) Fix test_progs kselftest to work with older compilers which are less
smart in optimizing (and thus throwing build error), from Stanislav.
16) Cleanup and simplify AF_XDP socket teardown, from Björn.
17) Fix sk lookup in BPF kselftest's test_sock_addr with regards
to netns_id argument, from Andrey.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Distributions build drivers as modules, including network and filesystem
drivers which export numerous tracepoints. This enables
bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
detector found some allocations that were not freed correctly.
This fixes a couple of leaks in the event trigger code as well as
in adding function trace filters in trace instances.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXBAHphQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qphzAP4mTz45V9gq9vyXCVPPzg8T6lV4ZjJh
bPaumlHGumaJHAD9FipqlhCOCVfv8Qyxv5iWuBpoGKcp37ULb6d+dtM+qg4=
=S1FK
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"While running various ftrace tests on new development code, the
kmemleak detector found some allocations that were not freed
correctly.
This fixes a couple of leaks in the event trigger code as well as in
adding function trace filters in trace instances"
* tag 'trace-v4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix memory leak of instance function hash filters
tracing: Fix memory leak in set_trigger_filter()
tracing: Fix memory leak in create_filter()
The following commands will cause a memory leak:
# cd /sys/kernel/tracing
# mkdir instances/foo
# echo schedule > instance/foo/set_ftrace_filter
# rmdir instances/foo
The reason is that the hashes that hold the filters to set_ftrace_filter and
set_ftrace_notrace are not freed if they contain any data on the instance
and the instance is removed.
Found by kmemleak detector.
Cc: stable@vger.kernel.org
Fixes: 591dffdade ("ftrace: Allow for function tracing instance to filter functions")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When create_event_filter() fails in set_trigger_filter(), the filter may
still be allocated and needs to be freed. The caller expects the
data->filter to be updated with the new filter, even if the new filter
failed (we could add an error message by setting set_str parameter of
create_event_filter(), but that's another update).
But because the error would just exit, filter was left hanging and
nothing could free it.
Found by kmemleak detector.
Cc: stable@vger.kernel.org
Fixes: bac5fb97a1 ("tracing: Add and use generic set_trigger_filter() implementation")
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The create_filter() calls create_filter_start() which allocates a
"parse_error" descriptor, but fails to call create_filter_finish() that
frees it.
The op_stack and inverts in predicate_parse() were also not freed.
Found by kmemleak detector.
Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ret_stack should not be accessed directly via the curr_ret_stack
variable on the task_struct. This is because the ret_stack is going to be
converted into a series of longs and not an array of ret_stack structures.
The way that a ret_stack should be retrieved is via the
ftrace_graph_get_ret_stack structure, but it needs to be documented on how
to use it.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The function ftrace_replace_code() is the ftrace engine that does the
work to modify all the nops into the calls to the function callback in
all the functions being traced.
The generic version which is normally called from stop machine, but an
architecture can implement a non stop machine version and still use the
generic ftrace_replace_code(). When an architecture does this,
ftrace_replace_code() may be called from a schedulable context, where
it can allow the code to be preemptible, and schedule out.
In order to allow an architecture to make ftrace_replace_code()
schedulable, a new command flag is added called:
FTRACE_MAY_SLEEP
Which can be or'd to the command that is passed to
ftrace_modify_all_code() that calls ftrace_replace_code() and will have
it call cond_resched() in the loop that modifies the nops into the
calls to the ftrace trampolines.
Link: http://lkml.kernel.org/r/20181204192903.8193-1-anders.roxell@linaro.org
Link: http://lkml.kernel.org/r/20181205183303.828422192@goodmis.org
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a generic method to remove event from dynamic event
list. This is same as other system under ftrace. You
just need to pass the event name with '!', e.g.
# echo p:new_grp/new_event _do_fork > dynamic_events
This creates an event, and
# echo '!p:new_grp/new_event _do_fork' > dynamic_events
Or,
# echo '!p:new_grp/new_event' > dynamic_events
will remove new_grp/new_event event.
Note that this doesn't check the event prefix (e.g. "p:")
strictly, because the "group/event" name must be unique.
Link: http://lkml.kernel.org/r/154140869774.17322.8887303560398645347.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The trace_add/remove_event_call_nolock() functions were added to allow
the tace_add/remove_event_call() code be called when the event_mutex
lock was already taken. Now that all callers are done within the
event_mutex, there's no reason to have two different interfaces.
Remove the current wrapper trace_add/remove_event_call()s and rename the
_nolock versions back to the original names.
Link: http://lkml.kernel.org/r/154140866955.17322.2081425494660638846.stgit@devbox
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Rmove unneeded synth_event_mutex. This mutex protects the reference
count in synth_event, however, those operational points are already
protected by event_mutex.
1. In __create_synth_event() and create_or_delete_synth_event(),
those synth_event_mutex clearly obtained right after event_mutex.
2. event_hist_trigger_func() is trigger_hist_cmd.func() which is
called by trigger_process_regex(), which is a part of
event_trigger_regex_write() and this function takes event_mutex.
3. hist_unreg_all() is trigger_hist_cmd.unreg_all() which is called
by event_trigger_regex_open() and it takes event_mutex.
4. onmatch_destroy() and onmatch_create() have long call tree,
but both are finally invoked from event_trigger_regex_write()
and event_trace_del_tracer(), former takes event_mutex, and latter
ensures called under event_mutex locked.
Finally, I ensured there is no resource conflict. For safety,
I added lockdep_assert_held(&event_mutex) for each function.
Link: http://lkml.kernel.org/r/154140864134.17322.4796059721306031894.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Use dyn_event framework for synthetic events. This shows
synthetic events on "tracing/dynamic_events" file in addition
to tracing/synthetic_events interface.
User can also define new events via tracing/dynamic_events
with "s:" prefix. So, the new syntax is below;
s:[synthetic/]EVENT_NAME TYPE ARG; [TYPE ARG;]...
To remove events via tracing/dynamic_events, you can use
"-:" prefix as same as other events.
Link: http://lkml.kernel.org/r/154140861301.17322.15454611233735614508.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Use dyn_event framework for uprobe events. This shows
uprobe events on "dynamic_events" file.
User can also define new uprobe events via dynamic_events.
Link: http://lkml.kernel.org/r/154140858481.17322.9091293846515154065.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Use dyn_event framework for kprobe events. This shows
kprobe events on "tracing/dynamic_events" file.
User can also define new events via tracing/dynamic_events.
Link: http://lkml.kernel.org/r/154140855646.17322.6619219995865980392.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add unified dynamic event framework for ftrace kprobes, uprobes
and synthetic events. Those dynamic events can be co-exist on
same file because those syntax doesn't overlap.
This introduces a framework part which provides a unified tracefs
interface and operations.
Link: http://lkml.kernel.org/r/154140852824.17322.12250362185969352095.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Integrate similar argument parsers for kprobes and uprobes events
into traceprobe_parse_probe_arg().
Link: http://lkml.kernel.org/r/154140850016.17322.9836787731210512176.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since the event_mutex and synth_event_mutex ordering issue
is gone, we can skip existing event check when adding or
deleting events, and some redundant code in error path.
This changes release_all_synth_events() to abort the process
when it hits any error and returns the error code. It succeeds
only if it has no error.
Link: http://lkml.kernel.org/r/154140847194.17322.17960275728005067803.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
synthetic event is using synth_event_mutex for protecting
synth_event_list, and event_trigger_write() path acquires
locks as below order.
event_trigger_write(event_mutex)
->trigger_process_regex(trigger_cmd_mutex)
->event_hist_trigger_func(synth_event_mutex)
On the other hand, synthetic event creation and deletion paths
call trace_add_event_call() and trace_remove_event_call()
which acquires event_mutex. In that case, if we keep the
synth_event_mutex locked while registering/unregistering synthetic
events, its dependency will be inversed.
To avoid this issue, current synthetic event is using a 2 phase
process to create/delete events. For example, it searches existing
events under synth_event_mutex to check for event-name conflicts, and
unlocks synth_event_mutex, then registers a new event under event_mutex
locked. Finally, it locks synth_event_mutex and tries to add the
new event to the list. But it can introduce complexity and a chance
for name conflicts.
To solve this simpler, this introduces trace_add_event_call_nolock()
and trace_remove_event_call_nolock() which don't acquire
event_mutex inside. synthetic event can lock event_mutex before
synth_event_mutex to solve the lock dependency issue simpler.
Link: http://lkml.kernel.org/r/154140844377.17322.13781091165954002713.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a busy check loop in cleanup_all_probes() before
trying to remove all events in uprobe_events, the same way
that kprobe_events does.
Without this change, writing null to uprobe_events will
try to remove events but if one of them is enabled, it will
stop there leaving some events cleared and others not clceared.
With this change, writing null to uprobe_events makes
sure all events are not enabled before removing events.
So, it clears all events, or returns an error (-EBUSY)
with keeping all events.
Link: http://lkml.kernel.org/r/154140841557.17322.12653952888762532401.stgit@devbox
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
After running several tests, it appears that having the reader wait till
half the buffer is full before starting to read (and causing its own events
to fill up the ring buffer constantly), works well. It keeps trace-cmd (the
main user of this interface) from dominating the traces it records.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add a "buffer_percentage" file, that allows users to specify how much of the
buffer (percentage of pages) need to be filled before waking up a task
blocked on a per cpu trace_pipe_raw file.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of just waiting for a page to be full before waking up a pending
reader, allow the reader to pass in a "percentage" of pages that have
content before waking up a reader. This should help keep the process of
reading the events not cause wake ups that constantly cause reading of the
buffer.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Dan Carpenter reviewed the trace_stack.c code and figured he found an off by
one bug.
"From reviewing the code, it seems possible for
stack_trace_max.nr_entries to be set to .max_entries and in that case we
would be reading one element beyond the end of the stack_dump_trace[]
array. If it's not set to .max_entries then the bug doesn't affect
runtime."
Although it looks to be the case, it is not. Because we have:
static unsigned long stack_dump_trace[STACK_TRACE_ENTRIES+1] =
{ [0 ... (STACK_TRACE_ENTRIES)] = ULONG_MAX };
struct stack_trace stack_trace_max = {
.max_entries = STACK_TRACE_ENTRIES - 1,
.entries = &stack_dump_trace[0],
};
And:
stack_trace_max.nr_entries = x;
for (; x < i; x++)
stack_dump_trace[x] = ULONG_MAX;
Even if nr_entries equals max_entries, indexing with it into the
stack_dump_trace[] array will not overflow the array. But if it is the case,
the second part of the conditional that tests stack_dump_trace[nr_entries]
to ULONG_MAX will always be true.
By applying Dan's patch, it removes the subtle aspect of it and makes the if
conditional slightly more efficient.
Link: http://lkml.kernel.org/r/20180620110758.crunhd5bfep7zuiz@kili.mountain
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ret_stack processing is going to change, and that is going
to break anything that is accessing the ret_stack directly. One user is the
function graph profiler. By using the ftrace_graph_get_ret_stack() helper
function, the profiler can access the ret_stack entry without relying on the
implementation details of the stack itself.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Move the function function_graph_ret_addr() to fgraph.c, as the management
of the curr_ret_stack is going to change, and all the accesses to ret_stack
needs to be done in fgraph.c.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently the registering of function graph is to pass in a entry and return
function. We need to have a way to associate those functions together where
the entry can determine to run the return hook. Having a structure that
contains both functions will facilitate the process of converting the code
to be able to do such.
This is similar to the way function hooks are enabled (it passes in
ftrace_ops). Instead of passing in the functions to use, a single structure
is passed in to the registering function.
The unregister function is now passed in the fgraph_ops handle. When we
allow more than one callback to the function graph hooks, this will let the
system know which one to remove.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Rearrange the functions in trace_sched_wakeup.c so that there are fewer
#ifdef CONFIG_FUNCTION_TRACER and #ifdef CONFIG_FUNCTION_GRAPH_TRACER,
instead of having the #ifdefs spread all over.
No functional change is made.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
To make the function graph infrastructure more managable, the code needs to
be in its own file (fgraph.c). Move the code that is specific for managing
the function graph infrastructure out of ftrace.c and into fgraph.c
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When the function profiler is not configured, the "graph_time" option is
meaningless, as the function profiler is the only thing that makes use of
it. Do not expose it if the profiler is not configured.
Link: http://lkml.kernel.org/r/20181123061133.GA195223@google.com
Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In order to move function graph infrastructure into its own file (fgraph.h)
it needs to access various functions and variables in ftrace.c that are
currently static. Create a new file called ftrace-internal.h that holds the
function prototypes and the extern declarations of the variables needed by
fgraph.c as well, and make them global in ftrace.c such that they can be
used outside that file.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The curr_ret_stack is no longer set to a negative value when a function is
not to be traced by the function graph tracer. Remove the usage of
FTRACE_NOTRACE_DEPTH, as it is no longer needed.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Prior patches ensured that any bio that interacts with a request_queue
is properly associated with a blkg. This makes bio->bi_css unnecessary
as blkg maintains a reference to blkcg already.
This removes the bio field bi_css and transfers corresponding uses to
access via bi_blkg.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull RCU changes from Paul E. McKenney:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
- Replace calls of RCU-bh and RCU-sched update-side functions
to their vanilla RCU counterparts. This series is a step
towards complete removal of the RCU-bh and RCU-sched update-side
functions.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.
- Miscellaneous fixes.
- Automate generation of the initrd filesystem used for
rcutorture testing.
- Convert spin_is_locked() assertions to instead use lockdep.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- SRCU updates, especially including a fix from Dennis Krein
for a bag-on-head-class bug.
- RCU torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Change idx variable in DO_TRACE macro to __idx to avoid name conflicts.
A kvm event had "idx" as a parameter and it confused the macro.
- Fix a race where interrupts would be traced when set_graph_function was set.
The previous patch set increased a race window that tricked the function graph
tracer to think it should trace interrupts when it really should not have.
The bug has been there before, but was seldom hit. Only the last patch series
made it more common.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXACrFhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgUuAP9zZ0+PjDIXrkZATogEsMJ1OZKy5AiK
mwT7S85/ouOeCQEAi5JUcSfwB0caq2nYB1GKOKNfDH0ffeVR4wNykcYjngM=
=gjwC
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing fixes from Steven Rostedt:
"Two more fixes:
- Change idx variable in DO_TRACE macro to __idx to avoid name
conflicts. A kvm event had "idx" as a parameter and it confused the
macro.
- Fix a race where interrupts would be traced when set_graph_function
was set. The previous patch set increased a race window that
tricked the function graph tracer to think it should trace
interrupts when it really should not have.
The bug has been there before, but was seldom hit. Only the last
patch series made it more common"
* tag 'trace-v4.20-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/fgraph: Fix set_graph_function from showing interrupts
tracepoint: Use __idx instead of idx in DO_TRACE macro to make it unique
was introduced by a patch that tried to fix one bug, but by doing so created
another bug. As both bugs corrupt the output (but they do not crash the
kernel), I decided to fix the design such that it could have both bugs
fixed. The original fix, fixed time reporting of the function graph tracer
when doing a max_depth of one. This was code that can test how much the
kernel interferes with userspace. But in doing so, it could corrupt the time
keeping of the function profiler.
The issue is that the curr_ret_stack variable was being used for two
different meanings. One was to keep track of the stack pointer on the
ret_stack (shadow stack used by the function graph tracer), and the other
use case was the graph call depth. Although, the two may be closely
related, where they got updated was the issue that lead to the two different
bugs that required the two use cases to be updated differently.
The big issue with this fix is that it requires changing each architecture.
The good news is, I was able to remove a lot of code that was duplicated
within the architectures and place it into a single location. Then I could
make the fix in one place.
I pushed this code into linux-next to let it settle over a week, and before
doing so, I cross compiled all the affected architectures to make sure that
they built fine.
In the mean time, I also pulled in a patch that fixes the sched_switch
previous tasks state output, that was not actually correct.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW/4NPhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnWAAQCyUIRLgYImr81eTl52lxNRsULk+aiI
U29kRFWWU0c40AEA1X9sDF0MgOItbRGfZtnHTZEousXRDaDf4Fge2kF7Egg=
=liQ0
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"While rewriting the function graph tracer, I discovered a design flaw
that was introduced by a patch that tried to fix one bug, but by doing
so created another bug.
As both bugs corrupt the output (but they do not crash the kernel), I
decided to fix the design such that it could have both bugs fixed. The
original fix, fixed time reporting of the function graph tracer when
doing a max_depth of one. This was code that can test how much the
kernel interferes with userspace. But in doing so, it could corrupt
the time keeping of the function profiler.
The issue is that the curr_ret_stack variable was being used for two
different meanings. One was to keep track of the stack pointer on the
ret_stack (shadow stack used by the function graph tracer), and the
other use case was the graph call depth. Although, the two may be
closely related, where they got updated was the issue that lead to the
two different bugs that required the two use cases to be updated
differently.
The big issue with this fix is that it requires changing each
architecture. The good news is, I was able to remove a lot of code
that was duplicated within the architectures and place it into a
single location. Then I could make the fix in one place.
I pushed this code into linux-next to let it settle over a week, and
before doing so, I cross compiled all the affected architectures to
make sure that they built fine.
In the mean time, I also pulled in a patch that fixes the sched_switch
previous tasks state output, that was not actually correct"
* tag 'trace-v4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
sched, trace: Fix prev_state output in sched_switch tracepoint
function_graph: Have profiler use curr_ret_stack and not depth
function_graph: Reverse the order of pushing the ret_stack and the callback
function_graph: Move return callback before update of curr_ret_stack
function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack
function_graph: Make ftrace_push_return_trace() static
sparc/function_graph: Simplify with function_graph_enter()
sh/function_graph: Simplify with function_graph_enter()
s390/function_graph: Simplify with function_graph_enter()
riscv/function_graph: Simplify with function_graph_enter()
powerpc/function_graph: Simplify with function_graph_enter()
parisc: function_graph: Simplify with function_graph_enter()
nds32: function_graph: Simplify with function_graph_enter()
MIPS: function_graph: Simplify with function_graph_enter()
microblaze: function_graph: Simplify with function_graph_enter()
arm64: function_graph: Simplify with function_graph_enter()
ARM: function_graph: Simplify with function_graph_enter()
x86/function_graph: Simplify with function_graph_enter()
function_graph: Create function_graph_enter() to consolidate architecture code
In order to make the function graph infrastructure more generic, there can
not be code specific for the function_graph tracer in the generic code. This
includes the set_graph_notrace logic, that stops all graph calls when a
function in the set_graph_notrace is hit.
By using the trace_recursion mask, we can use a bit in the current
task_struct to implement the notrace code, and move the logic out of
fgraph.c and into trace_functions_graph.c and keeps it affecting only the
tracer and not all call graph callbacks.
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As the function graph infrastructure can be used by thing other than
tracing, moving the code to its own file out of the trace_functions_graph.c
code makes more sense.
The fgraph.c file will only contain the infrastructure required to hook into
functions and their return code.
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 588ca1786f2dd ("function_graph: Use new curr_ret_depth to manage
depth instead of curr_ret_stack") removed a parameter from the call
ftrace_push_return_trace() that made it so that the entire call was under 80
characters, but it did not remove the line break. There's no reason to break
that line up, so make it a single line.
Link: http://lkml.kernel.org/r/20181122100322.GN2131@hirez.programming.kicks-ass.net
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The tracefs file set_graph_function is used to only function graph functions
that are listed in that file (or all functions if the file is empty). The
way this is implemented is that the function graph tracer looks at every
function, and if the current depth is zero and the function matches
something in the file then it will trace that function. When other functions
are called, the depth will be greater than zero (because the original
function will be at depth zero), and all functions will be traced where the
depth is greater than zero.
The issue is that when a function is first entered, and the handler that
checks this logic is called, the depth is set to zero. If an interrupt comes
in and a function in the interrupt handler is traced, its depth will be
greater than zero and it will automatically be traced, even if the original
function was not. But because the logic only looks at depth it may trace
interrupts when it should not be.
The recent design change of the function graph tracer to fix other bugs
caused the depth to be zero while the function graph callback handler is
being called for a longer time, widening the race of this happening. This
bug was actually there for a longer time, but because the race window was so
small it seldom happened. The Fixes tag below is for the commit that widen
the race window, because that commit belongs to a series that will also help
fix the original bug.
Cc: stable@kernel.org
Fixes: 39eb456dac ("function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack")
Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The profiler uses trace->depth to find its entry on the ret_stack, but the
depth may not match the actual location of where its entry is (if an
interrupt were to preempt the processing of the profiler for another
function, the depth and the curr_ret_stack will be different).
Have it use the curr_ret_stack as the index to find its ret_stack entry
instead of using the depth variable, as that is no longer guaranteed to be
the same.
Cc: stable@kernel.org
Fixes: 03274a3ffb ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The function graph profiler uses the ret_stack to store the "subtime" and
reuse it by nested functions and also on the return. But the current logic
has the profiler callback called before the ret_stack is updated, and it is
just modifying the ret_stack that will later be allocated (it's just lucky
that the "subtime" is not touched when it is allocated).
This could also cause a crash if we are at the end of the ret_stack when
this happens.
By reversing the order of the allocating the ret_stack and then calling the
callbacks attached to a function being traced, the ret_stack entry is no
longer used before it is allocated.
Cc: stable@kernel.org
Fixes: 03274a3ffb ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In the past, curr_ret_stack had two functions. One was to denote the depth
of the call graph, the other is to keep track of where on the ret_stack the
data is used. Although they may be slightly related, there are two cases
where they need to be used differently.
The one case is that it keeps the ret_stack data from being corrupted by an
interrupt coming in and overwriting the data still in use. The other is just
to know where the depth of the stack currently is.
The function profiler uses the ret_stack to save a "subtime" variable that
is part of the data on the ret_stack. If curr_ret_stack is modified too
early, then this variable can be corrupted.
The "max_depth" option, when set to 1, will record the first functions going
into the kernel. To see all top functions (when dealing with timings), the
depth variable needs to be lowered before calling the return hook. But by
lowering the curr_ret_stack, it makes the data on the ret_stack still being
used by the return hook susceptible to being overwritten.
Now that there's two variables to handle both cases (curr_ret_depth), we can
move them to the locations where they can handle both cases.
Cc: stable@kernel.org
Fixes: 03274a3ffb ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, the depth of the ret_stack is determined by curr_ret_stack index.
The issue is that there's a race between setting of the curr_ret_stack and
calling of the callback attached to the return of the function.
Commit 03274a3ffb ("tracing/fgraph: Adjust fgraph depth before calling
trace return callback") moved the calling of the callback to after the
setting of the curr_ret_stack, even stating that it was safe to do so, when
in fact, it was the reason there was a barrier() there (yes, I should have
commented that barrier()).
Not only does the curr_ret_stack keep track of the current call graph depth,
it also keeps the ret_stack content from being overwritten by new data.
The function profiler, uses the "subtime" variable of ret_stack structure
and by moving the curr_ret_stack, it allows for interrupts to use the same
structure it was using, corrupting the data, and breaking the profiler.
To fix this, there needs to be two variables to handle the call stack depth
and the pointer to where the ret_stack is being used, as they need to change
at two different locations.
Cc: stable@kernel.org
Fixes: 03274a3ffb ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As all architectures now call function_graph_enter() to do the entry work,
no architecture should ever call ftrace_push_return_trace(). Make it static.
This is needed to prepare for a fix of a design bug on how the curr_ret_stack
is used.
Cc: stable@kernel.org
Fixes: 03274a3ffb ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that synchronize_rcu() waits for preempt-disable regions of code
as well as RCU read-side critical sections, synchronize_sched() can
be replaced by synchronize_rcu(). Similarly, call_rcu_sched() can be
replaced by call_rcu(). This commit therefore makes these changes.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: <linux-kernel@vger.kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently all the architectures do basically the same thing in preparing the
function graph tracer on entry to a function. This code can be pulled into a
generic location and then this will allow the function graph tracer to be
fixed, as well as extended.
Create a new function graph helper function_graph_enter() that will call the
hook function (ftrace_graph_entry) and the shadow stack operation
(ftrace_push_return_trace), and remove the need of the architecture code to
manage the shadow stack.
This is needed to prepare for a fix of a design bug on how the curr_ret_stack
is used.
Cc: stable@kernel.org
Fixes: 03274a3ffb ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
A format string consisting of "%p" or "%s" followed by an invalid
specifier (e.g. "%p%\n" or "%s%") could pass the check which
would make format_decode (lib/vsprintf.c) to warn.
Fixes: 9c959c863f ("tracing: Allow BPF programs to call bpf_trace_printk()")
Reported-by: syzbot+1ec5c5ec949c4adaa0c4@syzkaller.appspotmail.com
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fix strpbrk()'s argument order, it must pass acceptable string
in 2nd argument. Note that this can cause a kernel panic where
it recovers backup character to code->data.
Link: http://lkml.kernel.org/r/154108256792.2604.1816052586385217811.stgit@devbox
Fixes: a6682814f3 ("tracing/kprobes: Allow kprobe-events to record module symbol")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvchGgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj/1D/4kEQx4ncnFoZk8QshHV1L++rH3BbcLjQDd
Wbh9ZSIQdI/gHTzS6bE7x3YfcbpMWPMO3+jFawdfRiFTEjlF8vQ+mnJ+Btb3z4D6
mGEeFGVhHExlp2a0x/Ma8YWVNlMB7BE8Tq73bZEVMY+9lbpmDW/vp7Sfa87LBDKQ
ZmY+My+VdHN7qLtQ7t3W/HtpbU+kcXMMd3ICjK4i+ofXy6mynk4+oQ2jwyXc5L86
UCJCsTsSRr3CgbnkW/uprHo0XHk8i7O/4C3oR+x4pAIxCCa9g+vmw0EO9fvi/2iQ
qe8jKdm7Y09xu/TiPBa7iz45tdh0cNMJKo3OezmSF9Np+r69KL5C/U4GRPKN3Iwm
keoqn14ScABkYMSe4ys1AdEgKD6bNUaW3r/lJxTH2oUR23mjnCLp7c4WD/G+MlbB
CzoakQyCHTZmDFLr2Kc8bkjmpil2T2UFfmLIDAu30LWIYeSGpiIO/V+g1foJMF2f
06ERltNvgX1BJjoh4NSWySLEf1ZtkUU60NeATRol6gwhnIyLrHsgfm6OEhqlW/7x
Xc1BWyzX7K6c3Dskk/u5aSRyXOyRC9KkMt3/2XexeDNHkte9yMH0IgSvopPBuER8
+iPvPjNp7ychTKZB3zpSnlqGgePTjbufIEBtO3OyUmDZKjUqxahtxkQfmPhoclu+
XdR4ArcqNg==
=0zM4
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20181102' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"The biggest part of this pull request is the revert of the blkcg
cleanup series. It had one fix earlier for a stacked device issue, but
another one was reported. Rather than play whack-a-mole with this,
revert the entire series and try again for the next kernel release.
Apart from that, only small fixes/changes.
Summary:
- Indentation fixup for mtip32xx (Colin Ian King)
- The blkcg cleanup series revert (Dennis Zhou)
- Two NVMe fixes. One fixing a regression in the nvme request
initialization in this merge window, causing nvme-fc to not work.
The other is a suspend/resume p2p resource issue (James, Keith)
- Fix sg discard merge, allowing us to merge in cases where we didn't
before (Jianchao Wang)
- Call rq_qos_exit() after the queue is frozen, preventing a hang
(Ming)
- Fix brd queue setup, fixing an oops if we fail setting up all
devices (Ming)"
* tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
nvme-pci: fix conflicting p2p resource adds
nvme-fc: fix request private initialization
blkcg: revert blkcg cleanups series
block: brd: associate with queue until adding disk
block: call rq_qos_exit() after queue is frozen
mtip32xx: clean an indentation issue, remove extraneous tabs
block: fix the DISCARD request merge
Move the Dell dcdbas and dell_rbu drivers into platform/drivers/x86 as
they are closely coupled with other drivers in this location.
Improve _init* usage for acerhdf and fix some usage issues with messages
and module parameters.
Simplify asus-wmi by calling ACPI/WMI methods directly, eliminating
workqueue overhead, eliminate double reporting of keyboard backlight.
Fix wake from USB failure on Bay Trail devices (intel_int0002_vgpio).
Notify intel_telemetry users when IPC1 device is not enabled.
Update various drivers with new laptop model IDs.
Update several intel drivers to use SPDX identifers and order headers
alphabetically.
The following is an automated git shortlog grouped by driver:
Add Intel AtomISP2 dummy / power-management driver:
- Add Intel AtomISP2 dummy / power-management driver
lg-laptop:
- Add LG Gram laptop special features driver
HID:
- asus: only support backlight when it's not driven by WMI
MAINTAINERS:
- intel_telemetry: Update maintainers info
- intel_pmc_core: Update MAINTAINERS
- Update maintainer for dcdbas and dell_rbu
- Use my infradead account exclusively for PDx86 work
acerhdf:
- restructure to allow large BIOS table be __initconst
- mark appropriate content with __init prefix
- Add BIOS entry for Gateway LT31 v1.3307
- Remove cut-and-paste trap from instructions
- Enable ability to list supported systems
- clarify modinfo messages for BIOS override
asus-wmi:
- export function for evaluating WMI methods
- Only notify kbd LED hw_change by fn-key pressed
- Simplify the keyboard brightness updating process
firmware:
- dcdbas: include linux/io.h
- dcdbas: Move dcdbas to drivers/platform/x86
- dell_rbu: Move dell_rbu to drivers/platform/x86
- dcdbas: Add support for WSMT ACPI table
- dell_rbu: Make payload memory uncachable
ideapad-laptop:
- Add Y530-15ICH to no_hw_rfkill
- Use __func__ instead of read_ec_cmd in pr_err
intel-hid:
- Convert to use SPDX identifier
intel-ips:
- Convert to use SPDX identifier
intel-rst:
- Convert to use SPDX identifier
- Sort headers alphabetically
intel-smartconnect:
- Convert to use SPDX identifier
- Sort headers alphabetically
intel-wmi-thunderbolt:
- Add dynamic debugging
- Convert to use SPDX identifier
intel_bxtwc_tmu:
- Convert to use SPDX identifier
intel_cht_int33fe:
- Convert to use SPDX identifier
intel_chtdc_ti_pwrbtn:
- Add SPDX identifier
intel_int0002_vgpio:
- Convert to use SPDX identifier
- Implement irq_set_wake
- Enable the driver on Bay Trail platforms
intel_menlow:
- Convert to use SPDX identifier
- Sort headers alphabetically
intel_mid_powerbtn:
- Convert to use SPDX identifier
- Remove unnecessary init.h inclusion
- Get rid of custom ICPU() macro
intel_mid_thermal:
- Convert to use SPDX identifier
- Sort headers alphabetically
intel_oaktrail:
- Convert to use SPDX identifier
- Sort headers alphabetically
intel_pmc:
- Convert to use SPDX identifier
- Sort headers alphabetically
intel_punit_ipc:
- Convert to use SPDX identifier
- Sort headers alphabetically
intel_scu_ipc:
- Convert to use SPDX identifier
- Sort headers alphabetically
intel_telemetry:
- Get rid of custom macro
- report debugfs failure
- Convert to use SPDX identifier
intel_turbo_max_3:
- Convert to use SPDX identifier
- Sort headers alphabetically
mlx-platform:
- Properly use mlxplat_mlxcpld_msn201x_items
touchscreen_dmi:
- Add min-x and min-y settings for various models
- Add info for the Onda V80 Plus v3 tablet
- Add info for the Trekstor Primetab T13B tablet
- Add info for the Trekstor Primebook C11 convertible
tracing:
- Trivia spelling fix containerof() -> container_of()
wmi:
- declare device_type structure as constant
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJb2mDCAAoJEKbMaAwKp364KdkH/Rar8cenw7uFeVzT5o4fGzTn
0Oo1Pnwh+hPgmQwmf8O/mHsa/1Zb9nX4nNLU/ysDUjAAZ9wZLoB57HlImQquPA74
c8lXTz5MVF140WB1f1Ck5PmeonJR2Xve7HTYCoBmGo71jOxHxCgJbp1Giiho0Imd
3NXzzahEnAiy/L5PhqqYTmIfTeAHQ2Fo7GkHCjMnS20FP2fBtvPrc7db03F+FZ4y
QTfbT0oR42NdeW21AbCRGY3P47axB3xDJX4LOYrb4uo4CIzd1yatlHk407NRRyv8
pab6aeJR23FouA/tcvv+T36D3vqRw78e6BWdoyowWOd7CUOgjtSC3lKGzfT6hSY=
=0BDZ
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v4.20-1' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver updates from Darren Hart:
- Move the Dell dcdbas and dell_rbu drivers into platform/drivers/x86
as they are closely coupled with other drivers in this location.
- Improve _init* usage for acerhdf and fix some usage issues with
messages and module parameters.
- Simplify asus-wmi by calling ACPI/WMI methods directly, eliminating
workqueue overhead, eliminate double reporting of keyboard backlight.
- Fix wake from USB failure on Bay Trail devices (intel_int0002_vgpio).
- Notify intel_telemetry users when IPC1 device is not enabled.
- Update various drivers with new laptop model IDs.
- Update several intel drivers to use SPDX identifers and order headers
alphabetically.
* tag 'platform-drivers-x86-v4.20-1' of git://git.infradead.org/linux-platform-drivers-x86: (64 commits)
HID: asus: only support backlight when it's not driven by WMI
platform/x86: asus-wmi: export function for evaluating WMI methods
platform/x86: asus-wmi: Only notify kbd LED hw_change by fn-key pressed
platform/x86: wmi: declare device_type structure as constant
platform/x86: ideapad: Add Y530-15ICH to no_hw_rfkill
platform/x86: Add Intel AtomISP2 dummy / power-management driver
platform/x86: touchscreen_dmi: Add min-x and min-y settings for various models
platform/x86: touchscreen_dmi: Add info for the Onda V80 Plus v3 tablet
platform/x86: touchscreen_dmi: Add info for the Trekstor Primetab T13B tablet
platform/x86: intel_telemetry: Get rid of custom macro
platform/x86: intel_telemetry: report debugfs failure
MAINTAINERS: intel_telemetry: Update maintainers info
platform/x86: Add LG Gram laptop special features driver
platform/x86: asus-wmi: Simplify the keyboard brightness updating process
platform/x86: touchscreen_dmi: Add info for the Trekstor Primebook C11 convertible
platform/x86: mlx-platform: Properly use mlxplat_mlxcpld_msn201x_items
MAINTAINERS: intel_pmc_core: Update MAINTAINERS
firmware: dcdbas: include linux/io.h
platform/x86: intel-wmi-thunderbolt: Add dynamic debugging
platform/x86: intel-wmi-thunderbolt: Convert to use SPDX identifier
...
Back in January I posted patches to create function based events. These were
the events that you suggested I make to allow developers to easily create
events in code where no trace event exists. After posting those changes for
review, it was suggested that we implement this instead with kprobes.
The problem with kprobes is that the interface is too complex and needs to
be simplified. Masami Hiramatsu posted patches in March and I've been
playing with them a bit. There's been a bit of clean up in the kprobe code
that was inspired by the function based event patches, and a couple of
enhancements to the kprobe event interface.
- If the arch supports it (we added support for x86), you can place a
kprobe event at the start of a function and use $arg1, $arg2, etc
to reference the arguments of a function. (Before you needed to know
what register or where on the stack the argument was).
- The second is a way to see array of events. For example, if you reference
a mac address, you can add:
echo 'p:mac ip_rcv perm_addr=+574($arg2):x8[6]' > kprobe_events
And this will produce:
mac: (ip_rcv+0x0/0x140) perm_addr={0x52,0x54,0x0,0xc0,0x76,0xec}
Other changes include
- Exporting trace_dump_stack to modules
- Have the stack tracer trace the entire stack (stop trying to remove
tracing itself, as we keep removing too much).
- Added support for SDT in uprobes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW9hdjxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qmtbAP9GS/o2WSvsYLSIw4+mF94eCL06lUxp
rRrktkEofm/PagEAl2JNmvHrAJN+LIrajqXTbwlZ7Ckk1rZhCW41Am7qnQs=
=sTUM
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The biggest change here is the updates to kprobes
Back in January I posted patches to create function based events.
These were the events that you suggested I make to allow developers to
easily create events in code where no trace event exists. After
posting those changes for review, it was suggested that we implement
this instead with kprobes.
The problem with kprobes is that the interface is too complex and
needs to be simplified. Masami Hiramatsu posted patches in March and
I've been playing with them a bit. There's been a bit of clean up in
the kprobe code that was inspired by the function based event patches,
and a couple of enhancements to the kprobe event interface.
- If the arch supports it (we added support for x86), you can place a
kprobe event at the start of a function and use $arg1, $arg2, etc
to reference the arguments of a function. (Before you needed to
know what register or where on the stack the argument was).
- The second is a way to see array of events. For example, if you
reference a mac address, you can add:
echo 'p:mac ip_rcv perm_addr=+574($arg2):x8[6]' > kprobe_events
And this will produce:
mac: (ip_rcv+0x0/0x140) perm_addr={0x52,0x54,0x0,0xc0,0x76,0xec}
Other changes include
- Exporting trace_dump_stack to modules
- Have the stack tracer trace the entire stack (stop trying to remove
tracing itself, as we keep removing too much).
- Added support for SDT in uprobes"
[ SDT - "Statically Defined Tracing" are userspace markers for tracing.
Let's not use random TLA's in explanations unless they are fairly
well-established as generic (at least for kernel people) - Linus ]
* tag 'trace-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (24 commits)
tracing: Have stack tracer trace full stack
tracing: Export trace_dump_stack to modules
tracing: probeevent: Fix uninitialized used of offset in parse args
tracing/kprobes: Allow kprobe-events to record module symbol
tracing/kprobes: Check the probe on unloaded module correctly
tracing/uprobes: Fix to return -EFAULT if copy_from_user failed
tracing: probeevent: Add $argN for accessing function args
x86: ptrace: Add function argument access API
tracing: probeevent: Add array type support
tracing: probeevent: Add symbol type
tracing: probeevent: Unify fetch_insn processing common part
tracing: probeevent: Append traceprobe_ for exported function
tracing: probeevent: Return consumed bytes of dynamic area
tracing: probeevent: Unify fetch type tables
tracing: probeevent: Introduce new argument fetching code
tracing: probeevent: Remove NOKPROBE_SYMBOL from print functions
tracing: probeevent: Cleanup argument field definition
tracing: probeevent: Cleanup print argument functions
trace_uprobe: support reference counter in fd-based uprobe
perf probe: Support SDT markers having reference counter (semaphore)
...
error return value, and the other is for the self tests.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW9hYsBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiWPAQCARhYpeiWNHYGAirpI1vxaNHzutZvP
j6eaqRXu0JDfsQD7Bb19KxSHUEaPSpZIo68N5OJkoTv9UxD2KYuRysu4VQ8=
=y1Gc
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19-rc8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Masami had a couple more fixes to the synthetic events. One was a
proper error return value, and the other is for the self tests"
* tag 'trace-v4.19-rc8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
selftests/ftrace: Fix synthetic event test to delete event correctly
tracing: Return -ENOENT if there is no target synthetic event
Return -ENOENT error if there is no target synthetic event.
This notices an operation failure to user as below;
# echo 'wakeup_latency u64 lat; pid_t pid;' > synthetic_events
# echo '!wakeup' >> synthetic_events
sh: write error: No such file or directory
Link: http://lkml.kernel.org/r/154013449986.25576.9487131386597290172.stgit@devbox
Acked-by: Tom Zanussi <zanussi@linux.intel.com>
Tested-by: Tom Zanussi <zanussi@linux.intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Rajvi Jingar <rajvi.jingar@intel.com>
Cc: stable@vger.kernel.org
Fixes: 4b147936fa ('tracing: Add support for 'synthetic' events')
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The stack tracer traces every function call checking the current stack (in
non interrupt context), looking for the deepest stack, and saving it when it
finds a new max depth. The problem is that it calls save_stack_trace(), and
with the new ORC unwinder, it can skip too much. As it looks at the ip of
the function call in the backtrace to find where it should start, it doesn't
need to skip anything.
The stack trace selftest would fail when the kernel was complied with the
ORC UNDWINDER enabled. Without skipping functions when doing the stack
trace, it now passes again.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
J0oh0FHBDg==
=LRTT
-----END PGP SIGNATURE-----
Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"This is the main pull request for block changes for 4.20. This
contains:
- Series enabling runtime PM for blk-mq (Bart).
- Two pull requests from Christoph for NVMe, with items such as;
- Better AEN tracking
- Multipath improvements
- RDMA fixes
- Rework of FC for target removal
- Fixes for issues identified by static checkers
- Fabric cleanups, as prep for TCP transport
- Various cleanups and bug fixes
- Block merging cleanups (Christoph)
- Conversion of drivers to generic DMA mapping API (Christoph)
- Series fixing ref count issues with blkcg (Dennis)
- Series improving BFQ heuristics (Paolo, et al)
- Series improving heuristics for the Kyber IO scheduler (Omar)
- Removal of dangerous bio_rewind_iter() API (Ming)
- Apply single queue IPI redirection logic to blk-mq (Ming)
- Set of fixes and improvements for bcache (Coly et al)
- Series closing a hotplug race with sysfs group attributes (Hannes)
- Set of patches for lightnvm:
- pblk trace support (Hans)
- SPDX license header update (Javier)
- Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
specs behind a common core interface. (Javier, Matias)
- Enable pblk to use a common interface to retrieve chunk metadata
(Matias)
- Bug fixes (Various)
- Set of fixes and updates to the blk IO latency target (Josef)
- blk-mq queue number updates fixes (Jianchao)
- Convert a bunch of drivers from the old legacy IO interface to
blk-mq. This will conclude with the removal of the legacy IO
interface itself in 4.21, with the rest of the drivers (me, Omar)
- Removal of the DAC960 driver. The SCSI tree will introduce two
replacement drivers for this (Hannes)"
* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
block: setup bounce bio_sets properly
blkcg: reassociate bios when make_request() is called recursively
blkcg: fix edge case for blk_get_rl() under memory pressure
nvme-fabrics: move controller options matching to fabrics
nvme-rdma: always have a valid trsvcid
mtip32xx: fully switch to the generic DMA API
rsxx: switch to the generic DMA API
umem: switch to the generic DMA API
sx8: switch to the generic DMA API
sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
skd: switch to the generic DMA API
ubd: remove use of blk_rq_map_sg
nvme-pci: remove duplicate check
drivers/block: Remove DAC960 driver
nvme-pci: fix hot removal during error handling
nvmet-fcloop: suppress a compiler warning
nvme-core: make implicit seed truncation explicit
nvmet-fc: fix kernel-doc headers
nvme-fc: rework the request initialization code
nvme-fc: introduce struct nvme_fcp_op_w_sgl
...
Fix synthetic event to allow independent semicolon at end.
The synthetic_events interface accepts a semicolon after the
last word if there is no space.
# echo "myevent u64 var;" >> synthetic_events
But if there is a space, it returns an error.
# echo "myevent u64 var ;" > synthetic_events
sh: write error: Invalid argument
This behavior is difficult for users to understand. Let's
allow the last independent semicolon too.
Link: http://lkml.kernel.org/r/153986835420.18251.2191216690677025744.stgit@devbox
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: stable@vger.kernel.org
Fixes: commit 4b147936fa ("tracing: Add support for 'synthetic' events")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix synthetic event to accept unsigned modifier for its field type
correctly.
Currently, synthetic_events interface returns error for "unsigned"
modifiers as below;
# echo "myevent unsigned long var" >> synthetic_events
sh: write error: Invalid argument
This is because argv_split() breaks "unsigned long" into "unsigned"
and "long", but parse_synth_field() doesn't expected it.
With this fix, synthetic_events can handle the "unsigned long"
correctly like as below;
# echo "myevent unsigned long var" >> synthetic_events
# cat synthetic_events
myevent unsigned long var
Link: http://lkml.kernel.org/r/153986832571.18251.8448135724590496531.stgit@devbox
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: stable@vger.kernel.org
Fixes: commit 4b147936fa ("tracing: Add support for 'synthetic' events")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The preemptirq_delay_test module is used for the ftrace selftest code that
tests the latency tracers. The problem is that it uses ktime for the delay
loop, and then checks the tracer to see if the delay loop is caught, but the
tracer uses trace_clock_local() which uses various different other clocks to
measure the latency. As ktime uses the clock cycles, and the code then
converts that to nanoseconds, it causes rounding errors, and the preemptirq
latency tests are failing due to being off by 1 (it expects to see a delay
of 500000 us, but the delay is only 499999 us). This is happening due to a
rounding error in the ktime (which is totally legit). The purpose of the
test is to see if it can catch the delay, not to test the accuracy between
trace_clock_local() and ktime_get(). Best to use apples to apples, and have
the delay loop use the same clock as the latency tracer does.
Cc: stable@vger.kernel.org
Fixes: f96e8577da ("lib: Add module for testing preemptoff/irqsoff latency tracers")
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Dan's smatch utility found an uninitialized use of offset in a path in
parse_probe_args(). Unless an offset is specifically specified for commands
that allow them, it should default to zero.
Link: http://lkml.kernel.org/r/20181012134246.5doqaobxunlqqs53@mwanda
Fixes: 533059281e ("tracing: probeevent: Introduce new argument fetching code")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> (smatch)
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Allow kprobe-events to record module symbols.
Since data symbols in a non-loaded module doesn't exist, it fails to
define such symbol as an argument of kprobe-event. But if the kprobe
event is defined on that module, we can defer to resolve the symbol
address.
Note that if given symbol is not found, the event is kept unavailable.
User can enable it but the event is not recorded.
Link: http://lkml.kernel.org/r/153547312336.26502.11432902826345374463.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Current kprobe event doesn't checks correctly whether the
given event is on unloaded module or not. It just checks
the event has ":" in the name.
That is not enough because if we define a probe on non-exist
symbol on loaded module, it allows to define that (with
warning message)
To ensure it correctly, this searches the module name on
loaded module list and only if there is not, it allows to
define it. (this event will be available when the target
module is loaded)
Link: http://lkml.kernel.org/r/153547309528.26502.8300278470528281328.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fix probe_mem_read() to return -EFAULT if copy_from_user()
failed. The copy_from_user() returns remaining bytes
when it failed, but probe_mem_read() caller expects it
returns error code like as probe_kernel_read().
Link: http://lkml.kernel.org/r/153547306719.26502.8353484532699160223.stgit@devbox
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add $argN special fetch variable for accessing function
arguments. This allows user to trace the Nth argument easily
at the function entry.
Note that this returns most probably assignment of registers
and stacks. In some case, it may not work well. If you need
to access correct registers or stacks you should use perf-probe.
Link: http://lkml.kernel.org/r/152465888632.26224.3412465701570253696.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add array type support for probe events.
This allows user to get arraied types from memory address.
The array type syntax is
TYPE[N]
Where TYPE is one of types (u8/16/32/64,s8/16/32/64,
x8/16/32/64, symbol, string) and N is a fixed value less
than 64.
The string array type is a bit different from other types. For
other base types, <base-type>[1] is equal to <base-type>
(e.g. +0(%di):x32[1] is same as +0(%di):x32.) But string[1] is not
equal to string. The string type itself represents "char array",
but string array type represents "char * array". So, for example,
+0(%di):string[1] is equal to +0(+0(%di)):string.
Link: http://lkml.kernel.org/r/152465891533.26224.6150658225601339931.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add "symbol" type to probeevent, which is an alias of u32 or u64
(depends on BITS_PER_LONG). This shows the result value in
symbol+offset style. This type is only available with kprobe
events.
Link: http://lkml.kernel.org/r/152465882860.26224.14779072294412467338.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Unify the fetch_insn bottom process (from stage 2: dereference
indirect data) from kprobe and uprobe events, since those are
mostly same.
Link: http://lkml.kernel.org/r/152465879965.26224.8547240824606804815.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cleanup string fetching routine so that returns the consumed
bytes of dynamic area and store the string information as
data_loc format instead of data_rloc.
This simplifies the fetcharg loop.
Link: http://lkml.kernel.org/r/152465874163.26224.12125143907501289031.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Unify {k,u}probe_fetch_type_table to probe_fetch_type_table
because the main difference of those type tables (fetcharg
methods) are gone. Now we can consolidate it.
Link: http://lkml.kernel.org/r/152465871274.26224.13999436317830479698.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Replace {k,u}probe event argument fetching framework with switch-case based.
Currently that is implemented with structures, macros and chain of
function-pointers, which is more complicated than necessary and may get a
performance penalty by retpoline.
This simplify that with an array of "fetch_insn" (opcode and oprands), and
make process_fetch_insn() just interprets it. No function pointers are used.
Link: http://lkml.kernel.org/r/152465868340.26224.2551120475197839464.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Remove unneeded NOKPROBE_SYMBOL from print functions since
the print functions are only used when printing out the
trace data, and not from kprobe handler.
Link: http://lkml.kernel.org/r/152465865422.26224.10111548170594014954.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cleanup the print-argument function to decouple it into
print-name and print-value, so that it can support more
flexible expression, like array type.
Link: http://lkml.kernel.org/r/152465859635.26224.13452846788717102315.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This patch enables uprobes with reference counter in fd-based uprobe.
Highest 32 bits of perf_event_attr.config is used to stored offset
of the reference count (semaphore).
Format information in /sys/bus/event_source/devices/uprobe/format/ is
updated to reflect this new feature.
Link: http://lkml.kernel.org/r/20181002053636.1896903-1-songliubraving@fb.com
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
=dMl4
-----END PGP SIGNATURE-----
Merge tag 'v4.19-rc6' into for-4.20/block
Merge -rc6 in, for two reasons:
1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
they aren't in the 4.20 branch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* tag 'v4.19-rc6': (780 commits)
Linux 4.19-rc6
MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
cpufreq: qcom-kryo: Fix section annotations
perf/core: Add sanity check to deal with pinned event failure
xen/blkfront: correct purging of persistent grants
Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
selftests/powerpc: Fix Makefiles for headers_install change
blk-mq: I/O and timer unplugs are inverted in blktrace
dax: Fix deadlock in dax_lock_mapping_entry()
x86/boot: Fix kexec booting failure in the SEV bit detection code
bcache: add separate workqueue for journal_write to avoid deadlock
drm/amd/display: Fix Edid emulation for linux
drm/amd/display: Fix Vega10 lightup on S3 resume
drm/amdgpu: Fix vce work queue was not cancelled when suspend
Revert "drm/panel: Add device_link from panel device to DRM device"
xen/blkfront: When purging persistent grants, keep them in the buffer
clocksource/drivers/timer-atmel-pit: Properly handle error cases
block: fix deadline elevator drain for zoned block devices
ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
...
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the only location on kernel that has wrong spelling
of the container_of() helper. Fix it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
We assume to have only one reference counter for one uprobe.
Don't allow user to add multiple trace_uprobe entries having
same inode+offset but different reference counter.
Ex,
# echo "p:sdt_tick/loop2 /home/ravi/tick:0x6e4(0x10036)" > uprobe_events
# echo "p:sdt_tick/loop2_1 /home/ravi/tick:0x6e4(0xfffff)" >> uprobe_events
bash: echo: write error: Invalid argument
# dmesg
trace_kprobe: Reference counter offset mismatch.
There is one exception though:
When user is trying to replace the old entry with the new
one, we allow this if the new entry does not conflict with
any other existing entries.
Link: http://lkml.kernel.org/r/20180820044250.11659-4-ravi.bangoria@linux.ibm.com
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
have these markers embedded in them. These markers are added by developer
at important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:
if (reference_counter > 0) {
Execute marker instructions;
}
Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.
Implement the reference counter logic in core uprobe. User will be
able to use it from trace_uprobe as well as from kernel module. New
trace_uprobe definition with reference counter will now be:
<path>:<offset>[(ref_ctr_offset)]
where ref_ctr_offset is an optional field. For kernel module, new
variant of uprobe_register() has been introduced:
uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer)
No new variant for uprobe_unregister() because it's assumed to have
only one reference counter for one uprobe.
[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
Note: 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are calling it a 'reference counter'
in kernel / perf code.
Link: http://lkml.kernel.org/r/20180820044250.11659-2-ravi.bangoria@linux.ibm.com
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
[Only trace_uprobe.c]
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
By utilizing a temporary variable, we can avoid adding another call to
strchr(). Instead, save the first call to a temp variable, and then use that
variable as the reference to set the event variable.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Prior patches ensured that all bios are now associated with some blkg.
This now makes bio->bi_css unnecessary as blkg maintains a reference to
the blkcg already.
This patch removes the field bi_css and transfers corresponding uses to
access via bi_blkg.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When reducing ring buffer size, pages are removed by scheduling a work
item on each CPU for the corresponding CPU ring buffer. After the pages
are removed from ring buffer linked list, the pages are free()d in a
tight loop. The loop does not give up CPU until all pages are removed.
In a worst case behavior, when lot of pages are to be freed, it can
cause system stall.
After the pages are removed from the list, the free() can happen while
the work is rescheduled. Call cond_resched() in the loop to prevent the
system hangup.
Link: http://lkml.kernel.org/r/20180907223129.71994-1-vnagarnaik@google.com
Cc: stable@vger.kernel.org
Fixes: 83f40318da ("ring-buffer: Make removal of ring buffer pages atomic")
Reported-by: Jason Behmer <jbehmer@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
from being traced by kprobes. During my testing, I found that there's places
that we may want to add kprobes to notrace, thus we may end up changing this
code before 4.19 is released. The history behind this change is that we
found that adding kprobes to various notrace functions caused the kernel to
crashed. We took the safe route and decided not to allow kprobes to trace
any notrace function. But because notrace is added to functions that just
cause weird side effects to the function tracer, but are still safe,
preventing kprobes for all notrace functios may be too much of a big hammer.
One such place is __schedule() is marked notrace, to keep function tracer
from doing strange recursive loops when it gets traced with NEED_RESCHED
set. With this change, one can not add kprobes to the scheduler.
Masami also added code to use gcov on ftrace.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW34RpBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhwzAQDITRob1VjOF8RavMbfBZZuhwHPQROF
XsIFcD+jghFOUgEA1zXX4w+SwOV7S+chpGKMh3DlNKIbYPA+8eiMNFfDsA4=
=u+oq
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Masami found an off by one bug in the code that keeps "notrace"
functions from being traced by kprobes. During my testing, I found
that there's places that we may want to add kprobes to notrace, thus
we may end up changing this code before 4.19 is released.
The history behind this change is that we found that adding kprobes to
various notrace functions caused the kernel to crashed. We took the
safe route and decided not to allow kprobes to trace any notrace
function.
But because notrace is added to functions that just cause weird side
effects to the function tracer, but are still safe, preventing kprobes
for all notrace functios may be too much of a big hammer.
One such place is __schedule() is marked notrace, to keep function
tracer from doing strange recursive loops when it gets traced with
NEED_RESCHED set. With this change, one can not add kprobes to the
scheduler.
Masami also added code to use gcov on ftrace"
* tag 'trace-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Fix to check notrace function with correct range
tracing: Allow gcov profiling on only ftrace subsystem
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v
9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS
Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV
33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/
AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE
hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj
m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek
0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+
zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC
MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo
SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F
nbLjRM8L6g==
=FS7e
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block
Pull more block updates from Jens Axboe:
- Set of bcache fixes and changes (Coly)
- The flush warn fix (me)
- Small series of BFQ fixes (Paolo)
- wbt hang fix (Ming)
- blktrace fix (Steven)
- blk-mq hardware queue count update fix (Jianchao)
- Various little fixes
* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
block/DAC960.c: make some arrays static const, shrinks object size
blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
blk-mq: init hctx sched after update ctx and hctx mapping
block: remove duplicate initialization
tracing/blktrace: Fix to allow setting same value
pktcdvd: fix setting of 'ret' error return for a few cases
block: change return type to bool
block, bfq: return nbytes and not zero from struct cftype .write() method
block, bfq: improve code of bfq_bfqq_charge_time
block, bfq: reduce write overcharge
block, bfq: always update the budget of an entity when needed
block, bfq: readd missing reset of parent-entity service
blk-wbt: fix IO hang in wbt_wait()
block: don't warn for flush on read-only device
bcache: add the missing comments for smp_mb()/smp_wmb()
bcache: remove unnecessary space before ioctl function pointer arguments
bcache: add missing SPDX header
bcache: move open brace at end of function definitions to next line
bcache: add static const prefix to char * array declarations
bcache: fix code comments style
...
Fix within_notrace_func() to check notrace function correctly.
Since the ftrace_location_range(start, end) function checks
the range inclusively (start <= ftrace-loc <= end), the end
address must not include the entry address of next function.
However, within_notrace_func() uses kallsyms_lookup_size_offset()
to get the function size and calculate the end address from
adding the size to the entry address. This means the end address
is the entry address of the next function.
In the result, within_notrace_func() fails to find notrace
function if the next function of the target function is
ftraced.
Let's subtract 1 from the end address so that ftrace_location_range()
can check it correctly.
Link: http://lkml.kernel.org/r/153485669706.16611.17726752296213785504.stgit@devbox
Fixes: commit 45408c4f92 ("tracing: kprobes: Prohibit probing on notrace function")
Reported-by: Michael Rodin <michael@rodin.online>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add GCOV_PROFILE_FTRACE to allow gcov profiling on only files in ftrace
subsystem. This config option will be used for checking kselftest/ftrace
coverage.
Link: http://lkml.kernel.org/r/153483647755.32472.4746349899604275441.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of
a lot of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockde and the latency tracers
just get called directly (without using the trace events).
But because the original change cleaned up the code very nicely
we kept that, as well as the trace events for preempt and irqs
disabling, but they are limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not
allow them to be called in NMI context. Waiting till Paul makes
an NMI safe SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested
before the merge window opened.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
=vgEr
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Restructure of lockdep and latency tracers
This is the biggest change. Joel Fernandes restructured the hooks
from irqs and preemption disabling and enabling. He got rid of a lot
of the preprocessor #ifdef mess that they caused.
He turned both lockdep and the latency tracers to use trace events
inserted in the preempt/irqs disabling paths. But unfortunately,
these started to cause issues in corner cases. Thus, parts of the
code was reverted back to where lockdep and the latency tracers just
get called directly (without using the trace events). But because the
original change cleaned up the code very nicely we kept that, as well
as the trace events for preempt and irqs disabling, but they are
limited to not being called in NMIs.
- Have trace events use SRCU for "rcu idle" calls. This was required
for the preempt/irqs off trace events. But it also had to not allow
them to be called in NMI context. Waiting till Paul makes an NMI safe
SRCU API.
- New notrace SRCU API to allow trace events to use SRCU.
- Addition of mcount-nop option support
- SPDX headers replacing GPL templates.
- Various other fixes and clean ups.
- Some fixes are marked for stable, but were not fully tested before
the merge window opened.
* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
tracing: Fix SPDX format headers to use C++ style comments
tracing: Add SPDX License format tags to tracing files
tracing: Add SPDX License format to bpf_trace.c
blktrace: Add SPDX License format header
s390/ftrace: Add -mfentry and -mnop-mcount support
tracing: Add -mcount-nop option support
tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
tracing: Handle CC_FLAGS_FTRACE more accurately
Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Uprobes: Simplify uprobe_register() body
tracepoints: Free early tracepoints after RCU is initialized
uprobes: Use synchronize_rcu() not synchronize_sched()
tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
ftrace: Remove unused pointer ftrace_swapper_pid
tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing/irqsoff: Handle preempt_count for different configs
tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
tracing: irqsoff: Account for additional preempt_disable
trace: Use rcu_dereference_raw for hooks from trace-event subsystem
tracing/kprobes: Fix within_notrace_func() to check only notrace functions
...
The Linux kernel adopted the SPDX License format headers to ease license
compliance management, and uses the C++ '//' style comments for the SPDX
header tags. Some files in the tracing directory used the C style /* */
comments for them. To be consistent across all files, replace the /* */
C style SPDX tags with the C++ // SPDX tags.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Masami Hiramatsu reported:
Current trace-enable attribute in sysfs returns an error
if user writes the same setting value as current one,
e.g.
# cat /sys/block/sda/trace/enable
0
# echo 0 > /sys/block/sda/trace/enable
bash: echo: write error: Invalid argument
# echo 1 > /sys/block/sda/trace/enable
# echo 1 > /sys/block/sda/trace/enable
bash: echo: write error: Device or resource busy
But this is not a preferred behavior, it should ignore
if new setting is same as current one. This fixes the
problem as below.
# cat /sys/block/sda/trace/enable
0
# echo 0 > /sys/block/sda/trace/enable
# echo 1 > /sys/block/sda/trace/enable
# echo 1 > /sys/block/sda/trace/enable
Link: http://lkml.kernel.org/r/20180816103802.08678002@gandalf.local.home
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: cd649b8bb8 ("blktrace: remove sysfs_blk_trace_enable_show/store()")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-mcount-nop gcc option generates the calls to the profiling functions
as nops which allows to avoid patching mcount jump with NOP instructions
initially.
-mcount-nop gcc option will be activated if platform selects
HAVE_NOP_MCOUNT and gcc actually supports it.
In addition to that CC_USING_NOP_MCOUNT is defined and could be used by
architectures to adapt ftrace patching behavior.
Link: http://lkml.kernel.org/r/patch-3.thread-aa7b8d.git-e02ed2dc082b.your-ad-here.call-01533557518-ext-9465@work.hours
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJbct5hAAoJEFKgDEdIgJTysFoP/jRoozkNpUr4fqcbN+1GCSjY
/vaWMJ6cQBHm9b5AE+EYwomTqx5gkaaAw8oR3tnHvBa2NGsE4xzO2LUyuH95FgMo
cE74MGS492xeCskEBHk0pbrdD+DOuIIqWlQcDP0mH0OhlBiuW/+j72HhDfOU5yTv
uE/X+SyxnYgWW9rKGrkmHqVwJPiEgjtr1tsfsewWVF4MNndCAFYeOgpFZBfhBdnq
g57IRjTHLnkAClFrOzUpGIZ/tvN6iyENug860EORYFzSO2SVZ3Jm7v8AcXLQ6lj0
LKmBAoFWeJ95rqJS+rBq3s0hWQLtbVJgQyz2ujN5+a+oZ/+brtQmTHjD9Nri0cSf
cKUmSfMX4tqdMcGjc7B2RqAtdthst2kYy6RXlp4BzYcdoxDr4U45VTqS8gZQuPSt
cdLUgoUEEX7nt1zlQER6GDzWc3FVKDXSHbDM+4OykFF3k1JCgoDq0kemAmsZw2A6
2QGs7RqBKYidR7DB+beZj2YV0ihRkykwb5c+NUASJV5mIB8TjjmOEw52Q0Fc9Hl1
WbA0wm/+pNG7ur9hF0VZ7zy9Rfxgx2Ec5ekWGTgt8xog3B1Dsqix5RpGRDX/f1u8
/dJxHD70pzADwdXXF5hrdCRtCXsTd/27SnWREYhMhfqwu/awc3boOKonOITCYmvv
7eiffcIaUbfiKsIhgrf+
=TTFG
-----END PGP SIGNATURE-----
Merge tag 'printk-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk
Pull printk updates from Petr Mladek:
- Different vendors have a different expectation about a console
quietness. Make it configurable to reduce bike-shedding about the
upstream default
- Decide about the message visibility when the message is stored. It
avoids races caused by a delayed console handling
- Always store printk() messages into the per-CPU buffers again in NMI.
The only exception is when flushing trace log in panic(). There the
risk of loosing messages is worth an eventual reordering
- Handle invalid %pO printf modifiers correctly
- Better handle %p printf modifier tests before crng is initialized
- Some clean up
* tag 'printk-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
lib/vsprintf: Do not handle %pO[^F] as %px
printk: Fix warning about unused suppress_message_printing
printk/nmi: Prevent deadlock when accessing the main log buffer in NMI
printk: Create helper function to queue deferred console handling
printk: Split the code for storing a message into the log buffer
printk: Clean up syslog_print_all()
printk: Remove unnecessary kmalloc() from syslog during clear
printk: Make CONSOLE_LOGLEVEL_QUIET configurable
printk: make sure to print log on console.
lib/test_printf.c: accept "ptrval" as valid result for plain 'p' tests
small fixes and updates. We also have new ktime_get_*() docs from Arnd,
some kernel-doc fixes, a new set of Italian translations (non so se vale la
pena, ma non fa male - speriamo bene), and some extensive early
memory-management documentation improvements from Mike Rapoport.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbcZVtAAoJEI3ONVYwIuV64ekP/jgTlMi/fErRu6zlsrsWgiIK
ir8ueCQ1OSiwOA+N2fUb+2+zlLlfTLgQ+o5IwmZk6rizG87fQ3Rp6i+bvYAZWITh
YUuls3VhtRlJZqG1EW7gww1Q2IhRO6GhcpsIamAvhrSLFPaCKiN3JomJi/X47Pfj
Ibl24HsruI2fDM1JwWRwCtE5J6vCL9lH1/5v4zVv7xdrVgTrwkZ/hAsE7HBNNat5
dSku2u9HSAXa4KR4sLWrVJ8UI5+fylwilz/57HhCeduQDwKCHE/mfhxLdqL4Oa4q
oHTCNq2zTUj4w7GTvHS1g0P3y/iWMYjAzH2is+BokilpIC65NwwsKx2ybZd3Srdh
zwP/kYk5U+mYSgdDlyNqwPCibw8KDXB3srKMzyQSN6tkosKCOHFSXF0Js0eupi7t
NqmGigl3Qozj1uvU6Wy7vh58u+GFeuO4PF566t2m70Jp0cWzuVKLrBvgNO1X37rB
aEBrpOYB/H54t/qf79IFW//pptWXFNZ3S9AgyDVIcmX5C2ihaCoaPNRTom+KbH/D
QEoH9rwWSoCi2DGoR83D+G8thCUfB4yfEGulSSIA4pUR7qvIR5rd1ZioI/qtgAHm
l7MjTbLpPwiMnpFkBrxxxlFFb4gbETakMBGYoYee8ww5WbQLu0qA93AbwIXyjhE8
mqCOLyBdCAZ0mNxqPSsc
=x/P0
-----END PGP SIGNATURE-----
Merge tag 'docs-4.19' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"This was a moderately busy cycle for docs, with the usual collection
of small fixes and updates.
We also have new ktime_get_*() docs from Arnd, some kernel-doc fixes,
a new set of Italian translations (non so se vale la pena, ma non fa
male - speriamo bene), and some extensive early memory-management
documentation improvements from Mike Rapoport"
* tag 'docs-4.19' of git://git.lwn.net/linux: (52 commits)
Documentation: corrections to console/console.txt
Documentation: add ioctl number entry for v4l2-subdev.h
Remove gendered language from management style documentation
scripts/kernel-doc: Escape all literal braces in regexes
docs/mm: add description of boot time memory management
docs/mm: memblock: add overview documentation
docs/mm: memblock: add kernel-doc description for memblock types
docs/mm: memblock: add kernel-doc comments for memblock_add[_node]
docs/mm: memblock: update kernel-doc comments
mm/memblock: add a name for memblock flags enumeration
docs/mm: bootmem: add overview documentation
docs/mm: bootmem: add kernel-doc description of 'struct bootmem_data'
docs/mm: bootmem: fix kernel-doc warnings
docs/mm: nobootmem: fixup kernel-doc comments
mm/bootmem: drop duplicated kernel-doc comments
Documentation: vm.txt: Adding 'nr_hugepages_mempolicy' parameter description.
doc:it_IT: translation for kernel-hacking
docs: Fix the reference labels in Locking.rst
doc: tracing: Fix a typo of trace_stat
mm: Introduce new type vm_fault_t
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
Zolos5H+JA==
=b9ib
-----END PGP SIGNATURE-----
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
While debugging another bug, I was looking at all the synchronize*()
functions being used in kernel/trace, and noticed that trace_uprobes was
using synchronize_sched(), with a comment to synchronize with
{u,ret}_probe_trace_func(). When looking at those functions, the data is
protected with "rcu_read_lock()" and not with "rcu_read_lock_sched()". This
is using the wrong synchronize_*() function.
Link: http://lkml.kernel.org/r/20180809160553.469e1e32@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 70ed91c6ec ("tracing/uprobes: Support ftrace_event_file base multibuffer")
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that some trace events can be protected by srcu_read_lock(tracepoint_srcu),
we need to make sure all locations that depend on this are also protected.
There were many places that did a synchronize_sched() thinking that it was
enough to protect againts access to trace events. This use to be the case,
but now that we use SRCU for _rcuidle() trace events, they may not be
protected by synchronize_sched(), as they may be called in paths that RCU is
not watching for preempt disable.
Fixes: e6753f23d9 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pointer ftrace_swapper_pid is defined but is never used hence it is
redundant and can be removed. The use of this variable was removed
in commit 345ddcc882 ("ftrace: Have set_ftrace_pid use the bitmap
like events do").
Cleans up clang warning:
warning: 'ftrace_swapper_pid' defined but not used [-Wunused-const-variable=]
Link: http://lkml.kernel.org/r/20180809125609.13142-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Joel Fernandes created a nice patch that cleaned up the duplicate hooks used
by lockdep and irqsoff latency tracer. It made both use tracepoints. But the
latency tracer is triggering warnings when using tracepoints to call into
the latency tracer's routines. Mainly, they can be called from NMI context.
If that happens, then the SRCU may not work properly because on some
architectures, SRCU is not safe to be called in both NMI and non-NMI
context.
This is a partial revert of the clean up patch c3bc8fd637 ("tracing:
Centralize preemptirq tracepoints and unify their usage") that adds back the
direct calls into the latency tracer. It also only calls the trace events
when not in NMI.
Link: http://lkml.kernel.org/r/20180809210654.622445925@goodmis.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Fixes: c3bc8fd637 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
I was hitting the following warning:
WARNING: CPU: 0 PID: 1 at kernel/trace/trace_irqsoff.c:631 tracer_hardirqs_off+0x15/0x2a
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6-test+ #13
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: tracer_hardirqs_off+0x15/0x2a
Code: ff 85 c0 74 0e 8b 45 00 8b 50 04 8b 45 04 e8 35 ff ff ff 5d c3 55 64 a1 cc 37 51 c1 a9 ff ff ff 7f 89 e5 53 89 d3 89 ca 75 02 <0f> 0b e8 90 fc ff ff 85 c0 74 07 89 d8 e8 0c ff ff ff 5b 5d c3 55
EAX: 80000000 EBX: c04337f0 ECX: c04338e3 EDX: c04338e3
ESI: c04337f0 EDI: c04338e3 EBP: f2aa1d68 ESP: f2aa1d64
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210046
CR0: 80050033 CR2: 00000000 CR3: 01668000 CR4: 001406f0
Call Trace:
trace_irq_disable_rcuidle+0x63/0x6c
trace_hardirqs_off+0x26/0x30
default_send_IPI_mask_allbutself_logical+0x31/0x93
default_send_IPI_allbutself+0x37/0x48
native_send_call_func_ipi+0x4d/0x6a
smp_call_function_many+0x165/0x19d
? add_nops+0x34/0x34
? trace_hardirqs_on_caller+0x2d/0x2d
? add_nops+0x34/0x34
smp_call_function+0x1f/0x23
on_each_cpu+0x15/0x43
? trace_hardirqs_on_caller+0x2d/0x2d
? trace_hardirqs_on_caller+0x2d/0x2d
? trace_irq_disable_rcuidle+0x1/0x6c
text_poke_bp+0xa0/0xc2
? trace_hardirqs_on_caller+0x2d/0x2d
arch_jump_label_transform+0xa7/0xcb
? trace_irq_disable_rcuidle+0x5/0x6c
__jump_label_update+0x3e/0x6d
jump_label_update+0x7d/0x81
static_key_slow_inc_cpuslocked+0x58/0x6d
static_key_slow_inc+0x19/0x20
tracepoint_probe_register_prio+0x19e/0x1d1
? start_critical_timings+0x1c/0x1c
tracepoint_probe_register+0xf/0x11
irqsoff_tracer_init+0x21/0xf2
tracer_init+0x16/0x1a
trace_selftest_startup_irqsoff+0x25/0xc4
run_tracer_selftest+0xca/0x131
register_tracer+0xd5/0x172
? trace_event_define_fields_preemptirq_template+0x45/0x45
init_irqsoff_tracer+0xd/0x11
do_one_initcall+0xab/0x1e8
? rcu_read_lock_sched_held+0x3d/0x44
? trace_initcall_level+0x52/0x86
kernel_init_freeable+0x195/0x21a
? rest_init+0xb4/0xb4
kernel_init+0xd/0xe4
ret_from_fork+0x2e/0x38
It is due to running a CONFIG_PREEMPT_VOLUNTARY kernel, which would trigger
this warning every time:
WARN_ON_ONCE(preempt_count());
Because on CONFIG_PREEMPT_VOLUNTARY, preempt_count() is always zero.
This warning is to make sure preempt_count is set because
tracer_hardirqs_on() does a preempt_enable_notrace() to make the
preempt_trace() work properly, as being called by a trace event, the trace
event code disables preemption, and the tracer wants to know what the
preemption was before it was called.
Instead of enabling preemption like this, just record the preempt_count,
subtract PREEMPT_DISABLE_OFFSET from it (which is zero with !CONFIG_PREEMPT
set), and pass that value to the necessary functions, which should use the
passed in parameter instead of calling preempt_count() directly.
Fixes: da5b3ebb45 ("tracing: irqsoff: Account for additional preempt_disable")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Joel Fernandes created a nice patch that cleaned up the duplicate hooks used
by lockdep and irqsoff latency tracer. It made both use tracepoints. But it
caused lockdep to trigger several false positives. We have not figured out
why yet, but removing lockdep from using the trace event hooks and just call
its helper functions directly (like it use to), makes the problem go away.
This is a partial revert of the clean up patch c3bc8fd637 ("tracing:
Centralize preemptirq tracepoints and unify their usage") that adds direct
calls for lockdep, but also keeps most of the clean up done to get rid of
the horrible preprocessor if statements.
Link: http://lkml.kernel.org/r/20180806155058.5ee875f4@gandalf.local.home
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Fixes: c3bc8fd637 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Recently we tried to make the preemptirqsoff tracer to use irqsoff
tracepoint probes. However this causes issues as reported by Masami:
[2.271078] Testing tracer preemptirqsoff: .. no entries found ..FAILED!
[2.381015] WARNING: CPU: 0 PID: 1 at /home/mhiramat/ksrc/linux/kernel/
trace/trace.c:1512 run_tracer_selftest+0xf3/0x154
This is due to the tracepoint code increasing the preempt nesting count
by calling an additional preempt_disable before calling into the
preemptoff tracer which messes up the preempt_count() check in
tracer_hardirqs_off.
To fix this, make the irqsoff tracer probes balance the additional outer
preempt_disable with a preempt_enable_notrace.
The other way to fix this is to just use SRCU for all tracepoints.
However we can't do that because we can't use NMIs from RCU context.
Link: http://lkml.kernel.org/r/20180806034049.67949-1-joel@joelfernandes.org
Fixes: c3bc8fd637 ("tracing: Centralize preemptirq tracepoints and unify their usage")
Fixes: e6753f23d9 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
vJJlibI=
=42a5
-----END PGP SIGNATURE-----
Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.
Signed-of-by: Jens Axboe <axboe@kernel.dk>
Fix within_notrace_func() to check only notrace functions and to ignore the
kprobe-event which can not solve symbol addresses.
within_notrace_func() returns true if the given kprobe events probe point
seems to be out-of-range. But that is not the correct place to check for it,
it should be done in kprobes afterwards.
kprobe-events allow users to define a probe point on "currently unloaded
module" so that it can trace the event during module load. In this case, the
user will put a probe on a symbol which is not in kallsyms yet and it hits
the within_notrace_func(). As a result, kprobe-events always refuses if
user defines a probe on a "currenly unloaded module".
Fixes: commit 45408c4f92 ("tracing: kprobes: Prohibit probing on notrace function")
Link: http://lkml.kernel.org/r/153319624799.29007.13604430345640129926.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Return statements in functions returning bool should use true or false
instead of an integer value.
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20180802010056.GA31012@embeddedor.com
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The value of ring_buffer_record_is_set_on() is either true or false, so have
its return value be bool.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The value of ring_buffer_record_is_on() is either true or false, so have its
return value be bool.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There's code that expects tracer_tracing_is_on() to be either true or false,
not some random number. Currently, it should only return one or zero, but
just in case, change its return value to bool, to enforce it.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The start function of the hwlat tracer should never be called when the hwlat
thread already exists. If it is called, do a WARN_ON().
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The hwlat tracer uses a kernel thread to measure latencies. The function
that creates this kernel thread, start_kthread(), can be called when the
tracer is initialized and when the tracer is explicitly enabled.
start_kthread() does not check if there is an existing hwlat kernel
thread and will create a new one each time it is called.
This causes the reference to the previous thread to be lost. Without the
thread reference, the old kernel thread becomes unstoppable and
continues to use CPU time even after the hwlat tracer has been disabled.
This problem can be observed when a system is booted with tracing
enabled and the hwlat tracer is configured like this:
echo hwlat > current_tracer; echo 1 > tracing_on
Add the missing check for an existing kernel thread in start_kthread()
to prevent this problem. This function and the rest of the hwlat kernel
thread setup and teardown are already serialized because they are called
through the tracer core code with trace_type_lock held.
[
Note, this only fixes the symptom. The real fix was not to call
this function when tracing_on was already one. But this still makes
the code more robust, so we'll add it.
]
Link: http://lkml.kernel.org/r/1533120354-22923-1-git-send-email-erica.bugden@linutronix.de
Signed-off-by: Erica Bugden <erica.bugden@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, when one echo's in 1 into tracing_on, the current tracer's
"start()" function is executed, even if tracing_on was already one. This can
lead to strange side effects. One being that if the hwlat tracer is enabled,
and someone does "echo 1 > tracing_on" into tracing_on, the hwlat tracer's
start() function is called again which will recreate another kernel thread,
and make it unable to remove the old one.
Link: http://lkml.kernel.org/r/1533120354-22923-1-git-send-email-erica.bugden@linutronix.de
Cc: stable@vger.kernel.org
Fixes: 2df8f8a6a8 ("tracing: Fix regression with irqsoff tracer and tracing_on file")
Reported-by: Erica Bugden <erica.bugden@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This patch detaches the preemptirq tracepoints from the tracers and
keeps it separate.
Advantages:
* Lockdep and irqsoff event can now run in parallel since they no longer
have their own calls.
* This unifies the usecase of adding hooks to an irqsoff and irqson
event, and a preemptoff and preempton event.
3 users of the events exist:
- Lockdep
- irqsoff and preemptoff tracers
- irqs and preempt trace events
The unification cleans up several ifdefs and makes the code in preempt
tracer and irqsoff tracers simpler. It gets rid of all the horrific
ifdeferry around PROVE_LOCKING and makes configuration of the different
users of the tracepoints more easy and understandable. It also gets rid
of the time_* function calls from the lockdep hooks used to call into
the preemptirq tracer which is not needed anymore. The negative delta in
lines of code in this patch is quite large too.
In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
as a single point for registering probes onto the tracepoints. With
this,
the web of config options for preempt/irq toggle tracepoints and its
users becomes:
PREEMPT_TRACER PREEMPTIRQ_EVENTS IRQSOFF_TRACER PROVE_LOCKING
| | \ | |
\ (selects) / \ \ (selects) /
TRACE_PREEMPT_TOGGLE ----> TRACE_IRQFLAGS
\ /
\ (depends on) /
PREEMPTIRQ_TRACEPOINTS
Other than the performance tests mentioned in the previous patch, I also
ran the locking API test suite. I verified that all tests cases are
passing.
I also injected issues by not registering lockdep probes onto the
tracepoints and I see failures to confirm that the probes are indeed
working.
This series + lockdep probes not registered (just to inject errors):
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok |
[ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |
With this series + lockdep probes registered, all locking tests pass:
[ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok |
[ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok |
[ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok |
[ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
[ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |
Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Move selftest function to its own compile unit so it can be compiled
with the ftrace cflags (CC_FLAGS_FTRACE) allowing it to be probed
during the ftrace startup tests.
Link: http://lkml.kernel.org/r/153294604271.32740.16490677128630177030.stgit@devbox
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Prohibit kprobe-events probing on notrace functions. Since probing on a
notrace function can cause a recursive event call. In most cases those are just
skipped, but in some cases it falls into an infinite recursive call.
This protection can be disabled by the kconfig
CONFIG_KPROBE_EVENTS_ON_NOTRACE=y, but it is highly recommended to keep it
"n" for normal kernel builds. Note that this is only available if "kprobes on
ftrace" has been implemented on the target arch and CONFIG_KPROBES_ON_FTRACE=y.
Link: http://lkml.kernel.org/r/153294601436.32740.10557881188933661239.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Francis Deslauriers <francis.deslauriers@efficios.com>
[ Slight grammar and spelling fixes ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Automatically found by kbuild test robot.
Fixes: ffdc73a3b2ad ("lib: Add module for testing preemptoff/irqsoff latency tracers")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The function enable_trace_kprobe() performs slightly differently if the file
parameter is passed in as NULL on non-NULL. Instead of checking file twice,
move the code between the two tests into a static inline helper function to
make the code easier to follow.
Link: http://lkml.kernel.org/r/20180725224728.7b1d5db2@vmware.local.home
Link: http://lkml.kernel.org/r/20180726121152.4dd54330@gandalf.local.home
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The name of the directory for per-cpu function statistics
is trace_stat, not trace_stats.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Remove ftrace_nr_registered_ops() because it is no longer used.
ftrace_nr_registered_ops() has been introduced by commit ea701f11da
("ftrace: Add selftest to test function trace recursion protection"), but
its caller has been removed by commit 05cbbf643b ("tracing: Fix selftest
function recursion accounting"). So it is not called anymore.
Link: http://lkml.kernel.org/r/153260907227.12474.5234899025934963683.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Remove using_ftrace_ops_list_func() since it is no longer used.
Using ftrace_ops_list_func() has been introduced by commit 7eea4fce02
("tracing/stack_trace: Skip 4 instead of 3 when using ftrace_ops_list_func")
as a helper function, but its caller has been removed by commit 72ac426a5b
("tracing: Clean up stack tracing and fix fentry updates"). So it is not
called anymore.
Link: http://lkml.kernel.org/r/153260904427.12474.9952096317439329851.stgit@devbox
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Nothing uses unregister_trigger() outside of trace_events_trigger.c file,
thus it should be static. Not sure why this was ever converted, because
its counter part, register_trigger(), was always static.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Here we introduce a test module for introducing a long preempt or irq
disable delay in the kernel which the preemptoff or irqsoff tracers can
detect. This module is to be used only for test purposes and is default
disabled.
Following is the expected output (only briefly shown) that can be parsed
to verify that the tracers are working correctly. We will use this from
the kselftests in future patches.
For the preemptoff tracer:
echo preemptoff > /d/tracing/current_tracer
sleep 1
insmod ./preemptirq_delay_test.ko test_mode=preempt delay=500000
sleep 1
bash-4.3# cat /d/tracing/trace
preempt -1066 2...2 0us@: preemptirq_delay_run <-preemptirq_delay_run
preempt -1066 2...2 500002us : preemptirq_delay_run <-preemptirq_delay_run
preempt -1066 2...2 500004us : tracer_preempt_on <-preemptirq_delay_run
preempt -1066 2...2 500012us : <stack trace>
=> kthread
=> ret_from_fork
For the irqsoff tracer:
echo irqsoff > /d/tracing/current_tracer
sleep 1
insmod ./preemptirq_delay_test.ko test_mode=irq delay=500000
sleep 1
bash-4.3# cat /d/tracing/trace
irq dis -1069 1d..1 0us@: preemptirq_delay_run
irq dis -1069 1d..1 500001us : preemptirq_delay_run
irq dis -1069 1d..1 500002us : tracer_hardirqs_on <-preemptirq_delay_run
irq dis -1069 1d..1 500005us : <stack trace>
=> ret_from_fork
Link: http://lkml.kernel.org/r/20180712213611.GA8743@joelaf.mtv.corp.google.com
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Julia Cartwright <julia@ni.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Glexiner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[ Erick is a co-developer of this commit ]
Signed-off-by: Erick Reyes <erickreyes@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Split reset functions into seperate functions in preparation
of future patches that need to do tracer specific reset.
Link: http://lkml.kernel.org/r/20180628182149.226164-4-joel@joelfernandes.org
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit 57ea2a34ad ("tracing/kprobes: Fix trace_probe flags on
enable_trace_kprobe() failure") added an if statement that depends on another
if statement that gcc doesn't see will initialize the "link" variable and
gives the warning:
"warning: 'link' may be used uninitialized in this function"
It is really a false positive, but to quiet the warning, and also to make
sure that it never actually is used uninitialized, initialize the "link"
variable to NULL and add an if (!WARN_ON_ONCE(!link)) where the compiler
thinks it could be used uninitialized.
Cc: stable@vger.kernel.org
Fixes: 57ea2a34ad ("tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There was a case that triggered a double free in event_trigger_callback()
due to the called reg() function freeing the trigger_data and then it
getting freed again by the error return by the caller. The solution there
was to up the trigger_data ref count.
Code inspection found that event_enable_trigger_func() has the same issue,
but is not as easy to trigger (requires harder to trigger failures). It
needs to be solved slightly different as it needs more to clean up when the
reg() function fails.
Link: http://lkml.kernel.org/r/20180725124008.7008e586@gandalf.local.home
Cc: stable@vger.kernel.org
Fixes: 7862ad1846 ("tracing: Add 'enable_event' and 'disable_event' event trigger commands")
Reivewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If enable_trace_kprobe fails to enable the probe in enable_k(ret)probe
it returns an error, but does not unset the tp flags it set previously.
This results in a probe being considered enabled and failures like being
unable to remove the probe through kprobe_events file since probes_open()
expects every probe to be disabled.
Link: http://lkml.kernel.org/r/20180725102826.8300-1-asavkov@redhat.com
Link: http://lkml.kernel.org/r/20180725142038.4765-1-asavkov@redhat.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 41a7dd420c ("tracing/kprobes: Support ftrace_event_file base multibuffer")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Maintain the tracing on/off setting of the ring_buffer when switching
to the trace buffer snapshot.
Taking a snapshot is done by swapping the backup ring buffer
(max_tr_buffer). But since the tracing on/off setting is defined
by the ring buffer, when swapping it, the tracing on/off setting
can also be changed. This causes a strange result like below:
/sys/kernel/debug/tracing # cat tracing_on
1
/sys/kernel/debug/tracing # echo 0 > tracing_on
/sys/kernel/debug/tracing # cat tracing_on
0
/sys/kernel/debug/tracing # echo 1 > snapshot
/sys/kernel/debug/tracing # cat tracing_on
1
/sys/kernel/debug/tracing # echo 1 > snapshot
/sys/kernel/debug/tracing # cat tracing_on
0
We don't touch tracing_on, but snapshot changes tracing_on
setting each time. This is an anomaly, because user doesn't know
that each "ring_buffer" stores its own tracing-enable state and
the snapshot is done by swapping ring buffers.
Link: http://lkml.kernel.org/r/153149929558.11274.11730609978254724394.stgit@devbox
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Hiraku Toyooka <hiraku.toyooka@cybertrust.co.jp>
Cc: stable@vger.kernel.org
Fixes: debdd57f51 ("tracing: Make a snapshot feature available from userspace")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ Updated commit log and comment in the code ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Running the following:
# cd /sys/kernel/debug/tracing
# echo 500000 > buffer_size_kb
[ Or some other number that takes up most of memory ]
# echo snapshot > events/sched/sched_switch/trigger
Triggers the following bug:
------------[ cut here ]------------
kernel BUG at mm/slub.c:296!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
CPU: 6 PID: 6878 Comm: bash Not tainted 4.18.0-rc6-test+ #1066
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
RIP: 0010:kfree+0x16c/0x180
Code: 05 41 0f b6 72 51 5b 5d 41 5c 4c 89 d7 e9 ac b3 f8 ff 48 89 d9 48 89 da 41 b8 01 00 00 00 5b 5d 41 5c 4c 89 d6 e9 f4 f3 ff ff <0f> 0b 0f 0b 48 8b 3d d9 d8 f9 00 e9 c1 fe ff ff 0f 1f 40 00 0f 1f
RSP: 0018:ffffb654436d3d88 EFLAGS: 00010246
RAX: ffff91a9d50f3d80 RBX: ffff91a9d50f3d80 RCX: ffff91a9d50f3d80
RDX: 00000000000006a4 RSI: ffff91a9de5a60e0 RDI: ffff91a9d9803500
RBP: ffffffff8d267c80 R08: 00000000000260e0 R09: ffffffff8c1a56be
R10: fffff0d404543cc0 R11: 0000000000000389 R12: ffffffff8c1a56be
R13: ffff91a9d9930e18 R14: ffff91a98c0c2890 R15: ffffffff8d267d00
FS: 00007f363ea64700(0000) GS:ffff91a9de580000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055c1cacc8e10 CR3: 00000000d9b46003 CR4: 00000000001606e0
Call Trace:
event_trigger_callback+0xee/0x1d0
event_trigger_write+0xfc/0x1a0
__vfs_write+0x33/0x190
? handle_mm_fault+0x115/0x230
? _cond_resched+0x16/0x40
vfs_write+0xb0/0x190
ksys_write+0x52/0xc0
do_syscall_64+0x5a/0x160
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f363e16ab50
Code: 73 01 c3 48 8b 0d 38 83 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 79 db 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 1e e3 01 00 48 89 04 24
RSP: 002b:00007fff9a4c6378 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f363e16ab50
RDX: 0000000000000009 RSI: 000055c1cacc8e10 RDI: 0000000000000001
RBP: 000055c1cacc8e10 R08: 00007f363e435740 R09: 00007f363ea64700
R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000009
R13: 0000000000000001 R14: 00007f363e4345e0 R15: 00007f363e4303c0
Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device i915 snd_pcm snd_timer i2c_i801 snd soundcore i2c_algo_bit drm_kms_helper
86_pkg_temp_thermal video kvm_intel kvm irqbypass wmi e1000e
---[ end trace d301afa879ddfa25 ]---
The cause is because the register_snapshot_trigger() call failed to
allocate the snapshot buffer, and then called unregister_trigger()
which freed the data that was passed to it. Then on return to the
function that called register_snapshot_trigger(), as it sees it
failed to register, it frees the trigger_data again and causes
a double free.
By calling event_trigger_init() on the trigger_data (which only ups
the reference counter for it), and then event_trigger_free() afterward,
the trigger_data would not get freed by the registering trigger function
as it would only up and lower the ref count for it. If the register
trigger function fails, then the event_trigger_free() called after it
will free the trigger data normally.
Link: http://lkml.kernel.org/r/20180724191331.738eb819@gandalf.local.home
Cc: stable@vger.kerne.org
Fixes: 93e31ffbf4 ("tracing: Add 'snapshot' event trigger command")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently ftrace displays data in trace output like so:
_-----=> irqs-off
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / delay
TASK-PID CPU TGID |||| TIMESTAMP FUNCTION
| | | | |||| | |
bash-1091 [000] ( 1091) d..2 28.313544: sched_switch:
However Android's trace visualization tools expect a slightly different
format due to an out-of-tree patch patch that was been carried for a
decade, notice that the TGID and CPU fields are reversed:
_-----=> irqs-off
/ _----=> need-resched
| / _---=> hardirq/softirq
|| / _--=> preempt-depth
||| / delay
TASK-PID TGID CPU |||| TIMESTAMP FUNCTION
| | | | |||| | |
bash-1091 ( 1091) [002] d..2 64.965177: sched_switch:
From kernel v4.13 onwards, during which TGID was introduced, tracing
with systrace on all Android kernels will break (most Android kernels
have been on 4.9 with Android patches, so this issues hasn't been seen
yet). From v4.13 onwards things will break.
The chrome browser's tracing tools also embed the systrace viewer which
uses the legacy TGID format and updates to that are known to be
difficult to make.
Considering this, I suggest we make this change to the upstream kernel
and backport it to all Android kernels. I believe this feature is merged
recently enough into the upstream kernel that it shouldn't be a problem.
Also logically, IMO it makes more sense to group the TGID with the
TASK-PID and the CPU after these.
Link: http://lkml.kernel.org/r/20180626000822.113931-1-joel@joelfernandes.org
Cc: jreck@google.com
Cc: tkjos@google.com
Cc: stable@vger.kernel.org
Fixes: 441dae8f2f ("tracing: Add support for display of tgid in trace output")
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
We don't release tk->tp.call.print_fmt when destroying
local uprobe. Also there's missing print_fmt kfree in
create_local_trace_kprobe error path.
Link: http://lkml.kernel.org/r/20180709141906.2390-1-jolsa@kernel.org
Cc: stable@vger.kernel.org
Fixes: e12f03d703 ("perf/core: Implement the 'perf_kprobe' PMU")
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since @blk_debugfs_root couldn't be configured dynamically, we can
save a few memory allocation if it's not there.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The commit 719f6a7040 ("printk: Use the main logbuf in NMI
when logbuf_lock is available") brought back the possible deadlocks
in printk() and NMI.
The check of logbuf_lock is done only in printk_nmi_enter() to prevent
mixed output. But another CPU might take the lock later, enter NMI, and:
+ Both NMIs might be serialized by yet another lock, for example,
the one in nmi_cpu_backtrace().
+ The other CPU might get stopped in NMI, see smp_send_stop()
in panic().
The only safe solution is to use trylock when storing the message
into the main log-buffer. It might cause reordering when some lines
go to the main lock buffer directly and others are delayed via
the per-CPU buffer. It means that it is not useful in general.
This patch replaces the problematic NMI deferred context with NMI
direct context. It can be used to mark a code that might produce
many messages in NMI and the risk of losing them is more critical
than problems with eventual reordering.
The context is then used when dumping trace buffers on oops. It was
the primary motivation for the original fix. Also the reordering is
even smaller issue there because some traces have their own time stamps.
Finally, nmi_cpu_backtrace() need not longer be serialized because
it will always us the per-CPU buffers again.
Fixes: 719f6a7040 ("printk: Use the main logbuf in NMI when logbuf_lock is available")
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20180627142028.11259-1-pmladek@suse.com
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
clear_ftrace_function is not used outside of ftrace.c and is not help to
use a function, so nuke it per Steve's suggestion.
Link: http://lkml.kernel.org/r/1517537689-34947-1-git-send-email-xieyisheng1@huawei.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Silence warnings (triggered at W=1) by adding relevant __printf attributes.
CC kernel/trace/trace.o
kernel/trace/trace.c: In function ‘__trace_array_vprintk’:
kernel/trace/trace.c:2979:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
len = vscnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args);
^~~
AR kernel/trace/built-in.o
Link: http://lkml.kernel.org/r/20180308205843.27447-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Simplify and optimize the logic in trace_buffer_iter() to use a conditional
operation instead of an if conditional.
Link: http://lkml.kernel.org/r/20180408113631.3947-1-cugyly@163.com
Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The comment in create_filter() states that the passed in filter pointer
(filterp) will either be NULL or contain an error message stating why the
filter failed. But it also expects the filter pointer to point to NULL when
passed in. If it is not, the function create_filter_start() will warn and
return an error message without updating the filter pointer. This is not
what the comment states.
As we always expect the pointer to point to NULL, if it is not, trigger a
WARN_ON(), set it to NULL, and then continue the path as the rest will work
as the comment states. Also update the comment to state it must point to
NULL.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
'err' is used as a NUL-terminated string, but using strncpy() with the length
equal to the buffer size may result in lack of the termination:
kernel/trace/trace_events_hist.c: In function 'hist_err_event':
kernel/trace/trace_events_hist.c:396:3: error: 'strncpy' specified bound 256 equals destination size [-Werror=stringop-truncation]
strncpy(err, var, MAX_FILTER_STR_VAL);
This changes it to use the safer strscpy() instead.
Link: http://lkml.kernel.org/r/20180328140920.2842153-1-arnd@arndb.de
Cc: stable@vger.kernel.org
Fixes: f404da6e1d ("tracing: Add 'last error' error facility for hist triggers")
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Despite being mentioned at Documentation/trace/ftrace.rst as a
rst file, this file was still a text one, with several issues.
Convert it to ReST and add it to the trace index:
- Mark the document title as such;
- Identify and indent the literal blocks;
- Use the proper markups for table.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Make use of the swap macro and remove unnecessary variable _buf_.
This makes the code easier to read and maintain. Also, reduces the
stack usage.
This code was detected with the help of Coccinelle.
Link: http://lkml.kernel.org/r/20180209175316.GA18720@embeddedgus
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The syzkaller detected a out-of-bounds issue with the events filter code,
specifically here:
prog[N].pred = NULL; /* #13 */
prog[N].target = 1; /* TRUE */
prog[N+1].pred = NULL;
prog[N+1].target = 0; /* FALSE */
-> prog[N-1].target = N;
prog[N-1].when_to_branch = false;
As that's the first reference to a "N-1" index, it appears that the code got
here with N = 0, which means the filter parser found no filter to parse
(which shouldn't ever happen, but apparently it did).
Add a new error to the parsing code that will check to make sure that N is
not zero before going into this part of the code. If N = 0, then -EINVAL is
returned, and a error message is added to the filter.
Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: air icy <icytxw@gmail.com>
bugzilla url: https://bugzilla.kernel.org/show_bug.cgi?id=200019
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Clear current_kprobe and enable preemption in kprobe
even if pre_handler returns !0.
This simplifies function override using kprobes.
Jprobe used to require to keep the preemption disabled and
keep current_kprobe until it returned to original function
entry. For this reason kprobe_int3_handler() and similar
arch dependent kprobe handers checks pre_handler result
and exit without enabling preemption if the result is !0.
After removing the jprobe, Kprobes does not need to
keep preempt disabled even if user handler returns !0
anymore.
But since the function override handler in error-inject
and bpf is also returns !0 if it overrides a function,
to balancing the preempt count, it enables preemption
and reset current kprobe by itself.
That is a bad design that is very buggy. This fixes
such unbalanced preempt-count and current_kprobes setting
in kprobes, bpf and error-inject.
Note: for powerpc and x86, this removes all preempt_disable
from kprobe_ftrace_handler because ftrace callbacks are
called under preempt disabled.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: https://lore.kernel.org/lkml/152942494574.15209.12323837825873032258.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As we move stuff around, some doc references are broken. Fix some of
them via this script:
./scripts/documentation-file-ref-check --fix
Manually checked if the produced result is valid, removing a few
false-positives.
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
patching again.
- Add support for patching barrier_nospec in copy_from_user() and syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.
- Several series optimising various parts of the 32-bit code from Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"),
which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.
Thanks to:
Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
Yisheng Xie, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJbGQKBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBq
TRAAioK7rz5xYMkxaM3Ng3ybobEeNAwQqOolz98xvmnB9SfDWNuc99vf8cGu0/fQ
zc8AKZ5RcnwipOjyGlxW9oa1ZhVq0xtYnQPiYLEKMdLQmh5D+C7+KpvAd1UElweg
ub40/xDySWfMujfuMSF9JDCWPIXyojt4Xg5nJKIVRrAm/3YMe/+i5Am7NWHuMCEb
aQmZtlYW5Mz81XY0968hjpUO6eKFRmsaM7yFAhGTXx6+oLRpGj1PZB4AwdRIKS2L
Ak7q/VgxtE4W+s3a0GK2s+eXIhGKeFuX9AVnx3nti+8/K1OqrqhDcLMUC/9JpCpv
EvOtO7dxPnZujHjdu4Eai/xNoo4h6zRy7bWqve9LoBM40CP5jljKzu1lwqqb5yO0
jC7/aXhgiSIxxcRJLjoI/TYpZPu40MifrkydmczykdPyPCnMIWEJDcj4KsRL/9Y8
9SSbJzRNC/SgQNTbUYPZFFi6G0QaMmlcbCb628k8QT+Gn3Xkdf/ZtxzqEyoF4Irq
46kFBsiSSK4Bu0rVlcUtJQLgdqytWULO6NKEYnD67laxYcgQd8pGFQ8SjZhRZLgU
q5LA3HIWhoAI4M0wZhOnKXO6JfiQ1UbO8gUJLsWsfF0Fk5KAcdm+4kb4jbI1H4Qk
Vol9WNRZwEllyaiqScZN9RuVVuH0GPOZeEH1dtWK+uWi0lM=
=ZlBf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
live patching again.
- Add support for patching barrier_nospec in copy_from_user() and
syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu
Malaterre.
- Several series optimising various parts of the 32-bit code from
Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K"
("GEFanuc,C2K"), which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by
Steve, and a fix to our powernv cpuidle driver. Then there's a series
touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
been in next for several weeks.
Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"
* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
cpuidle: powernv: Fix promotion from snooze if next state disabled
powerpc: fix build failure by disabling attribute-alias warning in pci_32
ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
powerpc/pkeys: Detach execute_only key on !PROT_EXEC
powerpc/powernv: copy/paste - Mask SO bit in CR
powerpc: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove support for Marvell mv64x60 i2c controller
powerpc/boot: Remove support for Marvell MPSC serial controller
powerpc/embedded6xx: Remove C2K board support
powerpc/lib: optimise PPC32 memcmp
powerpc/lib: optimise 32 bits __clear_user()
powerpc/time: inline arch_vtime_task_switch()
powerpc/Makefile: set -mcpu=860 flag for the 8xx
powerpc: Implement csum_ipv6_magic in assembly
powerpc/32: Optimise __csum_partial()
powerpc/lib: Adjust .balign inside string functions for PPC32
...
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
triggers. For example:
# cd /sys/kernel/debug/tracing
# echo 'snapshot' > events/ftrace/print/trigger
# echo 'cause snapshot' > trace_marker
The rest of the changes are various clean ups and also one stable fix that
was added late in the cycle.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWxftjBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qosyAQCgmhb+/HgRYIbsKa8Q70E2pMx5A/za
4rQ37Q9PhDUJlAD/aWXfrqcMm+buAU0yY4K+eb+f2feKXWXxdZ/fuh7+nA4=
=lpXM
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"One new feature was added to ftrace, which is the trace_marker now
supports triggers. For example:
# cd /sys/kernel/debug/tracing
# echo 'snapshot' > events/ftrace/print/trigger
# echo 'cause snapshot' > trace_marker
The rest of the changes are various clean ups and also one stable fix
that was added late in the cycle"
* tag 'trace-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (21 commits)
tracing: Use match_string() instead of open coding it in trace_set_options()
branch-check: fix long->int truncation when profiling branches
ring-buffer: Fix typo in comment
ring-buffer: Fix a bunch of typos in comments
tracing/selftest: Add test to test simple snapshot trigger for trace_marker
tracing/selftest: Add test to test hist trigger between kernel event and trace_marker
tracing/selftest: Add selftests to test trace_marker histogram triggers
ftrace/selftest: Fix reset_trigger() to handle triggers with filters
ftrace/selftest: Have the reset_trigger code be a bit more careful
tracing: Document trace_marker triggers
tracing: Allow histogram triggers to access ftrace internal events
tracing: Prevent further users of zero size static arrays in trace events
tracing: Have zero size length in filter logic be full string
tracing: Add trigger file for trace_markers tracefs/ftrace/print
tracing: Do not show filter file for ftrace internal events
tracing: Add brackets in ftrace event dynamic arrays
tracing: Have event_trace_init() called by trace_init_tracefs()
tracing: Add __find_event_file() to find event files without restrictions
tracing: Do not reference event data in post call triggers
tracepoints: Fix the descriptions of tracepoint_probe_register{_prio}
...
match_string() returns the index of an array for a matching string,
which can be used to simplify the code.
Link: http://lkml.kernel.org/r/1526546163-4609-1-git-send-email-xieyisheng1@huawei.com
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-06-05
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add a new BPF hook for sendmsg similar to existing hooks for bind and
connect: "This allows to override source IP (including the case when it's
set via cmsg(3)) and destination IP:port for unconnected UDP (slow path).
TCP and connected UDP (fast path) are not affected. This makes UDP support
complete, that is, connected UDP is handled by connect hooks, unconnected
by sendmsg ones.", from Andrey.
2) Rework of the AF_XDP API to allow extending it in future for type writer
model if necessary. In this mode a memory window is passed to hardware
and multiple frames might be filled into that window instead of just one
that is the case in the current fixed frame-size model. With the new
changes made this can be supported without having to add a new descriptor
format. Also, core bits for the zero-copy support for AF_XDP have been
merged as agreed upon, where i40e bits will be routed via Jeff later on.
Various improvements to documentation and sample programs included as
well, all from Björn and Magnus.
3) Given BPF's flexibility, a new program type has been added to implement
infrared decoders. Quote: "The kernel IR decoders support the most
widely used IR protocols, but there are many protocols which are not
supported. [...] There is a 'long tail' of unsupported IR protocols,
for which lircd is need to decode the IR. IR encoding is done in such
a way that some simple circuit can decode it; therefore, BPF is ideal.
[...] user-space can define a decoder in BPF, attach it to the rc
device through the lirc chardev.", from Sean.
4) Several improvements and fixes to BPF core, among others, dumping map
and prog IDs into fdinfo which is a straight forward way to correlate
BPF objects used by applications, removing an indirect call and therefore
retpoline in all map lookup/update/delete calls by invoking the callback
directly for 64 bit archs, adding a new bpf_skb_cgroup_id() BPF helper
for tc BPF programs to have an efficient way of looking up cgroup v2 id
for policy or other use cases. Fixes to make sure we zero tunnel/xfrm
state that hasn't been filled, to allow context access wrt pt_regs in
32 bit archs for tracing, and last but not least various test cases
for fixes that landed in bpf earlier, from Daniel.
5) Get rid of the ndo_xdp_flush API and extend the ndo_xdp_xmit with
a XDP_XMIT_FLUSH flag instead which allows to avoid one indirect
call as flushing is now merged directly into ndo_xdp_xmit(), from Jesper.
6) Add a new bpf_get_current_cgroup_id() helper that can be used in
tracing to retrieve the cgroup id from the current process in order
to allow for e.g. aggregation of container-level events, from Yonghong.
7) Two follow-up fixes for BTF to reject invalid input values and
related to that also two test cases for BPF kselftests, from Martin.
8) Various API improvements to the bpf_fib_lookup() helper, that is,
dropping MPLS bits which are not fully hashed out yet, rejecting
invalid helper flags, returning error for unsupported address
families as well as renaming flowlabel to flowinfo, from David.
9) Various fixes and improvements to sockmap BPF kselftests in particular
in proper error detection and data verification, from Prashant.
10) Two arm32 BPF JIT improvements. One is to fix imm range check with
regards to whether immediate fits into 24 bits, and a naming cleanup
to get functions related to rsh handling consistent to those handling
lsh, from Wang.
11) Two compile warning fixes in BPF, one for BTF and a false positive
to silent gcc in stack_map_get_build_id_offset(), from Arnd.
12) Add missing seg6.h header into tools include infrastructure in order
to fix compilation of BPF kselftests, from Mathieu.
13) Several formatting cleanups in the BPF UAPI helper description that
also fix an error during rst2man compilation, from Quentin.
14) Hide an unused variable in sk_msg_convert_ctx_access() when IPv6 is
not built into the kernel, from Yue.
15) Remove a useless double assignment in dev_map_enqueue(), from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull RCU updates from Ingo Molnar:
- updates to the handling of expedited grace periods
- updates to reduce lock contention in the rcu_node combining tree
[ These are in preparation for the consolidation of RCU-bh,
RCU-preempt, and RCU-sched into a single flavor, which was
requested by Linus in response to a security flaw whose root cause
included confusion between the multiple flavors of RCU ]
- torture-test updates that save their users some time and effort
- miscellaneous fixes
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
rcu/x86: Provide early rcu_cpu_starting() callback
torture: Make kvm-find-errors.sh find build warnings
rcutorture: Abbreviate kvm.sh summary lines
rcutorture: Print end-of-test state in kvm.sh summary
rcutorture: Print end-of-test state
torture: Fold parse-torture.sh into parse-console.sh
torture: Add a script to edit output from failed runs
rcu: Update list of rcu_future_grace_period() trace events
rcu: Drop early GP request check from rcu_gp_kthread()
rcu: Simplify and inline cpu_needs_another_gp()
rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp()
rcu: Make rcu_start_this_gp() check for out-of-range requests
rcu: Add funnel locking to rcu_start_this_gp()
rcu: Make rcu_start_future_gp() caller select grace period
rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp()
rcu: Clear request other than RCU_GP_FLAG_INIT at GP end
rcu: Cleanup, don't put ->completed into an int
rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs()
rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition
rcu: Make rcu_migrate_callbacks wake GP kthread when needed
...
An anonymous source sent me a bunch of typo fixes in the comments of
ring_buffer.c file. That source did not want to be associated to this patch
because they don't want to be known as "one of those" commiters (you know who
you are!). They gave me permission to sign this off in my own name.
Suggested-by: One-of-those-commiters@YouKnowWhoYouAre.org
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Commit bf6fa2c893 ("bpf: implement bpf_get_current_cgroup_id()
helper") introduced a new helper bpf_get_current_cgroup_id().
The helper has a dependency on CONFIG_CGROUPS.
When CONFIG_CGROUPS is not defined, using the helper will result
the following verifier error:
kernel subsystem misconfigured func bpf_get_current_cgroup_id#80
which is hard for users to interpret.
Guarding the reference to bpf_get_current_cgroup_id_proto with
CONFIG_CGROUPS will result in below better message:
unknown func bpf_get_current_cgroup_id#80
Fixes: bf6fa2c893 ("bpf: implement bpf_get_current_cgroup_id() helper")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
bpf has been used extensively for tracing. For example, bcc
contains an almost full set of bpf-based tools to trace kernel
and user functions/events. Most tracing tools are currently
either filtered based on pid or system-wide.
Containers have been used quite extensively in industry and
cgroup is often used together to provide resource isolation
and protection. Several processes may run inside the same
container. It is often desirable to get container-level tracing
results as well, e.g. syscall count, function count, I/O
activity, etc.
This patch implements a new helper, bpf_get_current_cgroup_id(),
which will return cgroup id based on the cgroup within which
the current task is running.
The later patch will provide an example to show that
userspace can get the same cgroup id so it could
configure a filter or policy in the bpf program based on
task cgroup id.
The helper is currently implemented for tracing. It can
be added to other program types as well when needed.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
program type in test_verifier report the following errors on x86_32:
172/p unpriv: spill/fill of different pointers ldx FAIL
Unexpected error message!
0: (bf) r6 = r10
1: (07) r6 += -8
2: (15) if r1 == 0x0 goto pc+3
R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
3: (bf) r2 = r10
4: (07) r2 += -76
5: (7b) *(u64 *)(r6 +0) = r2
6: (55) if r1 != 0x0 goto pc+1
R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp
7: (7b) *(u64 *)(r6 +0) = r1
8: (79) r1 = *(u64 *)(r6 +0)
9: (79) r1 = *(u64 *)(r1 +68)
invalid bpf_context access off=68 size=8
378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
Failed to load prog 'Permission denied'!
0: (b7) r0 = 0
1: (71) r0 = *(u8 *)(r1 +68)
invalid bpf_context access off=68 size=1
379/p check bpf_perf_event_data->sample_period half load permitted FAIL
Failed to load prog 'Permission denied'!
0: (b7) r0 = 0
1: (69) r0 = *(u16 *)(r1 +68)
invalid bpf_context access off=68 size=2
380/p check bpf_perf_event_data->sample_period word load permitted FAIL
Failed to load prog 'Permission denied'!
0: (b7) r0 = 0
1: (61) r0 = *(u32 *)(r1 +68)
invalid bpf_context access off=68 size=4
381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
Failed to load prog 'Permission denied'!
0: (b7) r0 = 0
1: (79) r0 = *(u64 *)(r1 +68)
invalid bpf_context access off=68 size=8
Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte
boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok()
will then bail out saying that off & (size_default - 1) which is 68 & 7
doesn't cleanly align in the case of sample_period access from struct
bpf_perf_event_data, hence verifier wrongly thinks we might be doing an
unaligned access here though underlying arch can handle it just fine.
Therefore adjust this down to machine size and check and rewrite the
offset for narrow access on that basis. We also need to fix corresponding
pe_prog_is_valid_access(), since we hit the check for off % size != 0
(e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs
for tracing work on x86_32.
Reported-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Filling in the padding slot in the bpf structure as a bug fix in 'ne'
overlapped with actually using that padding area for something in
'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes is it possible for bpf prog detach to return -ENOENT.
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Now that trace_marker can have triggers, including a histogram triggers, the
onmatch() and onmax() access the trace event. To do so, the search routine
to find the event file needs to use the raw __find_event_file() that does
not filter out ftrace events.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As strings in trace events may not have a nul terminating character, the
filter string compares use the defined string length for the field for the
compares.
The trace_marker records data slightly different than do normal events. It's
size is zero, meaning that the string is the rest of the array, and that the
string also ends with '\0'.
If the size is zero, assume that the string is nul terminated and read the
string in question as is.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Allow writing to the trace_markers file initiate triggers defined in
tracefs/ftrace/print/trigger file. This will allow of user space to trigger
the same type of triggers (including histograms) that the trace events use.
Had to create a ftrace_event_register() function that will become the
trace_marker print event's reg() function. This is required because of how
triggers are enabled:
event_trigger_write() {
event_trigger_regex_write() {
trigger_process_regex() {
for p in trigger_commands {
p->func(); /* trigger_snapshot_cmd->func */
event_trigger_callback() {
cmd_ops->reg() /* register_trigger() */ {
trace_event_trigger_enable_disable() {
trace_event_enable_disable() {
call->class->reg();
Without the reg() function, the trigger code will call a NULL pointer and
crash the system.
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Karim Yaghmour <karim.yaghmour@opersys.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Suggested-by: Joel Fernandes <joelaf@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The filter file in the ftrace internal events, like in
/sys/kernel/tracing/events/ftrace/function/filter is not attached to any
functionality. Do not create them as they are meaningless.
In the future, if an ftrace internal event gets filter functionality, then
it will need to create it directly.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The dynamic arrays defined for ftrace internal events, such as the buf field
for trace_marker (ftrace/print) did not have brackets which makes the filter
code not accept it as a string. This is not currently an issues because the
filter code doesn't do anything for these events, but they will in the
future, and this needs to be fixed for when it does.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of having both trace_init_tracefs() and event_trace_init() be called
by fs_initcall() routines, have event_trace_init() called directly by
trace_init_tracefs(). This will guarantee order of how the events are
created with respect to the rest of the ftrace infrastructure. This is
needed to be able to assoctiate event files with ftrace internal events,
such as the trace_marker.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
By adding the function __find_event_file() that can search for files without
restrictions, such as if the event associated with the file has a reg
function, or if it has the "ignore" flag set, the files that are associated
to ftrace internal events (like trace_marker and function events) can be
found and used.
find_event_file() still returns a "filtered" file, as most callers need a
valid trace event file. One created by the trace_events.h macros and not one
created for parsing ftrace specific events.
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Trace event triggers can be called before or after the event has been
committed. If it has been called after the commit, there's a possibility
that the event no longer exists. Currently, the two post callers is the
trigger to disable tracing (traceoff) and the one that will record a stack
dump (stacktrace). Neither of them reference the trace event entry record,
as that would lead to a race condition that could pass in corrupted data.
To prevent any other users of the post data triggers from using the trace
event record, pass in NULL to the post call trigger functions for the event
record, as they should never need to use them in the first place.
This does not fix any bug, but prevents bugs from happening by new post call
trigger users.
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The snapshot trigger currently only affects the main ring buffer, even when
it is used by the instances. This can be confusing as the snapshot trigger
is listed in the instance.
> # cd /sys/kernel/tracing
> # mkdir instances/foo
> # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger
> # echo top buffer > trace_marker
> # echo foo buffer > instances/foo/trace_marker
> # touch /tmp/bar
> # chown rostedt /tmp/bar
> # cat instances/foo/snapshot
# tracer: nop
#
#
# * Snapshot is freed *
#
# Snapshot commands:
# echo 0 > snapshot : Clears and frees snapshot buffer
# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
# Takes a snapshot of the main buffer.
# echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
# (Doesn't have to be '2' works with any number that
# is not a '0' or '1')
> # cat snapshot
# tracer: nop
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-1189 [000] .... 111.488323: tracing_mark_write: top buffer
Not only did the snapshot occur in the top level buffer, but the instance
snapshot buffer should have been allocated, and it is still free.
Cc: stable@vger.kernel.org
Fixes: 85f2b08268 ("tracing: Add basic event trigger framework")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If a instance has an event trigger enabled when it is freed, it could cause
an access of free memory. Here's the case that crashes:
# cd /sys/kernel/tracing
# mkdir instances/foo
# echo snapshot > instances/foo/events/initcall/initcall_start/trigger
# rmdir instances/foo
Would produce:
general protection fault: 0000 [#1] PREEMPT SMP PTI
Modules linked in: tun bridge ...
CPU: 5 PID: 6203 Comm: rmdir Tainted: G W 4.17.0-rc4-test+ #933
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
RIP: 0010:clear_event_triggers+0x3b/0x70
RSP: 0018:ffffc90003783de0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0
RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016
R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0
R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c
FS: 00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0
Call Trace:
event_trace_del_tracer+0x2a/0xc5
instance_rmdir+0x15c/0x200
tracefs_syscall_rmdir+0x52/0x90
vfs_rmdir+0xdb/0x160
do_rmdir+0x16d/0x1c0
__x64_sys_rmdir+0x17/0x20
do_syscall_64+0x55/0x1a0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This was due to the call the clears out the triggers when an instance is
being deleted not removing the trigger from the link list.
Cc: stable@vger.kernel.org
Fixes: 85f2b08268 ("tracing: Add basic event trigger framework")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, suppose a userspace application has loaded a bpf program
and attached it to a tracepoint/kprobe/uprobe, and a bpf
introspection tool, e.g., bpftool, wants to show which bpf program
is attached to which tracepoint/kprobe/uprobe. Such attachment
information will be really useful to understand the overall bpf
deployment in the system.
There is a name field (16 bytes) for each program, which could
be used to encode the attachment point. There are some drawbacks
for this approaches. First, bpftool user (e.g., an admin) may not
really understand the association between the name and the
attachment point. Second, if one program is attached to multiple
places, encoding a proper name which can imply all these
attachments becomes difficult.
This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
Given a pid and fd, if the <pid, fd> is associated with a
tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
. prog_id
. tracepoint name, or
. k[ret]probe funcname + offset or kernel addr, or
. u[ret]probe filename + offset
to the userspace.
The user can use "bpftool prog" to find more information about
bpf program itself with prog_id.
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Updates to the handling of expedited grace periods, perhaps most
notably parallelizing their initialization. Other changes
include fixes from Boqun Feng.
- Miscellaneous fixes. These include an nvme fix from Nitzan Carmi
that I am carrying because it depends on a new SRCU function
cleanup_srcu_struct_quiesced(). This branch also includes fixes
from Byungchul Park and Yury Norov.
- Updates to reduce lock contention in the rcu_node combining tree.
These are in preparation for the consolidation of RCU-bh,
RCU-preempt, and RCU-sched into a single flavor, which was
requested by Linus Torvalds in response to a security flaw
whose root cause included confusion between the multiple flavors
of RCU.
- Torture-test updates that save their users some time and effort.
Conflicts:
drivers/nvme/host/core.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit e31d28b6ab ("trace: Eliminate cond_resched_rcu_qs() in favor
of cond_resched()") substituted cond_resched() for the earlier call
to cond_resched_rcu_qs(). However, the new-age cond_resched() does
not do anything to help RCU-tasks grace periods because (1) RCU-tasks
is only enabled when CONFIG_PREEMPT=y and (2) cond_resched() is a
complete no-op when preemption is enabled. This situation results
in hangs when running the trace benchmarks.
A number of potential fixes were discussed on LKML
(https://lkml.kernel.org/r/20180224151240.0d63a059@vmware.local.home),
including making cond_resched() not be a no-op; making cond_resched()
not be a no-op, but only when running tracing benchmarks; reverting
the aforementioned commit (which works because cond_resched_rcu_qs()
does provide an RCU-tasks quiescent state; and adding a call to the
scheduler/RCU rcu_note_voluntary_context_switch() function. All were
deemed unsatisfactory, either due to added cond_resched() overhead or
due to magic functions inviting cargo culting.
This commit renames cond_resched_rcu_qs() to cond_resched_tasks_rcu_qs(),
which provides a clear hint as to what this function is doing and
why and where it should be used, and then replaces the call to
cond_resched() with cond_resched_tasks_rcu_qs() in the trace benchmark's
benchmark_event_kthread() function.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
The bpf syscall and selftests conflicts were trivial
overlapping changes.
The r8169 change involved moving the added mdelay from 'net' into a
different function.
A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts. I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".
Signed-off-by: David S. Miller <davem@davemloft.net>
The regex match function regex_match_front() in the tracing filter logic,
was fixed to test just the pattern length from testing the entire test
string. That is, it went from strncmp(str, r->pattern, len) to
strcmp(str, r->pattern, r->len).
The issue is that str is not guaranteed to be nul terminated, and if r->len
is greater than the length of str, it can access more memory than is
allocated.
The solution is to add a simple test if (len < r->len) return 0.
Cc: stable@vger.kernel.org
Fixes: 285caad415 ("tracing/filters: Fix MATCH_FRONT_ONLY filter matching")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'. Thanks to Daniel for the merge resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
when they are writable by root. To fix the confusion, they should
be 0644. Note, either case root can still write to them.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWuyBchQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qmDLAQDyddL4DS480WXv3t3I/ZPwjHVuI4qS
cPUsAsjn3Xs9wAD+O6/rE8SL/Q2tUIWlWk9wC4YpGqEoR6R3x98qpnGP3gA=
=L/Kw
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Some of the files in the tracing directory show file mode 0444 when
they are writable by root. To fix the confusion, they should be 0644.
Note, either case root can still write to them.
Zhengyuan asked why I never applied that patch (the first one is from
2014!). I simply forgot about it. /me lowers head in shame"
* tag 'trace-v4.17-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix the file mode of stack tracer
ftrace: Have set_graph_* files have normal file modes
It looks weird that the stack_trace_filter file can be written by root
but shows that it does not have write permission by ll command.
Link: http://lkml.kernel.org/r/1518054113-28096-1-git-send-email-liuzhengyuan@kylinos.cn
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The set_graph_function and set_graph_notrace file mode should be 0644
instead of 0444 as they are writeable. Note, the mode appears to be ignored
regardless, but they should at least look sane.
Link: http://lkml.kernel.org/r/1409725869-4501-1-git-send-email-linx.z.chen@intel.com
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Chen LinX <linx.z.chen@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
config TRACING_SUPPORT has an exception for PPC32, because PPC32
didn't have irqflags tracing support.
But that hasn't been true since commit 5d38902c48 ("powerpc: Add
irqtrace support for 32-bit powerpc") (Jun 2009).
So remove the exception for PPC32 and the comment.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Tracepoints should not give warning on OOM failures
- Use special field for function pointer in trace event
- Fix igrab issues in uprobes
- Fixes to the new histogram triggers
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWuoYdBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qtFnAP9X4+AVDQH0VfsMLSc9D+rK6WmcRIhv
q8J2gNPv3anM+AD/SFXWGO4ihN+0KDw/TqmJxESNEybq47vTZ/s5lM6A4gQ=
=fQbj
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fixes in tracing:
- Tracepoints should not give warning on OOM failures
- Use special field for function pointer in trace event
- Fix igrab issues in uprobes
- Fixes to the new histogram triggers"
* tag 'trace-v4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoint: Do not warn on ENOMEM
tracing: Add field modifier parsing hist error for hist triggers
tracing: Add field parsing hist error for hist triggers
tracing: Restore proper field flag printing when displaying triggers
tracing: initcall: Ordered comparison of function pointers
tracing: Remove igrab() iput() call from uprobes.c
tracing: Fix bad use of igrab in trace_uprobe.c
Currently, the bpf_current_task_under_cgroup helper has a check where if
the BPF program is running in_interrupt(), it will return -EINVAL. This
prevents the helper to be used in many useful scenarios, particularly
BPF programs attached to Perf Events.
This commit removes the check. Tested a few NMI (Perf Event) and some
softirq context, the helper returns the correct result.
Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull timer fixes from Thomas Gleixner:
"Two fixes from the timer departement:
- Fix a long standing issue in the NOHZ tick code which causes RB
tree corruption, delayed timers and other malfunctions. The cause
for this is code which modifies the expiry time of an enqueued
hrtimer.
- Revert the CLOCK_MONOTONIC/CLOCK_BOOTTIME unification due to
regression reports. Seems userspace _is_ relying on the documented
behaviour despite our hope that it wont"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME
tick/sched: Do not mess with an enqueued hrtimer
Currently, stackmap and bpf_get_stackid helper are provided
for bpf program to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from user perspective.
This patch implements a new helper, bpf_get_stack, will
send stack traces directly to bpf program. The bpf program
is able to see all stack traces, and then can do in-kernel
processing or send stack traces to user space through
shared map or bpf_perf_event_output.
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If the user specifies an invalid field modifier for a hist trigger,
the current code correctly flags that as an error, but doesn't tell
the user what happened.
Fix this by invoking hist_err() with an appropriate message when
invalid modifiers are specified.
Before:
# echo 'hist:keys=pid:ts0=common_timestamp.junkusecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
-su: echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist
After:
# echo 'hist:keys=pid:ts0=common_timestamp.junkusecs' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
-su: echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist
ERROR: Invalid field modifier: junkusecs
Last command: keys=pid:ts0=common_timestamp.junkusecs
Link: http://lkml.kernel.org/r/b043c59fa79acd06a5f14a1d44dee9e5a3cd1248.1524790601.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If the user specifies a nonexistent field for a hist trigger, the
current code correctly flags that as an error, but doesn't tell the
user what happened.
Fix this by invoking hist_err() with an appropriate message when
nonexistent fields are specified.
Before:
# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
-su: echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
After:
# echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
-su: echo: write error: Invalid argument
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
ERROR: Couldn't find field: pid
Last command: keys=pid:ts0=common_timestamp.usecs
Link: http://lkml.kernel.org/r/fdc8746969d16906120f162b99dd71c741e0b62c.1524790601.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The flag-printing code used when displaying hist triggers somehow got
dropped during refactoring of the inter-event patchset. This restores
it.
Below are a couple examples - in the first case, .usecs wasn't being
displayed properly for common_timestamps and the second illustrates
the same for other flags such as .execname.
Before:
# echo 'hist:key=common_pid.execname:val=count:sort=count' > /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
hist:keys=common_pid:vals=hitcount,count:sort=count:size=2048 [active]
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
hist:keys=pid:vals=hitcount:ts0=common_timestamp:sort=hitcount:size=2048:clock=global if comm=="cyclictest" [active]
After:
# echo 'hist:key=common_pid.execname:val=count:sort=count' > /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
hist:keys=common_pid.execname:vals=hitcount,count:sort=count:size=2048 [active]
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global if comm=="cyclictest" [active]
Link: http://lkml.kernel.org/r/492bab42ff21806600af98a8ea901af10efbee0c.1524790601.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Add workqueue forward declaration (for new work, but a nice clean up)
- seftest fixes for the new histogram code
- Print output fix for hwlat tracer
- Fix missing system call events - due to change in x86 syscall naming
- Fix kprobe address being used by perf being hashed
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWuIMShQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qkrdAQDRrgIGcm4pRGrvPiGhp4FeQKUx3woM
LY10qMYo3St7zwEAn5oor/e/7KQaQSdKQ7QkL690QU2bTO6FXz4VwE1OcgM=
=OHJk
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Add workqueue forward declaration (for new work, but a nice clean up)
- seftest fixes for the new histogram code
- Print output fix for hwlat tracer
- Fix missing system call events - due to change in x86 syscall naming
- Fix kprobe address being used by perf being hashed
* tag 'trace-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix missing tab for hwlat_detector print format
selftests: ftrace: Add a testcase for multiple actions on trigger
selftests: ftrace: Fix trigger extended error testcase
kprobes: Fix random address output of blacklist file
tracing: Fix kernel crash while using empty filter with perf
tracing/x86: Update syscall trace events to handle new prefixed syscall func names
tracing: Add missing forward declaration
As Miklos reported and suggested:
This pattern repeats two times in trace_uprobe.c and in
kernel/events/core.c as well:
ret = kern_path(filename, LOOKUP_FOLLOW, &path);
if (ret)
goto fail_address_parse;
inode = igrab(d_inode(path.dentry));
path_put(&path);
And it's wrong. You can only hold a reference to the inode if you
have an active ref to the superblock as well (which is normally
through path.mnt) or holding s_umount.
This way unmounting the containing filesystem while the tracepoint is
active will give you the "VFS: Busy inodes after unmount..." message
and a crash when the inode is finally put.
Solution: store path instead of inode.
This patch fixes two instances in trace_uprobe.c. struct path is added to
struct trace_uprobe to keep the inode and containing mount point
referenced.
Link: http://lkml.kernel.org/r/20180423172135.4050588-1-songliubraving@fb.com
Fixes: f3f096cfed ("tracing: Provide trace events interface for uprobes")
Fixes: 33ea4b2427 ("perf/core: Implement the 'perf_uprobe' PMU")
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Howard McLauchlan <hmclauchlan@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Revert commits
92af4dcb4e ("tracing: Unify the "boot" and "mono" tracing clocks")
127bfa5f43 ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior")
7250a4047a ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior")
d6c7270e91 ("timekeeping: Remove boot time specific code")
f2d6fdbfd2 ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior")
d6ed449afd ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock")
72199320d4 ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock")
As stated in the pull request for the unification of CLOCK_MONOTONIC and
CLOCK_BOOTTIME, it was clear that we might have to revert the change.
As reported by several folks systemd and other applications rely on the
documented behaviour of CLOCK_MONOTONIC on Linux and break with the above
changes. After resume daemons time out and other timeout related issues are
observed. Rafael compiled this list:
* systemd kills daemons on resume, after >WatchdogSec seconds
of suspending (Genki Sky). [Verified that that's because systemd uses
CLOCK_MONOTONIC and expects it to not include the suspend time.]
* systemd-journald misbehaves after resume:
systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal
corrupted or uncleanly shut down, renaming and replacing.
(Mike Galbraith).
* NetworkManager reports "networking disabled" and networking is broken
after resume 50% of the time (Pavel). [May be because of systemd.]
* MATE desktop dims the display and starts the screensaver right after
system resume (Pavel).
* Full system hang during resume (me). [May be due to systemd or NM or both.]
That happens on debian and open suse systems.
It's sad, that these problems were neither catched in -next nor by those
folks who expressed interest in this change.
Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Reported-by: Genki Sky <sky@genki.is>,
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
It's been missing for a while but no one is touching that up. Fix it.
Link: http://lkml.kernel.org/r/20180315060639.9578-1-peterx@redhat.com
CC: Ingo Molnar <mingo@kernel.org>
Cc:stable@vger.kernel.org
Fixes: 7b2c862501 ("tracing: Add NMI tracing in hwlat detector")
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Kernel is crashing when user tries to record 'ftrace:function' event
with empty filter:
# perf record -e ftrace:function --filter="" ls
# dmesg
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Oops: 0000 [#1] SMP PTI
...
RIP: 0010:ftrace_profile_set_filter+0x14b/0x2d0
RSP: 0018:ffffa4a7c0da7d20 EFLAGS: 00010246
RAX: ffffa4a7c0da7d64 RBX: 0000000000000000 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff8c48ffc968f0
...
Call Trace:
_perf_ioctl+0x54a/0x6b0
? rcu_all_qs+0x5/0x30
...
After patch:
# perf record -e ftrace:function --filter="" ls
failed to set filter "" on event ftrace:function with 22 (Invalid argument)
Also, if user tries to echo "" > filter, it used to throw an error.
This behavior got changed by commit 80765597bc ("tracing: Rewrite
filter logic to be simpler and faster"). This patch restores the
behavior as a side effect:
Before patch:
# echo "" > filter
#
After patch:
# echo "" > filter
bash: echo: write error: Invalid argument
#
Link: http://lkml.kernel.org/r/20180420150758.19787-1-ravi.bangoria@linux.ibm.com
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull networking fixes from David Miller:
1) Fix rtnl deadlock in ipvs, from Julian Anastasov.
2) s390 qeth fixes from Julian Wiedmann (control IO completion stalls,
bad MAC address update sequence, request side races on command IO
timeouts).
3) Handle seq_file overflow properly in l2tp, from Guillaume Nault.
4) Fix VLAN priority mappings in cpsw driver, from Ivan Khoronzhuk.
5) Packet scheduler ife action fixes (malformed TLV lengths, etc.) from
Alexander Aring.
6) Fix out of bounds access in tcp md5 option parser, from Jann Horn.
7) Missing netlink attribute policies in rtm_ipv6_policy table, from
Eric Dumazet.
8) Missing socket address length checks in l2tp and pppoe connect, from
Guillaume Nault.
9) Fix netconsole over team and bonding, from Xin Long.
10) Fix race with AF_PACKET socket state bitfields, from Willem de
Bruijn.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (51 commits)
ice: Fix insufficient memory issue in ice_aq_manage_mac_read
sfc: ARFS filter IDs
net: ethtool: Add missing kernel doc for FEC parameters
packet: fix bitfield update race
ice: Do not check INTEVENT bit for OICR interrupts
ice: Fix incorrect comment for action type
ice: Fix initialization for num_nodes_added
igb: Fix the transmission mode of queue 0 for Qav mode
ixgbevf: ensure xdp_ring resources are free'd on error exit
team: fix netconsole setup over team
amd-xgbe: Only use the SFP supported transceiver signals
amd-xgbe: Improve KR auto-negotiation and training
amd-xgbe: Add pre/post auto-negotiation phy hooks
pppoe: check sockaddr length in pppoe_connect()
l2tp: check sockaddr length in pppol2tp_connect()
net: phy: marvell: clear wol event before setting it
ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
tcp: don't read out-of-bounds opsize
ibmvnic: Clean actual number of RX or TX pools
...
Daniel Borkmann says:
====================
pull-request: bpf 2018-04-21
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a deadlock between mm->mmap_sem and bpf_event_mutex when
one task is detaching a BPF prog via perf_event_detach_bpf_prog()
and another one dumping through bpf_prog_array_copy_info(). For
the latter we move the copy_to_user() out of the bpf_event_mutex
lock to fix it, from Yonghong.
2) Fix test_sock and test_sock_addr.sh failures. The former was
hitting rlimit issues and the latter required ping to specify
the address family, from Yonghong.
3) Remove a dead check in sockmap's sock_map_alloc(), from Jann.
4) Add generated files to BPF kselftests gitignore that were previously
missed, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This warning message is not very helpful, as the return value should
already show information about the error. Also, this message will
spam dmesg if the user space does testing in a loop, like:
for x in {0..5}
do
echo p:xx xx+$x >> /sys/kernel/debug/tracing/kprobe_events
done
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/20180413185513.3626052-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull more perf updates from Thomas Gleixner:
"A rather large set of perf updates:
Kernel:
- Fix various initialization issues
- Prevent creating [ku]probes for not CAP_SYS_ADMIN users
Tooling:
- Show only failing syscalls with 'perf trace --failure' (Arnaldo
Carvalho de Melo)
e.g: See what 'openat' syscalls are failing:
# perf trace --failure -e openat
762.323 ( 0.007 ms): VideoCapture/4566 openat(dfd: CWD, filename: /dev/video2) = -1 ENOENT No such file or directory
<SNIP N /dev/videoN open attempts... sigh, where is that improvised camera lid?!? >
790.228 ( 0.008 ms): VideoCapture/4566 openat(dfd: CWD, filename: /dev/video63) = -1 ENOENT No such file or directory
^C#
- Show information about the event (freq, nr_samples, total
period/nr_events) in the annotate --tui and --stdio2 'perf
annotate' output, similar to the first line in the 'perf report
--tui', but just for the samples for a the annotated symbol
(Arnaldo Carvalho de Melo)
- Introduce 'perf version --build-options' to show what features were
linked, aliased as well as a shorter 'perf -vv' (Jin Yao)
- Add a "dso_size" sort order (Kim Phillips)
- Remove redundant ')' in the tracepoint output in 'perf trace'
(Changbin Du)
- Synchronize x86's cpufeatures.h, no effect on toolss (Arnaldo
Carvalho de Melo)
- Show group details on the title line in the annotate browser and
'perf annotate --stdio2' output, so that the per-event columns can
have headers (Arnaldo Carvalho de Melo)
- Fixup vertical line separating metrics from instructions and
cleaning unused lines at the bottom, both in the annotate TUI
browser (Arnaldo Carvalho de Melo)
- Remove duplicated 'samples' in lost samples warning in
'perf report' (Arnaldo Carvalho de Melo)
- Synchronize i915_drm.h, silencing the perf build process,
automagically adding support for the new DRM_I915_QUERY ioctl
(Arnaldo Carvalho de Melo)
- Make auxtrace_queues__add_buffer() allocate struct buffer, from a
patchkit already applied (Adrian Hunter)
- Fix the --stdio2/TUI annotate output to include group details, be
it for a recorded '{a,b,f}' explicit event group or when forcing
group display using 'perf report --group' for a set of events not
recorded as a group (Arnaldo Carvalho de Melo)
- Fix display artifacts in the ui browser (base class for the
annotate and main report/top TUI browser) related to the extra
title lines work (Arnaldo Carvalho de Melo)
- perf auxtrace refactorings, leftovers from a previously partially
processed patchset (Adrian Hunter)
- Fix the builtin clang build (Sandipan Das, Arnaldo Carvalho de
Melo)
- Synchronize i915_drm.h, silencing a perf build warning and in the
process automagically adding support for a new ioctl command
(Arnaldo Carvalho de Melo)
- Fix a strncpy issue in uprobe tracing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
perf/core: Need CAP_SYS_ADMIN to create k/uprobe with perf_event_open()
tracing/uprobe_event: Fix strncpy corner case
perf/core: Fix perf_uprobe_init()
perf/core: Fix perf_kprobe_init()
perf/core: Fix use-after-free in uprobe_perf_close()
perf tests clang: Fix function name for clang IR test
perf clang: Add support for recent clang versions
perf tools: Fix perf builds with clang support
perf tools: No need to include namespaces.h in util.h
perf hists browser: Remove leftover from row returned from refresh
perf hists browser: Show extra_title_lines in the 'D' debug hotkey
perf auxtrace: Make auxtrace_queues__add_buffer() do CPU filtering
tools headers uapi: Synchronize i915_drm.h
perf report: Remove duplicated 'samples' in lost samples warning
perf ui browser: Fixup cleaning unused lines at the bottom
perf annotate browser: Fixup vertical line separating metrics from instructions
perf annotate: Show group details on the title line
perf auxtrace: Make auxtrace_queues__add_buffer() allocate struct buffer
perf/x86/intel: Move regs->flags EXACT bit init
perf trace: Remove redundant ')'
...
- Replace open coded "ARRAY_SIZE()" with macro
- Updates to uprobes
- Bug fix for perf event filter on error path
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWs+2YxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsRUAP9okqGRR/01bBLqNKiJ2j5YeBc9YlWl
R2rC0xbwVBLgJQEAwpE5jxahqKutbgrBDalDeCmXmeTOhSbGRJaBxXqwzwE=
=ZAuQ
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"A few clean ups and bug fixes:
- replace open coded "ARRAY_SIZE()" with macro
- updates to uprobes
- bug fix for perf event filter on error path"
* tag 'trace-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Enforce passing in filter=NULL to create_filter()
trace_uprobe: Simplify probes_seq_show()
trace_uprobe: Use %lx to display offset
tracing/uprobe: Add support for overlayfs
tracing: Use ARRAY_SIZE() macro instead of open coding it
There's some inconsistency with what to set the output parameter filterp
when passing to create_filter(..., struct event_filter **filterp).
Whatever filterp points to, should be NULL when calling this function. The
create_filter() calls create_filter_start() with a pointer to a local
"filter" variable that is set to NULL. The create_filter_start() has a
WARN_ON() if the passed in pointer isn't pointing to a value set to NULL.
Ideally, create_filter() should pass the filterp variable it received to
create_filter_start() and not hide it as with a local variable, this allowed
create_filter() to fail, and not update the passed in filter, and the caller
of create_filter() then tried to free filter, which was never initialized to
anything, causing memory corruption.
Link: http://lkml.kernel.org/r/00000000000032a0c30569916870@google.com
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: syzbot+dadcc936587643d7f568@syzkaller.appspotmail.com
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Simplify probes_seq_show() function. No change in output
before and after patch.
Link: http://lkml.kernel.org/r/20180315082756.9050-2-ravi.bangoria@linux.vnet.ibm.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
tu->offset is unsigned long, not a pointer, thus %lx should
be used to print it, not the %px.
Link: http://lkml.kernel.org/r/20180315082756.9050-1-ravi.bangoria@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 0e4d819d08 ("trace_uprobe: Display correct offset in uprobe_events")
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
uprobes cannot successfully attach to binaries located in a directory
mounted with overlayfs.
To verify, create directories for mounting overlayfs
(upper,lower,work,merge), move some binary into merge/ and use readelf
to obtain some known instruction of the binary. I used /bin/true and the
entry instruction(0x13b0):
$ mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merge
$ cd /sys/kernel/debug/tracing
$ echo 'p:true_entry PATH_TO_MERGE/merge/true:0x13b0' > uprobe_events
$ echo 1 > events/uprobes/true_entry/enable
This returns 'bash: echo: write error: Input/output error' and dmesg
tells us 'event trace: Could not enable event true_entry'
This change makes create_trace_uprobe() look for the real inode of a
dentry. In the case of normal filesystems, this simplifies to just
returning the inode. In the case of overlayfs(and similar fs) we will
obtain the underlying dentry and corresponding inode, upon which uprobes
can successfully register.
Running the example above with the patch applied, we can see that the
uprobe is enabled and will output to trace as expected.
Link: http://lkml.kernel.org/r/20180410231030.2720-1-hmclauchlan@fb.com
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It is useless to re-invent the ARRAY_SIZE macro so let's use it instead
of DATA_CNT.
Found with Coccinelle with the following semantic patch:
@r depends on (org || report)@
type T;
T[] E;
position p;
@@
(
(sizeof(E)@p /sizeof(*E))
|
(sizeof(E)@p /sizeof(E[...]))
|
(sizeof(E)@p /sizeof(T))
)
Link: http://lkml.kernel.org/r/20171016012250.26453-1-jeremy.lefaure@lse.epita.fr
Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
[ Removed useless include of kernel.h ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
======================================================
WARNING: possible circular locking dependency detected
4.16.0-rc7+ #3 Not tainted
------------------------------------------------------
syz-executor7/24531 is trying to acquire lock:
(bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
but task is already holding lock:
(&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&mm->mmap_sem){++++}:
__might_fault+0x13a/0x1d0 mm/memory.c:4571
_copy_to_user+0x2c/0xc0 lib/usercopy.c:25
copy_to_user include/linux/uaccess.h:155 [inline]
bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
_perf_ioctl kernel/events/core.c:4750 [inline]
perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
SYSC_ioctl fs/ioctl.c:701 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
-> #0 (bpf_event_mutex){+.+.}:
lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
__mutex_lock_common kernel/locking/mutex.c:756 [inline]
__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
_free_event+0xbdb/0x10f0 kernel/events/core.c:4116
put_event+0x24/0x30 kernel/events/core.c:4204
perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
remove_vma+0xb4/0x1b0 mm/mmap.c:172
remove_vma_list mm/mmap.c:2490 [inline]
do_munmap+0x82a/0xdf0 mm/mmap.c:2731
mmap_region+0x59e/0x15a0 mm/mmap.c:1646
do_mmap+0x6c0/0xe00 mm/mmap.c:1483
do_mmap_pgoff include/linux/mm.h:2223 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
lock(&mm->mmap_sem);
lock(bpf_event_mutex);
*** DEADLOCK ***
======================================================
The bug is introduced by Commit f371b304f1 ("bpf/tracing: allow
user space to query prog array on the same tp") where copy_to_user,
which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then subsequent
perf_event_detach_bpf_prog needs bpf_event_mutex lock.
Such a senario caused a deadlock.
As suggested by Daniel, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.
Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
- Tom Zanussi's extended histogram work
This adds the synthetic events to have histograms from multiple event data
Adds triggers "onmatch" and "onmax" to call the synthetic events
Several updates to the histogram code from this
- Allow way to nest ring buffer calls in the same context
- Allow absolute time stamps in ring buffer
- Rewrite of filter code parsing based on Al Viro's suggestions
- Setting of trace_clock to global if TSC is unstable (on boot)
- Better OOM handling when allocating large ring buffers
- Added initcall tracepoints (consolidated initcall_debug code with them)
And other various fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iQHIBAABCgAyFiEEPm6V/WuN2kyArTUe1a05Y9njSUkFAlrLoCAUHHJvc3RlZHRA
Z29vZG1pcy5vcmcACgkQ1a05Y9njSUks/QwAn/ky8WgfjcRdjKmBYuEwDedvm9iI
V9G5kpv5JMw5dLz4l1pS3tA3M9Lyuc5z3Shw92FTy36vdU1wxEjQgHa7viB1xk9x
KsiTyNjTsgrRd7GVHMy/8Be2RRiTRLaXKAsLCoj/c7QWzagV1P8XWlWK5mojYkh/
DrSXyg9Avkp30+sU1bvcLWnmmZUFqMxs+bWipD9uFc98USMMyeP25nrnhrj0gDTg
Q93cjXUuyVRC4lJ2YTW0GCSKhMKEw5f/ltEOT1hwScqYkCJj1EubKqS53R/9h21z
IPUrYcqLnMRu0j2ejR+UAy5Vsy3gJUrPMQb0F6hlu1DwbMd0d/9SGh1c+Sm+zorh
yftWTdCZsYrXkaOuB6V5M30X+KBwbWO0Xc9VCvgJ/IU5vMlgLSt5itTWbT/Fmfhb
ll5/RXP7zhSXRv5sdl/BP3/4dd6F8jpyKyaR2Rk2+XjBOGIq5mvqNGr4Vj9AzxW8
E0nvq7l7e0dbxZNM42gEm3cht1VUg7Zz0Y0+
=91oN
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New features:
- Tom Zanussi's extended histogram work.
This adds the synthetic events to have histograms from multiple
event data Adds triggers "onmatch" and "onmax" to call the
synthetic events Several updates to the histogram code from this
- Allow way to nest ring buffer calls in the same context
- Allow absolute time stamps in ring buffer
- Rewrite of filter code parsing based on Al Viro's suggestions
- Setting of trace_clock to global if TSC is unstable (on boot)
- Better OOM handling when allocating large ring buffers
- Added initcall tracepoints (consolidated initcall_debug code with
them)
And other various fixes and clean ups"
* tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
init: Have initcall_debug still work without CONFIG_TRACEPOINTS
init, tracing: Have printk come through the trace events for initcall_debug
init, tracing: instrument security and console initcall trace events
init, tracing: Add initcall trace events
tracing: Add rcu dereference annotation for test func that touches filter->prog
tracing: Add rcu dereference annotation for filter->prog
tracing: Fixup logic inversion on setting trace_global_clock defaults
tracing: Hide global trace clock from lockdep
ring-buffer: Add set/clear_current_oom_origin() during allocations
ring-buffer: Check if memory is available before allocation
lockdep: Add print_irqtrace_events() to __warn
vsprintf: Do not preprocess non-dereferenced pointers for bprintf (%px and %pK)
tracing: Uninitialized variable in create_tracing_map_fields()
tracing: Make sure variable string fields are NULL-terminated
tracing: Add action comparisons when testing matching hist triggers
tracing: Don't add flag strings when displaying variable references
tracing: Fix display of hist trigger expressions containing timestamps
ftrace: Drop a VLA in module_exists()
tracing: Mention trace_clock=global when warning about unstable clocks
tracing: Default to using trace_global_clock if sched_clock is unstable
...
Fix string fetch function to terminate with NUL.
It is OK to drop the rest of string.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: security@kernel.org
Cc: 范龙飞 <long7573@126.com>
Fixes: 5baaa59ef0 ("tracing/probes: Implement 'memory' fetch method for uprobes")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Similarly to the uprobe PMU fix in perf_kprobe_init(), fix error
handling in perf_uprobe_init() as well.
Reported-by: 范龙飞 <long7573@126.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e12f03d703 ("perf/core: Implement the 'perf_kprobe' PMU")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A boot up test function update_pred_fn() dereferences filter->prog without
the proper rcu annotation.
To do this, we must also take the event_mutex first. Normally, this isn't
needed because this test function can not race with other use cases that
touch the event filters (it is disabled if any events are enabled).
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
ftrace_function_set_filter() referenences filter->prog without annotation
and sparse complains about it. It needs a rcu_dereference_protected()
wrapper.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In commit 932066a15335 ("tracing: Default to using trace_global_clock if
sched_clock is unstable"), the logic for deciding to override the
default clock if unstable was reversed from the earlier posting. I was
trying to reduce the width of the message by using an early return
rather than a if-block, but reverted back to using the if-block and
accidentally left the predicate inverted.
Link: http://lkml.kernel.org/r/20180404212450.26646-1-chris@chris-wilson.co.uk
Fixes: 932066a15335 ("tracing: Default to using trace_global_clock if sched_clock is unstable")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Function tracing can trace in NMIs and such. If the TSC is determined
to be unstable, the tracing clock will switch to the global clock on
boot up, unless "trace_clock" is specified on the kernel command line.
The global clock disables interrupts to access sched_clock_cpu(), and in
doing so can be done within lockdep internals (because of function
tracing and NMIs). This can trigger false lockdep splats.
The trace_clock_global() is special, best not to trace the irq logic
within it.
Link: http://lkml.kernel.org/r/20180404145015.77bde42d@gandalf.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As si_mem_available() can say there is enough memory even though the memory
available is not useable by the ring buffer, it is best to not kill innocent
applications because the ring buffer is taking up all the memory while it is
trying to allocate a great deal of memory.
If the allocator is user space (because kernel threads can also increase the
size of the kernel ring buffer on boot up), then after si_mem_available()
says there is enough memory, set the OOM killer to kill the current task if
an OOM triggers during the allocation.
Link: http://lkml.kernel.org/r/20180404062340.GD6312@dhcp22.suse.cz
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The ring buffer is made up of a link list of pages. When making the ring
buffer bigger, it will allocate all the pages it needs before adding to the
ring buffer, and if it fails, it frees them and returns an error. This makes
increasing the ring buffer size an all or nothing action. When this was
first created, the pages were allocated with "NORETRY". This was to not
cause any Out-Of-Memory (OOM) actions from allocating the ring buffer. But
NORETRY was too strict, as the ring buffer would fail to expand even when
there's memory available, but was taken up in the page cache.
Commit 848618857d ("tracing/ring_buffer: Try harder to allocate") changed
the allocating from NORETRY to RETRY_MAYFAIL. The RETRY_MAYFAIL would
allocate from the page cache, but if there was no memory available, it would
simple fail the allocation and not trigger an OOM.
This worked fine, but had one problem. As the ring buffer would allocate one
page at a time, it could take up all memory in the system before it failed
to allocate and free that memory. If the allocation is happening and the
ring buffer allocates all memory and then tries to take more than available,
its allocation will not trigger an OOM, but if there's any allocation that
happens someplace else, that could trigger an OOM, even though once the ring
buffer's allocation fails, it would free up all the previous memory it tried
to allocate, and allow other memory allocations to succeed.
Commit d02bd27bd3 ("mm/page_alloc.c: calculate 'available' memory in a
separate function") separated out si_mem_availble() as a separate function
that could be used to see how much memory is available in the system. Using
this function to make sure that the ring buffer could be allocated before it
tries to allocate pages we can avoid allocating all memory in the system and
making it vulnerable to OOMs if other allocations are taking place.
Link: http://lkml.kernel.org/r/1522320104-6573-1-git-send-email-zhaoyang.huang@spreadtrum.com
CC: stable@vger.kernel.org
Cc: linux-mm@kvack.org
Fixes: 848618857d ("tracing/ring_buffer: Try harder to allocate")
Requires: d02bd27bd3 ("mm/page_alloc.c: calculate 'available' memory in a separate function")
Reported-by: Zhaoyang Huang <huangzhaoyang@gmail.com>
Tested-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Smatch complains that idx can be used uninitialized when we check if
(idx < 0). It has to be the first iteration through the loop and the
HIST_FIELD_FL_STACKTRACE bit has to be clear and the HIST_FIELD_FL_VAR
bit has to be set to reach the bug.
Link: http://lkml.kernel.org/r/20180328114815.GC29050@mwanda
Fixes: 30350d65ac ("tracing: Add variable support to hist triggers")
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The strncpy() currently being used for variable string fields can
result in a lack of termination if the string length is equal to the
field size. Use the safer strscpy() instead, which will guarantee
termination.
Link: http://lkml.kernel.org/r/fb97c1e518fb358c12a4057d7445ba2c46956cd7.1522256721.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Actions also need to be considered when checking for matching triggers
- triggers differing only by action should be allowed, but currently
aren't because the matching check ignores the action and erroneously
returns -EEXIST.
Add and call an actions_match() function to address that.
Here's an example using onmatch() actions. The first -EEXIST shouldn't
occur because the onmatch() is different in the second wakeup_latency()
param. The second -EEXIST shouldn't occur because it's a different
action (in this case, it doesn't have an action, so shouldn't be seen
as being the same and therefore rejected).
In the after case, both are correctly accepted (and trying to add one of
them again returns -EEXIST as it should).
before:
# echo 'wakeup_latency u64 lat; pid_t pid' >> /sys/kernel/debug/tracing/synthetic_events
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0 if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid:onmatch(sched.sched_wakeup).wakeup_latency(sched.sched_switch.$wakeup_lat,next_pid) if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid:onmatch(sched.sched_wakeup).wakeup_latency(sched.sched_switch.$wakeup_lat,prev_pid) if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
-su: echo: write error: File exists
# echo 'hist:keys=next_pid if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
-su: echo: write error: File exists
after:
# echo 'wakeup_latency u64 lat; pid_t pid' >> /sys/kernel/debug/tracing/synthetic_events
# echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0 if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid:onmatch(sched.sched_wakeup).wakeup_latency(sched.sched_switch.$wakeup_lat,next_pid) if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid:onmatch(sched.sched_wakeup).wakeup_latency(sched.sched_switch.$wakeup_lat,prev_pid) if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
# echo 'hist:keys=next_pid if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
Link: http://lkml.kernel.org/r/a7fd668b87ec10736c8f016ac4279c8480d50c2b.1522256721.git.tom.zanussi@linux.intel.com
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Variable references should never have flags appended when displayed -
prevent that from happening.
Before:
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0.usecs:...
After:
hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:...
Link: http://lkml.kernel.org/r/913318a5610ef6b24af2522575f671fa6ee19b6b.1522256721.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When displaying hist triggers, variable references that have the
timestamp field flag set are erroneously displayed as common_timestamp
rather than the variable reference. Additionally, timestamp
expressions are displayed in the same way. Fix this by forcing the
timestamp flag handling to follow variable reference and expression
handling.
Before:
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs:...
After:
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0.usecs:...
Link: http://lkml.kernel.org/r/92746b06be67499c2a6217bd55395b350ad18fad.1522256721.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>